Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism
@ 2024-03-22 12:22 Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
                   ` (9 more replies)
  0 siblings, 10 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 1532 bytes --]

Hi,

this patch series contains various improvements for the `binsearch()`
mechanism used by the reftable library:

  - Callsites are refactored to make it more obvious what exactly they
    are doing. This mostly involves improved names for callback
    functions.

  - Error handling is improved such that binsearch() knows to abort when
    the callback returns a negative value.

  - Error handling for binary searching over restart counters is fixed
    so that we check for errors at the correct point in time.

  - Decoding of record keys is refactored when searching over restart
    counters to not require allocations anymore.

Thanks!

Patrick

Patrick Steinhardt (7):
  reftable/basics: fix return type of `binsearch()` to be `size_t`
  reftable/basics: improve `binsearch()` test
  reftable/refname: refactor binary search over refnames
  reftable/block: refactor binary search over restart points
  reftable/block: fix error handling when searching restart points
  reftable/record: extract function to decode key lengths
  reftable/block: avoid decoding keys when searching restart points

 reftable/basics.c      |   7 ++-
 reftable/basics.h      |   7 +--
 reftable/basics_test.c |  55 +++++++++++---------
 reftable/block.c       | 114 ++++++++++++++++++++++++++++++-----------
 reftable/record.c      |  34 ++++++++----
 reftable/record.h      |   6 +++
 reftable/refname.c     |  53 +++++++++----------
 7 files changed, 179 insertions(+), 97 deletions(-)

-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t`
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 4576 bytes --]

The `binsearch()` function can be used to find the first element for
which a callback functions returns a truish value. But while the array
size is of type `size_t`, the function in fact returns an `int` that is
supposed to index into that array.

Fix the function signature to return a `size_t`. This conversion does
not change any semantics given that the function would only ever return
a value in the range `[0, sz]` anyway.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c      |  2 +-
 reftable/basics.h      |  2 +-
 reftable/basics_test.c |  6 +++---
 reftable/block.c       |  3 ++-
 reftable/refname.c     | 17 +++++++----------
 5 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 0785aff941..2c5f34b39e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -27,7 +27,7 @@ void put_be16(uint8_t *out, uint16_t i)
 	out[1] = (uint8_t)(i & 0xff);
 }
 
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 {
 	size_t lo = 0;
 	size_t hi = sz;
diff --git a/reftable/basics.h b/reftable/basics.h
index 91f3533efe..2672520e76 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -28,7 +28,7 @@ void put_be16(uint8_t *out, uint16_t i);
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
  */
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
 
 /*
  * Frees a NULL terminated array of malloced strings. The array itself is also
diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index 1fcd229725..dc1c87c5df 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -34,15 +34,15 @@ static void test_binsearch(void)
 
 	int i = 0;
 	for (i = 1; i < 11; i++) {
-		int res;
+		size_t res;
+
 		args.key = i;
 		res = binsearch(sz, &binsearch_func, &args);
 
 		if (res < sz) {
 			EXPECT(args.key < arr[res]);
-			if (res > 0) {
+			if (res > 0)
 				EXPECT(args.key >= arr[res - 1]);
-			}
 		} else {
 			EXPECT(args.key == 10 || args.key == 11);
 		}
diff --git a/reftable/block.c b/reftable/block.c
index e2a2cee58d..422885bddb 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -382,7 +382,8 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
-	int err = 0, i;
+	int err = 0;
+	size_t i;
 
 	if (args.error) {
 		err = REFTABLE_FORMAT_ERROR;
diff --git a/reftable/refname.c b/reftable/refname.c
index 7570e4acf9..64eba1b886 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -33,10 +33,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->add,
 			.want = name,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
-		if (idx < mod->add_len && !strcmp(mod->add[idx], name)) {
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
-		}
 	}
 
 	if (mod->del_len > 0) {
@@ -44,10 +43,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->del,
 			.want = name,
 		};
-		int idx = binsearch(mod->del_len, find_name, &arg);
-		if (idx < mod->del_len && !strcmp(mod->del[idx], name)) {
+		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
-		}
 	}
 
 	err = reftable_table_read_ref(&mod->tab, name, &ref);
@@ -77,7 +75,7 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			.names = mod->add,
 			.want = prefix,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -96,11 +94,10 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 				.names = mod->del,
 				.want = ref.refname,
 			};
-			int idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, find_name, &arg);
 			if (idx < mod->del_len &&
-			    !strcmp(ref.refname, mod->del[idx])) {
+			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-			}
 		}
 
 		if (strncmp(ref.refname, prefix, strlen(prefix))) {
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 2/7] reftable/basics: improve `binsearch()` test
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-22 18:46   ` Justin Tobler
  2024-03-22 12:22 ` [PATCH 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 2524 bytes --]

The `binsearch()` test is somewhat weird in that it doesn't explicitly
spell out its expectations. Instead it does so in a rather ad-hoc way
with some hard-to-understand computations.

Refactor the test to spell out the needle as well as expected index for
all testcases. This refactoring highlights that the `binsearch_func()`
is written somewhat weirdly to find the first integer smaller than the
needle, not smaller or equal to it. Adjust the function accordingly.

While at it, rename the callback function to better convey its meaning.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics_test.c | 55 ++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index dc1c87c5df..85c4d1621c 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -12,40 +12,47 @@ license that can be found in the LICENSE file or at
 #include "test_framework.h"
 #include "reftable-tests.h"
 
-struct binsearch_args {
-	int key;
-	int *arr;
+struct integer_needle_lesseq {
+	int needle;
+	int *haystack;
 };
 
-static int binsearch_func(size_t i, void *void_args)
+static int integer_needle_lesseq(size_t i, void *_args)
 {
-	struct binsearch_args *args = void_args;
-
-	return args->key < args->arr[i];
+	struct integer_needle_lesseq *args = _args;
+	return args->needle <= args->haystack[i];
 }
 
 static void test_binsearch(void)
 {
-	int arr[] = { 2, 4, 6, 8, 10 };
-	size_t sz = ARRAY_SIZE(arr);
-	struct binsearch_args args = {
-		.arr = arr,
+	int haystack[] = { 2, 4, 6, 8, 10 };
+	struct {
+		int needle;
+		size_t expected_idx;
+	} testcases[] = {
+		{-9000, 0},
+		{-1, 0},
+		{0, 0},
+		{2, 0},
+		{3, 1},
+		{4, 1},
+		{7, 3},
+		{9, 4},
+		{10, 4},
+		{11, 5},
+		{9000, 5},
 	};
+	size_t i = 0;
 
-	int i = 0;
-	for (i = 1; i < 11; i++) {
-		size_t res;
-
-		args.key = i;
-		res = binsearch(sz, &binsearch_func, &args);
+	for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+		struct integer_needle_lesseq args = {
+			.haystack = haystack,
+			.needle = testcases[i].needle,
+		};
+		size_t idx;
 
-		if (res < sz) {
-			EXPECT(args.key < arr[res]);
-			if (res > 0)
-				EXPECT(args.key >= arr[res - 1]);
-		} else {
-			EXPECT(args.key == 10 || args.key == 11);
-		}
+		idx = binsearch(ARRAY_SIZE(haystack), &integer_needle_lesseq, &args);
+		EXPECT(idx == testcases[i].expected_idx);
 	}
 }
 
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 3/7] reftable/refname: refactor binary search over refnames
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-22 18:55   ` Justin Tobler
  2024-03-22 12:22 ` [PATCH 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 3244 bytes --]

It is comparatively hard to understand how exactly the binary search
over refnames works given that the function and variable names are not
exactly easy to grasp. Rename them to make this more obvious. This
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/refname.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/reftable/refname.c b/reftable/refname.c
index 64eba1b886..9ec488d727 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -12,15 +12,15 @@
 #include "refname.h"
 #include "reftable-iterator.h"
 
-struct find_arg {
-	char **names;
-	const char *want;
+struct refname_needle_lesseq_args {
+	char **haystack;
+	const char *needle;
 };
 
-static int find_name(size_t k, void *arg)
+static int refname_needle_lesseq(size_t k, void *arg)
 {
-	struct find_arg *f_arg = arg;
-	return strcmp(f_arg->names[k], f_arg->want) >= 0;
+	struct refname_needle_lesseq_args *f_arg = arg;
+	return strcmp(f_arg->needle, f_arg->haystack[k]) <= 0;
 }
 
 static int modification_has_ref(struct modification *mod, const char *name)
@@ -29,21 +29,21 @@ static int modification_has_ref(struct modification *mod, const char *name)
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = name,
+		struct refname_needle_lesseq_args arg = {
+			.haystack = mod->add,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
 		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
 	}
 
 	if (mod->del_len > 0) {
-		struct find_arg arg = {
-			.names = mod->del,
-			.want = name,
+		struct refname_needle_lesseq_args arg = {
+			.haystack = mod->del,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
 		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
 	}
@@ -71,11 +71,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = prefix,
+		struct refname_needle_lesseq_args arg = {
+			.haystack = mod->add,
+			.needle = prefix,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -90,11 +90,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			goto done;
 
 		if (mod->del_len > 0) {
-			struct find_arg arg = {
-				.names = mod->del,
-				.want = ref.refname,
+			struct refname_needle_lesseq_args arg = {
+				.haystack = mod->del,
+				.needle = ref.refname,
 			};
-			size_t idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
 			if (idx < mod->del_len &&
 			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 4/7] reftable/block: refactor binary search over restart points
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2024-03-22 12:22 ` [PATCH 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 5818 bytes --]

When seeking a record in our block reader we perform a binary search
over the block's restart points so that we don't have to do a linear
scan over the whole block. The logic to do so is quite intricate though,
which makes it hard to understand.

Improve documentation and rename some of the functions and variables so
that the code becomes easier to understand overall. This refactoring
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 97 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 70 insertions(+), 27 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 422885bddb..6cd4c053df 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -273,34 +273,36 @@ void block_reader_start(struct block_reader *br, struct block_iter *it)
 	it->next_off = br->header_off + 4;
 }
 
-struct restart_find_args {
+struct restart_needle_less_args {
 	int error;
-	struct strbuf key;
-	struct block_reader *r;
+	struct strbuf needle;
+	struct block_reader *reader;
 };
 
-static int restart_key_less(size_t idx, void *args)
+static int restart_needle_less(size_t idx, void *_args)
 {
-	struct restart_find_args *a = args;
-	uint32_t off = block_reader_restart_offset(a->r, idx);
+	struct restart_needle_less_args *args = _args;
+	uint32_t off = block_reader_restart_offset(args->reader, idx);
 	struct string_view in = {
-		.buf = a->r->block.data + off,
-		.len = a->r->block_len - off,
+		.buf = args->reader->block.data + off,
+		.len = args->reader->block_len - off,
 	};
-
-	/* the restart key is verbatim in the block, so this could avoid the
-	   alloc for decoding the key */
-	struct strbuf rkey = STRBUF_INIT;
+	struct strbuf kth_restart_key = STRBUF_INIT;
 	uint8_t unused_extra;
-	int n = reftable_decode_key(&rkey, &unused_extra, in);
-	int result;
+	int result, n;
+
+	/*
+	 * TODO: The restart key is verbatim in the block, so we can in theory
+	 * avoid decoding the key and thus save some allocations.
+	 */
+	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
 	if (n < 0) {
-		a->error = 1;
+		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&a->key, &rkey);
-	strbuf_release(&rkey);
+	result = strbuf_cmp(&args->needle, &kth_restart_key);
+	strbuf_release(&kth_restart_key);
 	return result < 0;
 }
 
@@ -376,9 +378,9 @@ void block_iter_close(struct block_iter *it)
 int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		      struct strbuf *want)
 {
-	struct restart_find_args args = {
-		.key = *want,
-		.r = br,
+	struct restart_needle_less_args args = {
+		.needle = *want,
+		.reader = br,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
@@ -390,7 +392,35 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		goto done;
 	}
 
-	i = binsearch(br->restart_count, &restart_key_less, &args);
+	/*
+	 * Perform a binary search over the block's restart points, which
+	 * avoids doing a linear scan over the whole block. Like this, we
+	 * identify the section of the block that should contain our key.
+	 *
+	 * Note that we explicitly search for the first restart point _greater_
+	 * than the sought-after record, not _greater or equal_ to it. In case
+	 * the sought-after record is located directly at the restart point we
+	 * would otherwise start doing the linear search at the preceding
+	 * restart point. While that works alright, we would end up scanning
+	 * too many record.
+	 */
+	i = binsearch(br->restart_count, &restart_needle_less, &args);
+
+	/*
+	 * Now there are multiple cases:
+	 *
+	 *   - `i == 0`: The wanted record must be contained before the first
+	 *     restart point. We will thus start searching for the record in
+	 *     that section after accounting for the header offset.
+	 *
+	 *   - `i == restart_count`: The wanted record was not found at any of
+	 *     the restart points. As there is no restart point at the end of
+	 *     the section the record may thus be contained in the last block.
+	 *
+	 *   - `i > 0`: The wanted record must be contained in the section
+	 *     before the found restart point. We thus do a linear search
+	 *     starting from the preceding restart point.
+	 */
 	if (i > 0)
 		it->next_off = block_reader_restart_offset(br, i - 1);
 	else
@@ -399,21 +429,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 
 	reftable_record_init(&rec, block_reader_type(br));
 
-	/* We're looking for the last entry less/equal than the wanted key, so
-	   we have to go one entry too far and then back up.
-	*/
+	/*
+	 * We're looking for the last entry less than the wanted key so that
+	 * the next call to `block_reader_next()` would yield the wanted
+	 * record. We thus don't want to position our reader at the sought
+	 * after record, but one before. To do so, we have to go one entry too
+	 * far and then back up.
+	 */
 	while (1) {
 		block_iter_copy_from(&next, it);
 		err = block_iter_next(&next, &rec);
 		if (err < 0)
 			goto done;
-
-		reftable_record_key(&rec, &it->last_key);
-		if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
+		if (err > 0) {
 			err = 0;
 			goto done;
 		}
 
+		/*
+		 * Check whether the current key is greater or equal to the
+		 * sought-after key. In case it is greater we know that the
+		 * record does not exist in the block and can thus abort early.
+		 * In case it is equal to the sought-after key we have found
+		 * the desired record.
+		 */
+		reftable_record_key(&rec, &it->last_key);
+		if (strbuf_cmp(&it->last_key, want) >= 0)
+			goto done;
+
 		block_iter_copy_from(it, &next);
 	}
 
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 5/7] reftable/block: fix error handling when searching restart points
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2024-03-22 12:22 ` [PATCH 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 2754 bytes --]

When doing the binary search over restart points in a block we need to
decode the record keys. This decoding step can result in an error when
the block is corrupted, which we indicate to the caller of the binary
search by setting `args.error = 1`. But the only caller that exists
mishandles this because it in fact performs the error check before
calling `binsearch()`.

Fix this bug by checking for errors at the right point in time.
Furthermore, refactor `binsearch()` so that it aborts the search in case
the callback function returns a negative value so that we don't
needlessly continue to search the block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c | 5 ++++-
 reftable/basics.h | 5 +++--
 reftable/block.c  | 9 ++++-----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 2c5f34b39e..fea711db7e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -39,8 +39,11 @@ size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 	 */
 	while (hi - lo > 1) {
 		size_t mid = lo + (hi - lo) / 2;
+		int ret = f(mid, args);
+		if (ret < 0)
+			return sz;
 
-		if (f(mid, args))
+		if (ret > 0)
 			hi = mid;
 		else
 			lo = mid;
diff --git a/reftable/basics.h b/reftable/basics.h
index 2672520e76..523ecd5307 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -22,8 +22,9 @@ uint32_t get_be24(uint8_t *in);
 void put_be16(uint8_t *out, uint16_t i);
 
 /*
- * find smallest index i in [0, sz) at which f(i) is true, assuming
- * that f is ascending. Return sz if f(i) is false for all indices.
+ * find smallest index i in [0, sz) at which `f(i) > 0`, assuming that f is
+ * ascending. Return sz if `f(i) == 0` for all indices. The search is aborted
+ * and `sz` is returned in case `f(i) < 0`.
  *
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
diff --git a/reftable/block.c b/reftable/block.c
index 6cd4c053df..ca80a05e21 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -387,11 +387,6 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	int err = 0;
 	size_t i;
 
-	if (args.error) {
-		err = REFTABLE_FORMAT_ERROR;
-		goto done;
-	}
-
 	/*
 	 * Perform a binary search over the block's restart points, which
 	 * avoids doing a linear scan over the whole block. Like this, we
@@ -405,6 +400,10 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	 * too many record.
 	 */
 	i = binsearch(br->restart_count, &restart_needle_less, &args);
+	if (args.error) {
+		err = REFTABLE_FORMAT_ERROR;
+		goto done;
+	}
 
 	/*
 	 * Now there are multiple cases:
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 6/7] reftable/record: extract function to decode key lengths
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (4 preceding siblings ...)
  2024-03-22 12:22 ` [PATCH 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-22 12:22 ` [PATCH 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

We're about to refactor the binary search over restart points so that it
does not need to fully decode the record keys anymore. To do so we will
need to decode the record key lengths, which is non-trivial logic.

Extract the logic to decode these lengths from `refatble_decode_key()`
so that we can reuse it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/record.c | 34 +++++++++++++++++++++++++---------
 reftable/record.h |  6 ++++++
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/reftable/record.c b/reftable/record.c
index 23b497adab..5506f3e913 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -159,26 +159,42 @@ int reftable_encode_key(int *restart, struct string_view dest,
 	return start.len - dest.len;
 }
 
-int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
-			struct string_view in)
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra)
 {
-	int start_len = in.len;
-	uint64_t prefix_len = 0;
-	uint64_t suffix_len = 0;
+	size_t start_len = in.len;
 	int n;
 
-	n = get_var_int(&prefix_len, &in);
+	n = get_var_int(prefix_len, &in);
 	if (n < 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	n = get_var_int(&suffix_len, &in);
+	n = get_var_int(suffix_len, &in);
 	if (n <= 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	*extra = (uint8_t)(suffix_len & 0x7);
-	suffix_len >>= 3;
+	*extra = (uint8_t)(*suffix_len & 0x7);
+	*suffix_len >>= 3;
+
+	return start_len - in.len;
+}
+
+int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
+			struct string_view in)
+{
+	int start_len = in.len;
+	uint64_t prefix_len = 0;
+	uint64_t suffix_len = 0;
+	int n;
+
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, extra);
+	if (n < 0)
+		return -1;
+	string_view_consume(&in, n);
 
 	if (in.len < suffix_len ||
 	    prefix_len > last_key->len)
diff --git a/reftable/record.h b/reftable/record.h
index 826ee1c55c..d778133e6e 100644
--- a/reftable/record.h
+++ b/reftable/record.h
@@ -86,6 +86,12 @@ int reftable_encode_key(int *is_restart, struct string_view dest,
 			struct strbuf prev_key, struct strbuf key,
 			uint8_t extra);
 
+/* Decode a record's key lengths. */
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra);
+
 /*
  * Decode into `last_key` and `extra` from `in`. `last_key` is expected to
  * contain the decoded key of the preceding record, if any.
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 7/7] reftable/block: avoid decoding keys when searching restart points
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (5 preceding siblings ...)
  2024-03-22 12:22 ` [PATCH 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
@ 2024-03-22 12:22 ` Patrick Steinhardt
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-22 12:22 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

When searching over restart points in a block we decode the key of each
of the records, which results in a memory allocation. This is quite
pointless though given that records it restart points will never use
prefix compression and thus store their keys verbatim in the block.

Refactor the code so that we can avoid decoding the keys, which saves us
some allocations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index ca80a05e21..8bb4e43cec 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -287,23 +287,32 @@ static int restart_needle_less(size_t idx, void *_args)
 		.buf = args->reader->block.data + off,
 		.len = args->reader->block_len - off,
 	};
-	struct strbuf kth_restart_key = STRBUF_INIT;
-	uint8_t unused_extra;
-	int result, n;
+	uint64_t prefix_len, suffix_len;
+	uint8_t extra;
+	int n;
 
 	/*
-	 * TODO: The restart key is verbatim in the block, so we can in theory
-	 * avoid decoding the key and thus save some allocations.
+	 * Records at restart points are stored without prefix compression, so
+	 * there is no need to fully decode the record key here. This removes
+	 * the need for allocating memory.
 	 */
-	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
-	if (n < 0) {
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, &extra);
+	if (n < 0 || prefix_len) {
 		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&args->needle, &kth_restart_key);
-	strbuf_release(&kth_restart_key);
-	return result < 0;
+	string_view_consume(&in, n);
+	if (suffix_len > in.len) {
+		args->error = 1;
+		return -1;
+	}
+
+	n = memcmp(args->needle.buf, in.buf,
+		   args->needle.len < suffix_len ? args->needle.len : suffix_len);
+	if (n)
+		return n < 0;
+	return args->needle.len < suffix_len;
 }
 
 void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/7] reftable/basics: improve `binsearch()` test
  2024-03-22 12:22 ` [PATCH 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
@ 2024-03-22 18:46   ` Justin Tobler
  2024-03-25 10:07     ` Patrick Steinhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Justin Tobler @ 2024-03-22 18:46 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/03/22 01:22PM, Patrick Steinhardt wrote:
> The `binsearch()` test is somewhat weird in that it doesn't explicitly
> spell out its expectations. Instead it does so in a rather ad-hoc way
> with some hard-to-understand computations.
> 
> Refactor the test to spell out the needle as well as expected index for
> all testcases. This refactoring highlights that the `binsearch_func()`
> is written somewhat weirdly to find the first integer smaller than the
> needle, not smaller or equal to it. Adjust the function accordingly.
> 
> While at it, rename the callback function to better convey its meaning.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/basics_test.c | 55 ++++++++++++++++++++++++------------------
>  1 file changed, 31 insertions(+), 24 deletions(-)
> 
> diff --git a/reftable/basics_test.c b/reftable/basics_test.c
> index dc1c87c5df..85c4d1621c 100644
> --- a/reftable/basics_test.c
> +++ b/reftable/basics_test.c
> @@ -12,40 +12,47 @@ license that can be found in the LICENSE file or at
>  #include "test_framework.h"
>  #include "reftable-tests.h"
>  
> -struct binsearch_args {
> -	int key;
> -	int *arr;
> +struct integer_needle_lesseq {
> +	int needle;
> +	int *haystack;
>  };

This is probably just personal preference, but I think `key` and `arr`
in this case are a bit more straightforward. I do like that we rename
the args to be more specific. Do we want to also append `_args` to
denote that it is an argument set? Maybe `integer_lesseq_args`?

>  
> -static int binsearch_func(size_t i, void *void_args)
> +static int integer_needle_lesseq(size_t i, void *_args)
>  {
> -	struct binsearch_args *args = void_args;
> -
> -	return args->key < args->arr[i];
> +	struct integer_needle_lesseq *args = _args;
> +	return args->needle <= args->haystack[i];
>  }

-Justin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/7] reftable/refname: refactor binary search over refnames
  2024-03-22 12:22 ` [PATCH 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
@ 2024-03-22 18:55   ` Justin Tobler
  2024-03-25 10:07     ` Patrick Steinhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Justin Tobler @ 2024-03-22 18:55 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/03/22 01:22PM, Patrick Steinhardt wrote:
> It is comparatively hard to understand how exactly the binary search
> over refnames works given that the function and variable names are not
> exactly easy to grasp. Rename them to make this more obvious. This
> should not result in any change in behaviour.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/refname.c | 44 ++++++++++++++++++++++----------------------
>  1 file changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/reftable/refname.c b/reftable/refname.c
> index 64eba1b886..9ec488d727 100644
> --- a/reftable/refname.c
> +++ b/reftable/refname.c
> @@ -12,15 +12,15 @@
>  #include "refname.h"
>  #include "reftable-iterator.h"
>  
> -struct find_arg {
> -	char **names;
> -	const char *want;
> +struct refname_needle_lesseq_args {
> +	char **haystack;
> +	const char *needle;
>  };

I agree that the previous `names` and `want` are a bit ambiguous. What
do you think about `refnames` and `target_refname` instead?

-Justin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/7] reftable/basics: improve `binsearch()` test
  2024-03-22 18:46   ` Justin Tobler
@ 2024-03-25 10:07     ` Patrick Steinhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:07 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 1911 bytes --]

On Fri, Mar 22, 2024 at 01:46:56PM -0500, Justin Tobler wrote:
> On 24/03/22 01:22PM, Patrick Steinhardt wrote:
> > The `binsearch()` test is somewhat weird in that it doesn't explicitly
> > spell out its expectations. Instead it does so in a rather ad-hoc way
> > with some hard-to-understand computations.
> > 
> > Refactor the test to spell out the needle as well as expected index for
> > all testcases. This refactoring highlights that the `binsearch_func()`
> > is written somewhat weirdly to find the first integer smaller than the
> > needle, not smaller or equal to it. Adjust the function accordingly.
> > 
> > While at it, rename the callback function to better convey its meaning.
> > 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  reftable/basics_test.c | 55 ++++++++++++++++++++++++------------------
> >  1 file changed, 31 insertions(+), 24 deletions(-)
> > 
> > diff --git a/reftable/basics_test.c b/reftable/basics_test.c
> > index dc1c87c5df..85c4d1621c 100644
> > --- a/reftable/basics_test.c
> > +++ b/reftable/basics_test.c
> > @@ -12,40 +12,47 @@ license that can be found in the LICENSE file or at
> >  #include "test_framework.h"
> >  #include "reftable-tests.h"
> >  
> > -struct binsearch_args {
> > -	int key;
> > -	int *arr;
> > +struct integer_needle_lesseq {
> > +	int needle;
> > +	int *haystack;
> >  };
> 
> This is probably just personal preference, but I think `key` and `arr`
> in this case are a bit more straightforward.

I was trying to make this consistent across all the callsites, and here
I personally think that `haystack` and `needle` are well understood by
most folks and generic enough.

> I do like that we rename
> the args to be more specific. Do we want to also append `_args` to
> denote that it is an argument set? Maybe `integer_lesseq_args`?

Oh, yeah, that's an oversight indeed.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/7] reftable/refname: refactor binary search over refnames
  2024-03-22 18:55   ` Justin Tobler
@ 2024-03-25 10:07     ` Patrick Steinhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:07 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 1359 bytes --]

On Fri, Mar 22, 2024 at 01:55:17PM -0500, Justin Tobler wrote:
> On 24/03/22 01:22PM, Patrick Steinhardt wrote:
> > It is comparatively hard to understand how exactly the binary search
> > over refnames works given that the function and variable names are not
> > exactly easy to grasp. Rename them to make this more obvious. This
> > should not result in any change in behaviour.
> > 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  reftable/refname.c | 44 ++++++++++++++++++++++----------------------
> >  1 file changed, 22 insertions(+), 22 deletions(-)
> > 
> > diff --git a/reftable/refname.c b/reftable/refname.c
> > index 64eba1b886..9ec488d727 100644
> > --- a/reftable/refname.c
> > +++ b/reftable/refname.c
> > @@ -12,15 +12,15 @@
> >  #include "refname.h"
> >  #include "reftable-iterator.h"
> >  
> > -struct find_arg {
> > -	char **names;
> > -	const char *want;
> > +struct refname_needle_lesseq_args {
> > +	char **haystack;
> > +	const char *needle;
> >  };
> 
> I agree that the previous `names` and `want` are a bit ambiguous. What
> do you think about `refnames` and `target_refname` instead?

As said in the preceding commit I was rather aiming for consistency
across the callsites, so I'll keep this as-is for now. I'm happy to be
overruled though if others feel the same way.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (6 preceding siblings ...)
  2024-03-22 12:22 ` [PATCH 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
@ 2024-03-25 10:10 ` Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
                     ` (6 more replies)
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
  9 siblings, 7 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2780 bytes --]

Hi,

this is the second version of my patch series that refactors the
`binsearch()` mechanism in the reftable library. There's only a single
change compared to v1, namely a renamed struct.

Thanks!

Patrick

Patrick Steinhardt (7):
  reftable/basics: fix return type of `binsearch()` to be `size_t`
  reftable/basics: improve `binsearch()` test
  reftable/refname: refactor binary search over refnames
  reftable/block: refactor binary search over restart points
  reftable/block: fix error handling when searching restart points
  reftable/record: extract function to decode key lengths
  reftable/block: avoid decoding keys when searching restart points

 reftable/basics.c      |   7 ++-
 reftable/basics.h      |   7 +--
 reftable/basics_test.c |  55 +++++++++++---------
 reftable/block.c       | 114 ++++++++++++++++++++++++++++++-----------
 reftable/record.c      |  34 ++++++++----
 reftable/record.h      |   6 +++
 reftable/refname.c     |  53 +++++++++----------
 7 files changed, 179 insertions(+), 97 deletions(-)

Range-diff against v1:
1:  cd82ac6531 = 1:  cd82ac6531 reftable/basics: fix return type of `binsearch()` to be `size_t`
2:  7955f7983a ! 2:  a277d4fa6f reftable/basics: improve `binsearch()` test
    @@ reftable/basics_test.c: license that can be found in the LICENSE file or at
     -struct binsearch_args {
     -	int key;
     -	int *arr;
    -+struct integer_needle_lesseq {
    ++struct integer_needle_lesseq_args {
     +	int needle;
     +	int *haystack;
      };
    @@ reftable/basics_test.c: license that can be found in the LICENSE file or at
     -	struct binsearch_args *args = void_args;
     -
     -	return args->key < args->arr[i];
    -+	struct integer_needle_lesseq *args = _args;
    ++	struct integer_needle_lesseq_args *args = _args;
     +	return args->needle <= args->haystack[i];
      }
      
    @@ reftable/basics_test.c: license that can be found in the LICENSE file or at
     -		args.key = i;
     -		res = binsearch(sz, &binsearch_func, &args);
     +	for (i = 0; i < ARRAY_SIZE(testcases); i++) {
    -+		struct integer_needle_lesseq args = {
    ++		struct integer_needle_lesseq_args args = {
     +			.haystack = haystack,
     +			.needle = testcases[i].needle,
     +		};
3:  44386818ce = 3:  9ffcf45c32 reftable/refname: refactor binary search over refnames
4:  f56275f288 = 4:  5e20d93ae0 reftable/block: refactor binary search over restart points
5:  36b1ef8e5c = 5:  5bbeab114f reftable/block: fix error handling when searching restart points
6:  38666de451 = 6:  271bacb210 reftable/record: extract function to decode key lengths
7:  f716400686 = 7:  e751b3c536 reftable/block: avoid decoding keys when searching restart points
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t`
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
@ 2024-03-25 10:10   ` Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 4578 bytes --]

The `binsearch()` function can be used to find the first element for
which a callback functions returns a truish value. But while the array
size is of type `size_t`, the function in fact returns an `int` that is
supposed to index into that array.

Fix the function signature to return a `size_t`. This conversion does
not change any semantics given that the function would only ever return
a value in the range `[0, sz]` anyway.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c      |  2 +-
 reftable/basics.h      |  2 +-
 reftable/basics_test.c |  6 +++---
 reftable/block.c       |  3 ++-
 reftable/refname.c     | 17 +++++++----------
 5 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 0785aff941..2c5f34b39e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -27,7 +27,7 @@ void put_be16(uint8_t *out, uint16_t i)
 	out[1] = (uint8_t)(i & 0xff);
 }
 
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 {
 	size_t lo = 0;
 	size_t hi = sz;
diff --git a/reftable/basics.h b/reftable/basics.h
index 91f3533efe..2672520e76 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -28,7 +28,7 @@ void put_be16(uint8_t *out, uint16_t i);
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
  */
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
 
 /*
  * Frees a NULL terminated array of malloced strings. The array itself is also
diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index 1fcd229725..dc1c87c5df 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -34,15 +34,15 @@ static void test_binsearch(void)
 
 	int i = 0;
 	for (i = 1; i < 11; i++) {
-		int res;
+		size_t res;
+
 		args.key = i;
 		res = binsearch(sz, &binsearch_func, &args);
 
 		if (res < sz) {
 			EXPECT(args.key < arr[res]);
-			if (res > 0) {
+			if (res > 0)
 				EXPECT(args.key >= arr[res - 1]);
-			}
 		} else {
 			EXPECT(args.key == 10 || args.key == 11);
 		}
diff --git a/reftable/block.c b/reftable/block.c
index e2a2cee58d..422885bddb 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -382,7 +382,8 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
-	int err = 0, i;
+	int err = 0;
+	size_t i;
 
 	if (args.error) {
 		err = REFTABLE_FORMAT_ERROR;
diff --git a/reftable/refname.c b/reftable/refname.c
index 7570e4acf9..64eba1b886 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -33,10 +33,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->add,
 			.want = name,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
-		if (idx < mod->add_len && !strcmp(mod->add[idx], name)) {
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
-		}
 	}
 
 	if (mod->del_len > 0) {
@@ -44,10 +43,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->del,
 			.want = name,
 		};
-		int idx = binsearch(mod->del_len, find_name, &arg);
-		if (idx < mod->del_len && !strcmp(mod->del[idx], name)) {
+		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
-		}
 	}
 
 	err = reftable_table_read_ref(&mod->tab, name, &ref);
@@ -77,7 +75,7 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			.names = mod->add,
 			.want = prefix,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -96,11 +94,10 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 				.names = mod->del,
 				.want = ref.refname,
 			};
-			int idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, find_name, &arg);
 			if (idx < mod->del_len &&
-			    !strcmp(ref.refname, mod->del[idx])) {
+			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-			}
 		}
 
 		if (strncmp(ref.refname, prefix, strlen(prefix))) {
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 2/7] reftable/basics: improve `binsearch()` test
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
@ 2024-03-25 10:10   ` Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2541 bytes --]

The `binsearch()` test is somewhat weird in that it doesn't explicitly
spell out its expectations. Instead it does so in a rather ad-hoc way
with some hard-to-understand computations.

Refactor the test to spell out the needle as well as expected index for
all testcases. This refactoring highlights that the `binsearch_func()`
is written somewhat weirdly to find the first integer smaller than the
needle, not smaller or equal to it. Adjust the function accordingly.

While at it, rename the callback function to better convey its meaning.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics_test.c | 55 ++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index dc1c87c5df..997c4d9e01 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -12,40 +12,47 @@ license that can be found in the LICENSE file or at
 #include "test_framework.h"
 #include "reftable-tests.h"
 
-struct binsearch_args {
-	int key;
-	int *arr;
+struct integer_needle_lesseq_args {
+	int needle;
+	int *haystack;
 };
 
-static int binsearch_func(size_t i, void *void_args)
+static int integer_needle_lesseq(size_t i, void *_args)
 {
-	struct binsearch_args *args = void_args;
-
-	return args->key < args->arr[i];
+	struct integer_needle_lesseq_args *args = _args;
+	return args->needle <= args->haystack[i];
 }
 
 static void test_binsearch(void)
 {
-	int arr[] = { 2, 4, 6, 8, 10 };
-	size_t sz = ARRAY_SIZE(arr);
-	struct binsearch_args args = {
-		.arr = arr,
+	int haystack[] = { 2, 4, 6, 8, 10 };
+	struct {
+		int needle;
+		size_t expected_idx;
+	} testcases[] = {
+		{-9000, 0},
+		{-1, 0},
+		{0, 0},
+		{2, 0},
+		{3, 1},
+		{4, 1},
+		{7, 3},
+		{9, 4},
+		{10, 4},
+		{11, 5},
+		{9000, 5},
 	};
+	size_t i = 0;
 
-	int i = 0;
-	for (i = 1; i < 11; i++) {
-		size_t res;
-
-		args.key = i;
-		res = binsearch(sz, &binsearch_func, &args);
+	for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+		struct integer_needle_lesseq_args args = {
+			.haystack = haystack,
+			.needle = testcases[i].needle,
+		};
+		size_t idx;
 
-		if (res < sz) {
-			EXPECT(args.key < arr[res]);
-			if (res > 0)
-				EXPECT(args.key >= arr[res - 1]);
-		} else {
-			EXPECT(args.key == 10 || args.key == 11);
-		}
+		idx = binsearch(ARRAY_SIZE(haystack), &integer_needle_lesseq, &args);
+		EXPECT(idx == testcases[i].expected_idx);
 	}
 }
 
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 3/7] reftable/refname: refactor binary search over refnames
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
@ 2024-03-25 10:10   ` Patrick Steinhardt
  2024-04-02 16:27     ` Justin Tobler
  2024-03-25 10:10   ` [PATCH v2 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 3246 bytes --]

It is comparatively hard to understand how exactly the binary search
over refnames works given that the function and variable names are not
exactly easy to grasp. Rename them to make this more obvious. This
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/refname.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/reftable/refname.c b/reftable/refname.c
index 64eba1b886..9ec488d727 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -12,15 +12,15 @@
 #include "refname.h"
 #include "reftable-iterator.h"
 
-struct find_arg {
-	char **names;
-	const char *want;
+struct refname_needle_lesseq_args {
+	char **haystack;
+	const char *needle;
 };
 
-static int find_name(size_t k, void *arg)
+static int refname_needle_lesseq(size_t k, void *arg)
 {
-	struct find_arg *f_arg = arg;
-	return strcmp(f_arg->names[k], f_arg->want) >= 0;
+	struct refname_needle_lesseq_args *f_arg = arg;
+	return strcmp(f_arg->needle, f_arg->haystack[k]) <= 0;
 }
 
 static int modification_has_ref(struct modification *mod, const char *name)
@@ -29,21 +29,21 @@ static int modification_has_ref(struct modification *mod, const char *name)
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = name,
+		struct refname_needle_lesseq_args arg = {
+			.haystack = mod->add,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
 		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
 	}
 
 	if (mod->del_len > 0) {
-		struct find_arg arg = {
-			.names = mod->del,
-			.want = name,
+		struct refname_needle_lesseq_args arg = {
+			.haystack = mod->del,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
 		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
 	}
@@ -71,11 +71,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = prefix,
+		struct refname_needle_lesseq_args arg = {
+			.haystack = mod->add,
+			.needle = prefix,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -90,11 +90,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			goto done;
 
 		if (mod->del_len > 0) {
-			struct find_arg arg = {
-				.names = mod->del,
-				.want = ref.refname,
+			struct refname_needle_lesseq_args arg = {
+				.haystack = mod->del,
+				.needle = ref.refname,
 			};
-			size_t idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
 			if (idx < mod->del_len &&
 			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 4/7] reftable/block: refactor binary search over restart points
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2024-03-25 10:10   ` [PATCH v2 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
@ 2024-03-25 10:10   ` Patrick Steinhardt
  2024-04-02 16:42     ` Justin Tobler
  2024-03-25 10:10   ` [PATCH v2 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 5820 bytes --]

When seeking a record in our block reader we perform a binary search
over the block's restart points so that we don't have to do a linear
scan over the whole block. The logic to do so is quite intricate though,
which makes it hard to understand.

Improve documentation and rename some of the functions and variables so
that the code becomes easier to understand overall. This refactoring
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 97 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 70 insertions(+), 27 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 422885bddb..6cd4c053df 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -273,34 +273,36 @@ void block_reader_start(struct block_reader *br, struct block_iter *it)
 	it->next_off = br->header_off + 4;
 }
 
-struct restart_find_args {
+struct restart_needle_less_args {
 	int error;
-	struct strbuf key;
-	struct block_reader *r;
+	struct strbuf needle;
+	struct block_reader *reader;
 };
 
-static int restart_key_less(size_t idx, void *args)
+static int restart_needle_less(size_t idx, void *_args)
 {
-	struct restart_find_args *a = args;
-	uint32_t off = block_reader_restart_offset(a->r, idx);
+	struct restart_needle_less_args *args = _args;
+	uint32_t off = block_reader_restart_offset(args->reader, idx);
 	struct string_view in = {
-		.buf = a->r->block.data + off,
-		.len = a->r->block_len - off,
+		.buf = args->reader->block.data + off,
+		.len = args->reader->block_len - off,
 	};
-
-	/* the restart key is verbatim in the block, so this could avoid the
-	   alloc for decoding the key */
-	struct strbuf rkey = STRBUF_INIT;
+	struct strbuf kth_restart_key = STRBUF_INIT;
 	uint8_t unused_extra;
-	int n = reftable_decode_key(&rkey, &unused_extra, in);
-	int result;
+	int result, n;
+
+	/*
+	 * TODO: The restart key is verbatim in the block, so we can in theory
+	 * avoid decoding the key and thus save some allocations.
+	 */
+	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
 	if (n < 0) {
-		a->error = 1;
+		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&a->key, &rkey);
-	strbuf_release(&rkey);
+	result = strbuf_cmp(&args->needle, &kth_restart_key);
+	strbuf_release(&kth_restart_key);
 	return result < 0;
 }
 
@@ -376,9 +378,9 @@ void block_iter_close(struct block_iter *it)
 int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		      struct strbuf *want)
 {
-	struct restart_find_args args = {
-		.key = *want,
-		.r = br,
+	struct restart_needle_less_args args = {
+		.needle = *want,
+		.reader = br,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
@@ -390,7 +392,35 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		goto done;
 	}
 
-	i = binsearch(br->restart_count, &restart_key_less, &args);
+	/*
+	 * Perform a binary search over the block's restart points, which
+	 * avoids doing a linear scan over the whole block. Like this, we
+	 * identify the section of the block that should contain our key.
+	 *
+	 * Note that we explicitly search for the first restart point _greater_
+	 * than the sought-after record, not _greater or equal_ to it. In case
+	 * the sought-after record is located directly at the restart point we
+	 * would otherwise start doing the linear search at the preceding
+	 * restart point. While that works alright, we would end up scanning
+	 * too many record.
+	 */
+	i = binsearch(br->restart_count, &restart_needle_less, &args);
+
+	/*
+	 * Now there are multiple cases:
+	 *
+	 *   - `i == 0`: The wanted record must be contained before the first
+	 *     restart point. We will thus start searching for the record in
+	 *     that section after accounting for the header offset.
+	 *
+	 *   - `i == restart_count`: The wanted record was not found at any of
+	 *     the restart points. As there is no restart point at the end of
+	 *     the section the record may thus be contained in the last block.
+	 *
+	 *   - `i > 0`: The wanted record must be contained in the section
+	 *     before the found restart point. We thus do a linear search
+	 *     starting from the preceding restart point.
+	 */
 	if (i > 0)
 		it->next_off = block_reader_restart_offset(br, i - 1);
 	else
@@ -399,21 +429,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 
 	reftable_record_init(&rec, block_reader_type(br));
 
-	/* We're looking for the last entry less/equal than the wanted key, so
-	   we have to go one entry too far and then back up.
-	*/
+	/*
+	 * We're looking for the last entry less than the wanted key so that
+	 * the next call to `block_reader_next()` would yield the wanted
+	 * record. We thus don't want to position our reader at the sought
+	 * after record, but one before. To do so, we have to go one entry too
+	 * far and then back up.
+	 */
 	while (1) {
 		block_iter_copy_from(&next, it);
 		err = block_iter_next(&next, &rec);
 		if (err < 0)
 			goto done;
-
-		reftable_record_key(&rec, &it->last_key);
-		if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
+		if (err > 0) {
 			err = 0;
 			goto done;
 		}
 
+		/*
+		 * Check whether the current key is greater or equal to the
+		 * sought-after key. In case it is greater we know that the
+		 * record does not exist in the block and can thus abort early.
+		 * In case it is equal to the sought-after key we have found
+		 * the desired record.
+		 */
+		reftable_record_key(&rec, &it->last_key);
+		if (strbuf_cmp(&it->last_key, want) >= 0)
+			goto done;
+
 		block_iter_copy_from(it, &next);
 	}
 
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 5/7] reftable/block: fix error handling when searching restart points
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2024-03-25 10:10   ` [PATCH v2 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
@ 2024-03-25 10:10   ` Patrick Steinhardt
  2024-03-25 10:10   ` [PATCH v2 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
  2024-03-25 10:11   ` [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

When doing the binary search over restart points in a block we need to
decode the record keys. This decoding step can result in an error when
the block is corrupted, which we indicate to the caller of the binary
search by setting `args.error = 1`. But the only caller that exists
mishandles this because it in fact performs the error check before
calling `binsearch()`.

Fix this bug by checking for errors at the right point in time.
Furthermore, refactor `binsearch()` so that it aborts the search in case
the callback function returns a negative value so that we don't
needlessly continue to search the block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c | 5 ++++-
 reftable/basics.h | 5 +++--
 reftable/block.c  | 9 ++++-----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 2c5f34b39e..fea711db7e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -39,8 +39,11 @@ size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 	 */
 	while (hi - lo > 1) {
 		size_t mid = lo + (hi - lo) / 2;
+		int ret = f(mid, args);
+		if (ret < 0)
+			return sz;
 
-		if (f(mid, args))
+		if (ret > 0)
 			hi = mid;
 		else
 			lo = mid;
diff --git a/reftable/basics.h b/reftable/basics.h
index 2672520e76..523ecd5307 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -22,8 +22,9 @@ uint32_t get_be24(uint8_t *in);
 void put_be16(uint8_t *out, uint16_t i);
 
 /*
- * find smallest index i in [0, sz) at which f(i) is true, assuming
- * that f is ascending. Return sz if f(i) is false for all indices.
+ * find smallest index i in [0, sz) at which `f(i) > 0`, assuming that f is
+ * ascending. Return sz if `f(i) == 0` for all indices. The search is aborted
+ * and `sz` is returned in case `f(i) < 0`.
  *
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
diff --git a/reftable/block.c b/reftable/block.c
index 6cd4c053df..ca80a05e21 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -387,11 +387,6 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	int err = 0;
 	size_t i;
 
-	if (args.error) {
-		err = REFTABLE_FORMAT_ERROR;
-		goto done;
-	}
-
 	/*
 	 * Perform a binary search over the block's restart points, which
 	 * avoids doing a linear scan over the whole block. Like this, we
@@ -405,6 +400,10 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	 * too many record.
 	 */
 	i = binsearch(br->restart_count, &restart_needle_less, &args);
+	if (args.error) {
+		err = REFTABLE_FORMAT_ERROR;
+		goto done;
+	}
 
 	/*
 	 * Now there are multiple cases:
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 6/7] reftable/record: extract function to decode key lengths
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2024-03-25 10:10   ` [PATCH v2 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
@ 2024-03-25 10:10   ` Patrick Steinhardt
  2024-03-25 10:11   ` [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:10 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2614 bytes --]

We're about to refactor the binary search over restart points so that it
does not need to fully decode the record keys anymore. To do so we will
need to decode the record key lengths, which is non-trivial logic.

Extract the logic to decode these lengths from `refatble_decode_key()`
so that we can reuse it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/record.c | 34 +++++++++++++++++++++++++---------
 reftable/record.h |  6 ++++++
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/reftable/record.c b/reftable/record.c
index 23b497adab..5506f3e913 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -159,26 +159,42 @@ int reftable_encode_key(int *restart, struct string_view dest,
 	return start.len - dest.len;
 }
 
-int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
-			struct string_view in)
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra)
 {
-	int start_len = in.len;
-	uint64_t prefix_len = 0;
-	uint64_t suffix_len = 0;
+	size_t start_len = in.len;
 	int n;
 
-	n = get_var_int(&prefix_len, &in);
+	n = get_var_int(prefix_len, &in);
 	if (n < 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	n = get_var_int(&suffix_len, &in);
+	n = get_var_int(suffix_len, &in);
 	if (n <= 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	*extra = (uint8_t)(suffix_len & 0x7);
-	suffix_len >>= 3;
+	*extra = (uint8_t)(*suffix_len & 0x7);
+	*suffix_len >>= 3;
+
+	return start_len - in.len;
+}
+
+int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
+			struct string_view in)
+{
+	int start_len = in.len;
+	uint64_t prefix_len = 0;
+	uint64_t suffix_len = 0;
+	int n;
+
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, extra);
+	if (n < 0)
+		return -1;
+	string_view_consume(&in, n);
 
 	if (in.len < suffix_len ||
 	    prefix_len > last_key->len)
diff --git a/reftable/record.h b/reftable/record.h
index 826ee1c55c..d778133e6e 100644
--- a/reftable/record.h
+++ b/reftable/record.h
@@ -86,6 +86,12 @@ int reftable_encode_key(int *is_restart, struct string_view dest,
 			struct strbuf prev_key, struct strbuf key,
 			uint8_t extra);
 
+/* Decode a record's key lengths. */
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra);
+
 /*
  * Decode into `last_key` and `extra` from `in`. `last_key` is expected to
  * contain the decoded key of the preceding record, if any.
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2024-03-25 10:10   ` [PATCH v2 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
@ 2024-03-25 10:11   ` Patrick Steinhardt
  2024-04-02 16:47     ` Justin Tobler
  6 siblings, 1 reply; 45+ messages in thread
From: Patrick Steinhardt @ 2024-03-25 10:11 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2061 bytes --]

When searching over restart points in a block we decode the key of each
of the records, which results in a memory allocation. This is quite
pointless though given that records it restart points will never use
prefix compression and thus store their keys verbatim in the block.

Refactor the code so that we can avoid decoding the keys, which saves us
some allocations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index ca80a05e21..8bb4e43cec 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -287,23 +287,32 @@ static int restart_needle_less(size_t idx, void *_args)
 		.buf = args->reader->block.data + off,
 		.len = args->reader->block_len - off,
 	};
-	struct strbuf kth_restart_key = STRBUF_INIT;
-	uint8_t unused_extra;
-	int result, n;
+	uint64_t prefix_len, suffix_len;
+	uint8_t extra;
+	int n;
 
 	/*
-	 * TODO: The restart key is verbatim in the block, so we can in theory
-	 * avoid decoding the key and thus save some allocations.
+	 * Records at restart points are stored without prefix compression, so
+	 * there is no need to fully decode the record key here. This removes
+	 * the need for allocating memory.
 	 */
-	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
-	if (n < 0) {
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, &extra);
+	if (n < 0 || prefix_len) {
 		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&args->needle, &kth_restart_key);
-	strbuf_release(&kth_restart_key);
-	return result < 0;
+	string_view_consume(&in, n);
+	if (suffix_len > in.len) {
+		args->error = 1;
+		return -1;
+	}
+
+	n = memcmp(args->needle.buf, in.buf,
+		   args->needle.len < suffix_len ? args->needle.len : suffix_len);
+	if (n)
+		return n < 0;
+	return args->needle.len < suffix_len;
 }
 
 void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/7] reftable/refname: refactor binary search over refnames
  2024-03-25 10:10   ` [PATCH v2 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
@ 2024-04-02 16:27     ` Justin Tobler
  2024-04-02 17:15       ` Patrick Steinhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Justin Tobler @ 2024-04-02 16:27 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> It is comparatively hard to understand how exactly the binary search
> over refnames works given that the function and variable names are not
> exactly easy to grasp. Rename them to make this more obvious. This
> should not result in any change in behaviour.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/refname.c | 44 ++++++++++++++++++++++----------------------
>  1 file changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/reftable/refname.c b/reftable/refname.c
> index 64eba1b886..9ec488d727 100644
> --- a/reftable/refname.c
> +++ b/reftable/refname.c
> @@ -12,15 +12,15 @@
>  #include "refname.h"
>  #include "reftable-iterator.h"
>  
> -struct find_arg {
> -	char **names;
> -	const char *want;
> +struct refname_needle_lesseq_args {
> +	char **haystack;
> +	const char *needle;
>  };
>  
> -static int find_name(size_t k, void *arg)
> +static int refname_needle_lesseq(size_t k, void *arg)
>  {
> -	struct find_arg *f_arg = arg;
> -	return strcmp(f_arg->names[k], f_arg->want) >= 0;
> +	struct refname_needle_lesseq_args *f_arg = arg;

nit: Looks like the `f_arg` variable name is a remnant from when the
type was called `find_arg`. Do we want to rename this to `args`? We 
could also rename `void *arg` to `void *_args`.

> +	return strcmp(f_arg->needle, f_arg->haystack[k]) <= 0;
>  }
>  
>  static int modification_has_ref(struct modification *mod, const char *name)
> @@ -29,21 +29,21 @@ static int modification_has_ref(struct modification *mod, const char *name)
>  	int err = 0;
>  
>  	if (mod->add_len > 0) {
> -		struct find_arg arg = {
> -			.names = mod->add,
> -			.want = name,
> +		struct refname_needle_lesseq_args arg = {
> +			.haystack = mod->add,
> +			.needle = name,
>  		};
> -		size_t idx = binsearch(mod->add_len, find_name, &arg);
> +		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
>  		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
>  			return 0;
>  	}
>  
>  	if (mod->del_len > 0) {
> -		struct find_arg arg = {
> -			.names = mod->del,
> -			.want = name,
> +		struct refname_needle_lesseq_args arg = {

nit: In the other commits we use `args`. Do we want to be consistent?

-Justin

> +			.haystack = mod->del,
> +			.needle = name,
>  		};
> -		size_t idx = binsearch(mod->del_len, find_name, &arg);
> +		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
>  		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
>  			return 1;
>  	}
> @@ -71,11 +71,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
>  	int err = 0;
>  
>  	if (mod->add_len > 0) {
> -		struct find_arg arg = {
> -			.names = mod->add,
> -			.want = prefix,
> +		struct refname_needle_lesseq_args arg = {
> +			.haystack = mod->add,
> +			.needle = prefix,
>  		};
> -		size_t idx = binsearch(mod->add_len, find_name, &arg);
> +		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
>  		if (idx < mod->add_len &&
>  		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
>  			goto done;
> @@ -90,11 +90,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
>  			goto done;
>  
>  		if (mod->del_len > 0) {
> -			struct find_arg arg = {
> -				.names = mod->del,
> -				.want = ref.refname,
> +			struct refname_needle_lesseq_args arg = {
> +				.haystack = mod->del,
> +				.needle = ref.refname,
>  			};
> -			size_t idx = binsearch(mod->del_len, find_name, &arg);
> +			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
>  			if (idx < mod->del_len &&
>  			    !strcmp(ref.refname, mod->del[idx]))
>  				continue;
> -- 
> 2.44.GIT
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/7] reftable/block: refactor binary search over restart points
  2024-03-25 10:10   ` [PATCH v2 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
@ 2024-04-02 16:42     ` Justin Tobler
  2024-04-02 17:15       ` Patrick Steinhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Justin Tobler @ 2024-04-02 16:42 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> When seeking a record in our block reader we perform a binary search
> over the block's restart points so that we don't have to do a linear
> scan over the whole block. The logic to do so is quite intricate though,
> which makes it hard to understand.
> 
> Improve documentation and rename some of the functions and variables so
> that the code becomes easier to understand overall. This refactoring
> should not result in any change in behaviour.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
...  
> -	i = binsearch(br->restart_count, &restart_key_less, &args);
> +	/*
> +	 * Perform a binary search over the block's restart points, which
> +	 * avoids doing a linear scan over the whole block. Like this, we
> +	 * identify the section of the block that should contain our key.
> +	 *
> +	 * Note that we explicitly search for the first restart point _greater_
> +	 * than the sought-after record, not _greater or equal_ to it. In case
> +	 * the sought-after record is located directly at the restart point we
> +	 * would otherwise start doing the linear search at the preceding
> +	 * restart point. While that works alright, we would end up scanning
> +	 * too many record.
> +	 */
> +	i = binsearch(br->restart_count, &restart_needle_less, &args);
> +
> +	/*
> +	 * Now there are multiple cases:
> +	 *
> +	 *   - `i == 0`: The wanted record must be contained before the first
> +	 *     restart point. We will thus start searching for the record in
> +	 *     that section after accounting for the header offset.

To clarify my understanding, does a restart_offset not exist at the
beginning of a block? I was originally under the impression that the
first reset point would be at the beginning of the block (or just after
the header). If this was the case, the first restart point being greater
would indicate that the wanted record is not contained within the block.

-Justin

> +	 *
> +	 *   - `i == restart_count`: The wanted record was not found at any of
> +	 *     the restart points. As there is no restart point at the end of
> +	 *     the section the record may thus be contained in the last block.
> +	 *
> +	 *   - `i > 0`: The wanted record must be contained in the section
> +	 *     before the found restart point. We thus do a linear search
> +	 *     starting from the preceding restart point.
> +	 */
>  	if (i > 0)
>  		it->next_off = block_reader_restart_offset(br, i - 1);
>  	else
> @@ -399,21 +429,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
>  
>  	reftable_record_init(&rec, block_reader_type(br));
>  
> -	/* We're looking for the last entry less/equal than the wanted key, so
> -	   we have to go one entry too far and then back up.
> -	*/
> +	/*
> +	 * We're looking for the last entry less than the wanted key so that
> +	 * the next call to `block_reader_next()` would yield the wanted
> +	 * record. We thus don't want to position our reader at the sought
> +	 * after record, but one before. To do so, we have to go one entry too
> +	 * far and then back up.
> +	 */
>  	while (1) {
>  		block_iter_copy_from(&next, it);
>  		err = block_iter_next(&next, &rec);
>  		if (err < 0)
>  			goto done;
> -
> -		reftable_record_key(&rec, &it->last_key);
> -		if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
> +		if (err > 0) {
>  			err = 0;
>  			goto done;
>  		}
>  
> +		/*
> +		 * Check whether the current key is greater or equal to the
> +		 * sought-after key. In case it is greater we know that the
> +		 * record does not exist in the block and can thus abort early.
> +		 * In case it is equal to the sought-after key we have found
> +		 * the desired record.
> +		 */
> +		reftable_record_key(&rec, &it->last_key);
> +		if (strbuf_cmp(&it->last_key, want) >= 0)
> +			goto done;
> +
>  		block_iter_copy_from(it, &next);
>  	}
>  
> -- 
> 2.44.GIT
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points
  2024-03-25 10:11   ` [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
@ 2024-04-02 16:47     ` Justin Tobler
  2024-04-02 17:15       ` Patrick Steinhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Justin Tobler @ 2024-04-02 16:47 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/03/25 11:11AM, Patrick Steinhardt wrote:
> When searching over restart points in a block we decode the key of each
> of the records, which results in a memory allocation. This is quite
> pointless though given that records it restart points will never use
> prefix compression and thus store their keys verbatim in the block.
> 
> Refactor the code so that we can avoid decoding the keys, which saves us
> some allocations.

Out of curiousity, do you have any benchmarks around this change and
would that be something we would want to add to the commit message?

-Justin

> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/block.c | 29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/reftable/block.c b/reftable/block.c
> index ca80a05e21..8bb4e43cec 100644
> --- a/reftable/block.c
> +++ b/reftable/block.c
> @@ -287,23 +287,32 @@ static int restart_needle_less(size_t idx, void *_args)
>  		.buf = args->reader->block.data + off,
>  		.len = args->reader->block_len - off,
>  	};
> -	struct strbuf kth_restart_key = STRBUF_INIT;
> -	uint8_t unused_extra;
> -	int result, n;
> +	uint64_t prefix_len, suffix_len;
> +	uint8_t extra;
> +	int n;
>  
>  	/*
> -	 * TODO: The restart key is verbatim in the block, so we can in theory
> -	 * avoid decoding the key and thus save some allocations.
> +	 * Records at restart points are stored without prefix compression, so
> +	 * there is no need to fully decode the record key here. This removes
> +	 * the need for allocating memory.
>  	 */
> -	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
> -	if (n < 0) {
> +	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, &extra);
> +	if (n < 0 || prefix_len) {
>  		args->error = 1;
>  		return -1;
>  	}
>  
> -	result = strbuf_cmp(&args->needle, &kth_restart_key);
> -	strbuf_release(&kth_restart_key);
> -	return result < 0;
> +	string_view_consume(&in, n);
> +	if (suffix_len > in.len) {
> +		args->error = 1;
> +		return -1;
> +	}
> +
> +	n = memcmp(args->needle.buf, in.buf,
> +		   args->needle.len < suffix_len ? args->needle.len : suffix_len);
> +	if (n)
> +		return n < 0;
> +	return args->needle.len < suffix_len;
>  }
>  
>  void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
> -- 
> 2.44.GIT
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/7] reftable/refname: refactor binary search over refnames
  2024-04-02 16:27     ` Justin Tobler
@ 2024-04-02 17:15       ` Patrick Steinhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:15 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 4208 bytes --]

On Tue, Apr 02, 2024 at 11:27:01AM -0500, Justin Tobler wrote:
> On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> > It is comparatively hard to understand how exactly the binary search
> > over refnames works given that the function and variable names are not
> > exactly easy to grasp. Rename them to make this more obvious. This
> > should not result in any change in behaviour.
> > 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  reftable/refname.c | 44 ++++++++++++++++++++++----------------------
> >  1 file changed, 22 insertions(+), 22 deletions(-)
> > 
> > diff --git a/reftable/refname.c b/reftable/refname.c
> > index 64eba1b886..9ec488d727 100644
> > --- a/reftable/refname.c
> > +++ b/reftable/refname.c
> > @@ -12,15 +12,15 @@
> >  #include "refname.h"
> >  #include "reftable-iterator.h"
> >  
> > -struct find_arg {
> > -	char **names;
> > -	const char *want;
> > +struct refname_needle_lesseq_args {
> > +	char **haystack;
> > +	const char *needle;
> >  };
> >  
> > -static int find_name(size_t k, void *arg)
> > +static int refname_needle_lesseq(size_t k, void *arg)
> >  {
> > -	struct find_arg *f_arg = arg;
> > -	return strcmp(f_arg->names[k], f_arg->want) >= 0;
> > +	struct refname_needle_lesseq_args *f_arg = arg;
> 
> nit: Looks like the `f_arg` variable name is a remnant from when the
> type was called `find_arg`. Do we want to rename this to `args`? We 
> could also rename `void *arg` to `void *_args`.

Yup.

> > +	return strcmp(f_arg->needle, f_arg->haystack[k]) <= 0;
> >  }
> >  
> >  static int modification_has_ref(struct modification *mod, const char *name)
> > @@ -29,21 +29,21 @@ static int modification_has_ref(struct modification *mod, const char *name)
> >  	int err = 0;
> >  
> >  	if (mod->add_len > 0) {
> > -		struct find_arg arg = {
> > -			.names = mod->add,
> > -			.want = name,
> > +		struct refname_needle_lesseq_args arg = {
> > +			.haystack = mod->add,
> > +			.needle = name,
> >  		};
> > -		size_t idx = binsearch(mod->add_len, find_name, &arg);
> > +		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
> >  		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
> >  			return 0;
> >  	}
> >  
> >  	if (mod->del_len > 0) {
> > -		struct find_arg arg = {
> > -			.names = mod->del,
> > -			.want = name,
> > +		struct refname_needle_lesseq_args arg = {
> 
> nit: In the other commits we use `args`. Do we want to be consistent?

Yeah. Given that it was my own argument to be consistent across
callsites I should rather make it truly consistent :)

Patrick

> -Justin
> 
> > +			.haystack = mod->del,
> > +			.needle = name,
> >  		};
> > -		size_t idx = binsearch(mod->del_len, find_name, &arg);
> > +		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
> >  		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
> >  			return 1;
> >  	}
> > @@ -71,11 +71,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
> >  	int err = 0;
> >  
> >  	if (mod->add_len > 0) {
> > -		struct find_arg arg = {
> > -			.names = mod->add,
> > -			.want = prefix,
> > +		struct refname_needle_lesseq_args arg = {
> > +			.haystack = mod->add,
> > +			.needle = prefix,
> >  		};
> > -		size_t idx = binsearch(mod->add_len, find_name, &arg);
> > +		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
> >  		if (idx < mod->add_len &&
> >  		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
> >  			goto done;
> > @@ -90,11 +90,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
> >  			goto done;
> >  
> >  		if (mod->del_len > 0) {
> > -			struct find_arg arg = {
> > -				.names = mod->del,
> > -				.want = ref.refname,
> > +			struct refname_needle_lesseq_args arg = {
> > +				.haystack = mod->del,
> > +				.needle = ref.refname,
> >  			};
> > -			size_t idx = binsearch(mod->del_len, find_name, &arg);
> > +			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
> >  			if (idx < mod->del_len &&
> >  			    !strcmp(ref.refname, mod->del[idx]))
> >  				continue;
> > -- 
> > 2.44.GIT
> > 
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/7] reftable/block: refactor binary search over restart points
  2024-04-02 16:42     ` Justin Tobler
@ 2024-04-02 17:15       ` Patrick Steinhardt
  2024-04-02 17:46         ` Justin Tobler
  0 siblings, 1 reply; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:15 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 4686 bytes --]

On Tue, Apr 02, 2024 at 11:42:29AM -0500, Justin Tobler wrote:
> On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> > When seeking a record in our block reader we perform a binary search
> > over the block's restart points so that we don't have to do a linear
> > scan over the whole block. The logic to do so is quite intricate though,
> > which makes it hard to understand.
> > 
> > Improve documentation and rename some of the functions and variables so
> > that the code becomes easier to understand overall. This refactoring
> > should not result in any change in behaviour.
> > 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> ...  
> > -	i = binsearch(br->restart_count, &restart_key_less, &args);
> > +	/*
> > +	 * Perform a binary search over the block's restart points, which
> > +	 * avoids doing a linear scan over the whole block. Like this, we
> > +	 * identify the section of the block that should contain our key.
> > +	 *
> > +	 * Note that we explicitly search for the first restart point _greater_
> > +	 * than the sought-after record, not _greater or equal_ to it. In case
> > +	 * the sought-after record is located directly at the restart point we
> > +	 * would otherwise start doing the linear search at the preceding
> > +	 * restart point. While that works alright, we would end up scanning
> > +	 * too many record.
> > +	 */
> > +	i = binsearch(br->restart_count, &restart_needle_less, &args);
> > +
> > +	/*
> > +	 * Now there are multiple cases:
> > +	 *
> > +	 *   - `i == 0`: The wanted record must be contained before the first
> > +	 *     restart point. We will thus start searching for the record in
> > +	 *     that section after accounting for the header offset.
> 
> To clarify my understanding, does a restart_offset not exist at the
> beginning of a block? I was originally under the impression that the
> first reset point would be at the beginning of the block (or just after
> the header). If this was the case, the first restart point being greater
> would indicate that the wanted record is not contained within the block.

It wouldn't make much sense to include it as a restart point. A restart
point is a point where the prefix compression will be reset such that
the record at that point can be read without reading preceding records.
This is always implicitly true for the first record in a block as it is
never prefix-compressed. Consequently, writing a restart point for the
first record would be a waste of disk space.

Patrick

> -Justin
> 
> > +	 *
> > +	 *   - `i == restart_count`: The wanted record was not found at any of
> > +	 *     the restart points. As there is no restart point at the end of
> > +	 *     the section the record may thus be contained in the last block.
> > +	 *
> > +	 *   - `i > 0`: The wanted record must be contained in the section
> > +	 *     before the found restart point. We thus do a linear search
> > +	 *     starting from the preceding restart point.
> > +	 */
> >  	if (i > 0)
> >  		it->next_off = block_reader_restart_offset(br, i - 1);
> >  	else
> > @@ -399,21 +429,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
> >  
> >  	reftable_record_init(&rec, block_reader_type(br));
> >  
> > -	/* We're looking for the last entry less/equal than the wanted key, so
> > -	   we have to go one entry too far and then back up.
> > -	*/
> > +	/*
> > +	 * We're looking for the last entry less than the wanted key so that
> > +	 * the next call to `block_reader_next()` would yield the wanted
> > +	 * record. We thus don't want to position our reader at the sought
> > +	 * after record, but one before. To do so, we have to go one entry too
> > +	 * far and then back up.
> > +	 */
> >  	while (1) {
> >  		block_iter_copy_from(&next, it);
> >  		err = block_iter_next(&next, &rec);
> >  		if (err < 0)
> >  			goto done;
> > -
> > -		reftable_record_key(&rec, &it->last_key);
> > -		if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
> > +		if (err > 0) {
> >  			err = 0;
> >  			goto done;
> >  		}
> >  
> > +		/*
> > +		 * Check whether the current key is greater or equal to the
> > +		 * sought-after key. In case it is greater we know that the
> > +		 * record does not exist in the block and can thus abort early.
> > +		 * In case it is equal to the sought-after key we have found
> > +		 * the desired record.
> > +		 */
> > +		reftable_record_key(&rec, &it->last_key);
> > +		if (strbuf_cmp(&it->last_key, want) >= 0)
> > +			goto done;
> > +
> >  		block_iter_copy_from(it, &next);
> >  	}
> >  
> > -- 
> > 2.44.GIT
> > 
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points
  2024-04-02 16:47     ` Justin Tobler
@ 2024-04-02 17:15       ` Patrick Steinhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:15 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 3702 bytes --]

On Tue, Apr 02, 2024 at 11:47:16AM -0500, Justin Tobler wrote:
> On 24/03/25 11:11AM, Patrick Steinhardt wrote:
> > When searching over restart points in a block we decode the key of each
> > of the records, which results in a memory allocation. This is quite
> > pointless though given that records it restart points will never use
> > prefix compression and thus store their keys verbatim in the block.
> > 
> > Refactor the code so that we can avoid decoding the keys, which saves us
> > some allocations.
> 
> Out of curiousity, do you have any benchmarks around this change and
> would that be something we would want to add to the commit message?

I don't have a benchmark. The problem is that the difference isn't
really measureable when doing a single seek, only, because seeks are
simply too fast. The only usecase where I know that there are a ton of
of record seeks are writes, but here the performance improvement is
getting drowned out by everything else.

You can try to measure allocations and indeed see a difference. But
again, this is getting drowned out by the noise for writes. With my
block reader refactorings (ps/reftable-block-iteration-optim) you can
see the difference when iterating through refs. Before:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 314 allocs, 189 frees, 106,035 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 303 allocs, 178 frees, 105,763 bytes allocated

But yeah, it's nothing that'd make you go "Oh, wow!". As said, it will
add up when doing many seeks, but I didn't manage to find a proper
benchamrk yet that would be worthy to make it into the commit message.

Patrick

> -Justin
> 
> > 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  reftable/block.c | 29 +++++++++++++++++++----------
> >  1 file changed, 19 insertions(+), 10 deletions(-)
> > 
> > diff --git a/reftable/block.c b/reftable/block.c
> > index ca80a05e21..8bb4e43cec 100644
> > --- a/reftable/block.c
> > +++ b/reftable/block.c
> > @@ -287,23 +287,32 @@ static int restart_needle_less(size_t idx, void *_args)
> >  		.buf = args->reader->block.data + off,
> >  		.len = args->reader->block_len - off,
> >  	};
> > -	struct strbuf kth_restart_key = STRBUF_INIT;
> > -	uint8_t unused_extra;
> > -	int result, n;
> > +	uint64_t prefix_len, suffix_len;
> > +	uint8_t extra;
> > +	int n;
> >  
> >  	/*
> > -	 * TODO: The restart key is verbatim in the block, so we can in theory
> > -	 * avoid decoding the key and thus save some allocations.
> > +	 * Records at restart points are stored without prefix compression, so
> > +	 * there is no need to fully decode the record key here. This removes
> > +	 * the need for allocating memory.
> >  	 */
> > -	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
> > -	if (n < 0) {
> > +	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, &extra);
> > +	if (n < 0 || prefix_len) {
> >  		args->error = 1;
> >  		return -1;
> >  	}
> >  
> > -	result = strbuf_cmp(&args->needle, &kth_restart_key);
> > -	strbuf_release(&kth_restart_key);
> > -	return result < 0;
> > +	string_view_consume(&in, n);
> > +	if (suffix_len > in.len) {
> > +		args->error = 1;
> > +		return -1;
> > +	}
> > +
> > +	n = memcmp(args->needle.buf, in.buf,
> > +		   args->needle.len < suffix_len ? args->needle.len : suffix_len);
> > +	if (n)
> > +		return n < 0;
> > +	return args->needle.len < suffix_len;
> >  }
> >  
> >  void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
> > -- 
> > 2.44.GIT
> > 
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (7 preceding siblings ...)
  2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
@ 2024-04-02 17:24 ` Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
                     ` (7 more replies)
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
  9 siblings, 8 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:24 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 5210 bytes --]

Hi,

this is the third version of my patch series that refactors the
`binsearch()` mechanism in the reftable library. There's only a single
change compared to v2, namely a rename of some arguments when calling
`refname_needle_lesseq()`.

Thanks!

Patrick

Patrick Steinhardt (7):
  reftable/basics: fix return type of `binsearch()` to be `size_t`
  reftable/basics: improve `binsearch()` test
  reftable/refname: refactor binary search over refnames
  reftable/block: refactor binary search over restart points
  reftable/block: fix error handling when searching restart points
  reftable/record: extract function to decode key lengths
  reftable/block: avoid decoding keys when searching restart points

 reftable/basics.c      |   7 ++-
 reftable/basics.h      |   7 +--
 reftable/basics_test.c |  55 +++++++++++---------
 reftable/block.c       | 114 ++++++++++++++++++++++++++++++-----------
 reftable/record.c      |  34 ++++++++----
 reftable/record.h      |   6 +++
 reftable/refname.c     |  53 +++++++++----------
 7 files changed, 179 insertions(+), 97 deletions(-)

Range-diff against v2:
1:  cd82ac6531 = 1:  baa07ef144 reftable/basics: fix return type of `binsearch()` to be `size_t`
2:  a277d4fa6f = 2:  cbc2a107c1 reftable/basics: improve `binsearch()` test
3:  9ffcf45c32 ! 3:  f5bf65e0dd reftable/refname: refactor binary search over refnames
    @@ reftable/refname.c
      };
      
     -static int find_name(size_t k, void *arg)
    -+static int refname_needle_lesseq(size_t k, void *arg)
    ++static int refname_needle_lesseq(size_t k, void *_args)
      {
     -	struct find_arg *f_arg = arg;
     -	return strcmp(f_arg->names[k], f_arg->want) >= 0;
    -+	struct refname_needle_lesseq_args *f_arg = arg;
    -+	return strcmp(f_arg->needle, f_arg->haystack[k]) <= 0;
    ++	struct refname_needle_lesseq_args *args = _args;
    ++	return strcmp(args->needle, args->haystack[k]) <= 0;
      }
      
      static int modification_has_ref(struct modification *mod, const char *name)
    @@ reftable/refname.c: static int modification_has_ref(struct modification *mod, co
     -		struct find_arg arg = {
     -			.names = mod->add,
     -			.want = name,
    -+		struct refname_needle_lesseq_args arg = {
    ++		struct refname_needle_lesseq_args args = {
     +			.haystack = mod->add,
     +			.needle = name,
      		};
     -		size_t idx = binsearch(mod->add_len, find_name, &arg);
    -+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
    ++		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &args);
      		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
      			return 0;
      	}
    @@ reftable/refname.c: static int modification_has_ref(struct modification *mod, co
     -		struct find_arg arg = {
     -			.names = mod->del,
     -			.want = name,
    -+		struct refname_needle_lesseq_args arg = {
    ++		struct refname_needle_lesseq_args args = {
     +			.haystack = mod->del,
     +			.needle = name,
      		};
     -		size_t idx = binsearch(mod->del_len, find_name, &arg);
    -+		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
    ++		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &args);
      		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
      			return 1;
      	}
    @@ reftable/refname.c: static int modification_has_ref_with_prefix(struct modificat
     -		struct find_arg arg = {
     -			.names = mod->add,
     -			.want = prefix,
    -+		struct refname_needle_lesseq_args arg = {
    ++		struct refname_needle_lesseq_args args = {
     +			.haystack = mod->add,
     +			.needle = prefix,
      		};
     -		size_t idx = binsearch(mod->add_len, find_name, &arg);
    -+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &arg);
    ++		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &args);
      		if (idx < mod->add_len &&
      		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
      			goto done;
    @@ reftable/refname.c: static int modification_has_ref_with_prefix(struct modificat
     -			struct find_arg arg = {
     -				.names = mod->del,
     -				.want = ref.refname,
    -+			struct refname_needle_lesseq_args arg = {
    ++			struct refname_needle_lesseq_args args = {
     +				.haystack = mod->del,
     +				.needle = ref.refname,
      			};
     -			size_t idx = binsearch(mod->del_len, find_name, &arg);
    -+			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &arg);
    ++			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &args);
      			if (idx < mod->del_len &&
      			    !strcmp(ref.refname, mod->del[idx]))
      				continue;
4:  5e20d93ae0 = 4:  435cd0be94 reftable/block: refactor binary search over restart points
5:  5bbeab114f = 5:  8d8abfd290 reftable/block: fix error handling when searching restart points
6:  271bacb210 = 6:  f87f7ad01a reftable/record: extract function to decode key lengths
7:  e751b3c536 = 7:  f53bf9e1cc reftable/block: avoid decoding keys when searching restart points

base-commit: 11c821f2f2a31e70fb5cc449f9a29401c333aad2
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t`
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
@ 2024-04-02 17:24   ` Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:24 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 4578 bytes --]

The `binsearch()` function can be used to find the first element for
which a callback functions returns a truish value. But while the array
size is of type `size_t`, the function in fact returns an `int` that is
supposed to index into that array.

Fix the function signature to return a `size_t`. This conversion does
not change any semantics given that the function would only ever return
a value in the range `[0, sz]` anyway.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c      |  2 +-
 reftable/basics.h      |  2 +-
 reftable/basics_test.c |  6 +++---
 reftable/block.c       |  3 ++-
 reftable/refname.c     | 17 +++++++----------
 5 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 0785aff941..2c5f34b39e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -27,7 +27,7 @@ void put_be16(uint8_t *out, uint16_t i)
 	out[1] = (uint8_t)(i & 0xff);
 }
 
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 {
 	size_t lo = 0;
 	size_t hi = sz;
diff --git a/reftable/basics.h b/reftable/basics.h
index 91f3533efe..2672520e76 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -28,7 +28,7 @@ void put_be16(uint8_t *out, uint16_t i);
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
  */
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
 
 /*
  * Frees a NULL terminated array of malloced strings. The array itself is also
diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index 1fcd229725..dc1c87c5df 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -34,15 +34,15 @@ static void test_binsearch(void)
 
 	int i = 0;
 	for (i = 1; i < 11; i++) {
-		int res;
+		size_t res;
+
 		args.key = i;
 		res = binsearch(sz, &binsearch_func, &args);
 
 		if (res < sz) {
 			EXPECT(args.key < arr[res]);
-			if (res > 0) {
+			if (res > 0)
 				EXPECT(args.key >= arr[res - 1]);
-			}
 		} else {
 			EXPECT(args.key == 10 || args.key == 11);
 		}
diff --git a/reftable/block.c b/reftable/block.c
index e2a2cee58d..422885bddb 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -382,7 +382,8 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
-	int err = 0, i;
+	int err = 0;
+	size_t i;
 
 	if (args.error) {
 		err = REFTABLE_FORMAT_ERROR;
diff --git a/reftable/refname.c b/reftable/refname.c
index 7570e4acf9..64eba1b886 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -33,10 +33,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->add,
 			.want = name,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
-		if (idx < mod->add_len && !strcmp(mod->add[idx], name)) {
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
-		}
 	}
 
 	if (mod->del_len > 0) {
@@ -44,10 +43,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->del,
 			.want = name,
 		};
-		int idx = binsearch(mod->del_len, find_name, &arg);
-		if (idx < mod->del_len && !strcmp(mod->del[idx], name)) {
+		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
-		}
 	}
 
 	err = reftable_table_read_ref(&mod->tab, name, &ref);
@@ -77,7 +75,7 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			.names = mod->add,
 			.want = prefix,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -96,11 +94,10 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 				.names = mod->del,
 				.want = ref.refname,
 			};
-			int idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, find_name, &arg);
 			if (idx < mod->del_len &&
-			    !strcmp(ref.refname, mod->del[idx])) {
+			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-			}
 		}
 
 		if (strncmp(ref.refname, prefix, strlen(prefix))) {
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 2/7] reftable/basics: improve `binsearch()` test
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
@ 2024-04-02 17:24   ` Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:24 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2541 bytes --]

The `binsearch()` test is somewhat weird in that it doesn't explicitly
spell out its expectations. Instead it does so in a rather ad-hoc way
with some hard-to-understand computations.

Refactor the test to spell out the needle as well as expected index for
all testcases. This refactoring highlights that the `binsearch_func()`
is written somewhat weirdly to find the first integer smaller than the
needle, not smaller or equal to it. Adjust the function accordingly.

While at it, rename the callback function to better convey its meaning.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics_test.c | 55 ++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index dc1c87c5df..997c4d9e01 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -12,40 +12,47 @@ license that can be found in the LICENSE file or at
 #include "test_framework.h"
 #include "reftable-tests.h"
 
-struct binsearch_args {
-	int key;
-	int *arr;
+struct integer_needle_lesseq_args {
+	int needle;
+	int *haystack;
 };
 
-static int binsearch_func(size_t i, void *void_args)
+static int integer_needle_lesseq(size_t i, void *_args)
 {
-	struct binsearch_args *args = void_args;
-
-	return args->key < args->arr[i];
+	struct integer_needle_lesseq_args *args = _args;
+	return args->needle <= args->haystack[i];
 }
 
 static void test_binsearch(void)
 {
-	int arr[] = { 2, 4, 6, 8, 10 };
-	size_t sz = ARRAY_SIZE(arr);
-	struct binsearch_args args = {
-		.arr = arr,
+	int haystack[] = { 2, 4, 6, 8, 10 };
+	struct {
+		int needle;
+		size_t expected_idx;
+	} testcases[] = {
+		{-9000, 0},
+		{-1, 0},
+		{0, 0},
+		{2, 0},
+		{3, 1},
+		{4, 1},
+		{7, 3},
+		{9, 4},
+		{10, 4},
+		{11, 5},
+		{9000, 5},
 	};
+	size_t i = 0;
 
-	int i = 0;
-	for (i = 1; i < 11; i++) {
-		size_t res;
-
-		args.key = i;
-		res = binsearch(sz, &binsearch_func, &args);
+	for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+		struct integer_needle_lesseq_args args = {
+			.haystack = haystack,
+			.needle = testcases[i].needle,
+		};
+		size_t idx;
 
-		if (res < sz) {
-			EXPECT(args.key < arr[res]);
-			if (res > 0)
-				EXPECT(args.key >= arr[res - 1]);
-		} else {
-			EXPECT(args.key == 10 || args.key == 11);
-		}
+		idx = binsearch(ARRAY_SIZE(haystack), &integer_needle_lesseq, &args);
+		EXPECT(idx == testcases[i].expected_idx);
 	}
 }
 
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 3/7] reftable/refname: refactor binary search over refnames
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
@ 2024-04-02 17:24   ` Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:24 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 3255 bytes --]

It is comparatively hard to understand how exactly the binary search
over refnames works given that the function and variable names are not
exactly easy to grasp. Rename them to make this more obvious. This
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/refname.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/reftable/refname.c b/reftable/refname.c
index 64eba1b886..bbfde15754 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -12,15 +12,15 @@
 #include "refname.h"
 #include "reftable-iterator.h"
 
-struct find_arg {
-	char **names;
-	const char *want;
+struct refname_needle_lesseq_args {
+	char **haystack;
+	const char *needle;
 };
 
-static int find_name(size_t k, void *arg)
+static int refname_needle_lesseq(size_t k, void *_args)
 {
-	struct find_arg *f_arg = arg;
-	return strcmp(f_arg->names[k], f_arg->want) >= 0;
+	struct refname_needle_lesseq_args *args = _args;
+	return strcmp(args->needle, args->haystack[k]) <= 0;
 }
 
 static int modification_has_ref(struct modification *mod, const char *name)
@@ -29,21 +29,21 @@ static int modification_has_ref(struct modification *mod, const char *name)
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = name,
+		struct refname_needle_lesseq_args args = {
+			.haystack = mod->add,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &args);
 		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
 	}
 
 	if (mod->del_len > 0) {
-		struct find_arg arg = {
-			.names = mod->del,
-			.want = name,
+		struct refname_needle_lesseq_args args = {
+			.haystack = mod->del,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &args);
 		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
 	}
@@ -71,11 +71,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = prefix,
+		struct refname_needle_lesseq_args args = {
+			.haystack = mod->add,
+			.needle = prefix,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &args);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -90,11 +90,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			goto done;
 
 		if (mod->del_len > 0) {
-			struct find_arg arg = {
-				.names = mod->del,
-				.want = ref.refname,
+			struct refname_needle_lesseq_args args = {
+				.haystack = mod->del,
+				.needle = ref.refname,
 			};
-			size_t idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &args);
 			if (idx < mod->del_len &&
 			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 4/7] reftable/block: refactor binary search over restart points
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2024-04-02 17:24   ` [PATCH v3 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
@ 2024-04-02 17:24   ` Patrick Steinhardt
  2024-04-02 17:24   ` [PATCH v3 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:24 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 5820 bytes --]

When seeking a record in our block reader we perform a binary search
over the block's restart points so that we don't have to do a linear
scan over the whole block. The logic to do so is quite intricate though,
which makes it hard to understand.

Improve documentation and rename some of the functions and variables so
that the code becomes easier to understand overall. This refactoring
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 97 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 70 insertions(+), 27 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 422885bddb..6cd4c053df 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -273,34 +273,36 @@ void block_reader_start(struct block_reader *br, struct block_iter *it)
 	it->next_off = br->header_off + 4;
 }
 
-struct restart_find_args {
+struct restart_needle_less_args {
 	int error;
-	struct strbuf key;
-	struct block_reader *r;
+	struct strbuf needle;
+	struct block_reader *reader;
 };
 
-static int restart_key_less(size_t idx, void *args)
+static int restart_needle_less(size_t idx, void *_args)
 {
-	struct restart_find_args *a = args;
-	uint32_t off = block_reader_restart_offset(a->r, idx);
+	struct restart_needle_less_args *args = _args;
+	uint32_t off = block_reader_restart_offset(args->reader, idx);
 	struct string_view in = {
-		.buf = a->r->block.data + off,
-		.len = a->r->block_len - off,
+		.buf = args->reader->block.data + off,
+		.len = args->reader->block_len - off,
 	};
-
-	/* the restart key is verbatim in the block, so this could avoid the
-	   alloc for decoding the key */
-	struct strbuf rkey = STRBUF_INIT;
+	struct strbuf kth_restart_key = STRBUF_INIT;
 	uint8_t unused_extra;
-	int n = reftable_decode_key(&rkey, &unused_extra, in);
-	int result;
+	int result, n;
+
+	/*
+	 * TODO: The restart key is verbatim in the block, so we can in theory
+	 * avoid decoding the key and thus save some allocations.
+	 */
+	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
 	if (n < 0) {
-		a->error = 1;
+		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&a->key, &rkey);
-	strbuf_release(&rkey);
+	result = strbuf_cmp(&args->needle, &kth_restart_key);
+	strbuf_release(&kth_restart_key);
 	return result < 0;
 }
 
@@ -376,9 +378,9 @@ void block_iter_close(struct block_iter *it)
 int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		      struct strbuf *want)
 {
-	struct restart_find_args args = {
-		.key = *want,
-		.r = br,
+	struct restart_needle_less_args args = {
+		.needle = *want,
+		.reader = br,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
@@ -390,7 +392,35 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		goto done;
 	}
 
-	i = binsearch(br->restart_count, &restart_key_less, &args);
+	/*
+	 * Perform a binary search over the block's restart points, which
+	 * avoids doing a linear scan over the whole block. Like this, we
+	 * identify the section of the block that should contain our key.
+	 *
+	 * Note that we explicitly search for the first restart point _greater_
+	 * than the sought-after record, not _greater or equal_ to it. In case
+	 * the sought-after record is located directly at the restart point we
+	 * would otherwise start doing the linear search at the preceding
+	 * restart point. While that works alright, we would end up scanning
+	 * too many record.
+	 */
+	i = binsearch(br->restart_count, &restart_needle_less, &args);
+
+	/*
+	 * Now there are multiple cases:
+	 *
+	 *   - `i == 0`: The wanted record must be contained before the first
+	 *     restart point. We will thus start searching for the record in
+	 *     that section after accounting for the header offset.
+	 *
+	 *   - `i == restart_count`: The wanted record was not found at any of
+	 *     the restart points. As there is no restart point at the end of
+	 *     the section the record may thus be contained in the last block.
+	 *
+	 *   - `i > 0`: The wanted record must be contained in the section
+	 *     before the found restart point. We thus do a linear search
+	 *     starting from the preceding restart point.
+	 */
 	if (i > 0)
 		it->next_off = block_reader_restart_offset(br, i - 1);
 	else
@@ -399,21 +429,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 
 	reftable_record_init(&rec, block_reader_type(br));
 
-	/* We're looking for the last entry less/equal than the wanted key, so
-	   we have to go one entry too far and then back up.
-	*/
+	/*
+	 * We're looking for the last entry less than the wanted key so that
+	 * the next call to `block_reader_next()` would yield the wanted
+	 * record. We thus don't want to position our reader at the sought
+	 * after record, but one before. To do so, we have to go one entry too
+	 * far and then back up.
+	 */
 	while (1) {
 		block_iter_copy_from(&next, it);
 		err = block_iter_next(&next, &rec);
 		if (err < 0)
 			goto done;
-
-		reftable_record_key(&rec, &it->last_key);
-		if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
+		if (err > 0) {
 			err = 0;
 			goto done;
 		}
 
+		/*
+		 * Check whether the current key is greater or equal to the
+		 * sought-after key. In case it is greater we know that the
+		 * record does not exist in the block and can thus abort early.
+		 * In case it is equal to the sought-after key we have found
+		 * the desired record.
+		 */
+		reftable_record_key(&rec, &it->last_key);
+		if (strbuf_cmp(&it->last_key, want) >= 0)
+			goto done;
+
 		block_iter_copy_from(it, &next);
 	}
 
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 5/7] reftable/block: fix error handling when searching restart points
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2024-04-02 17:24   ` [PATCH v3 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
@ 2024-04-02 17:24   ` Patrick Steinhardt
  2024-04-02 17:25   ` [PATCH v3 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:24 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

When doing the binary search over restart points in a block we need to
decode the record keys. This decoding step can result in an error when
the block is corrupted, which we indicate to the caller of the binary
search by setting `args.error = 1`. But the only caller that exists
mishandles this because it in fact performs the error check before
calling `binsearch()`.

Fix this bug by checking for errors at the right point in time.
Furthermore, refactor `binsearch()` so that it aborts the search in case
the callback function returns a negative value so that we don't
needlessly continue to search the block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c | 5 ++++-
 reftable/basics.h | 5 +++--
 reftable/block.c  | 9 ++++-----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 2c5f34b39e..fea711db7e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -39,8 +39,11 @@ size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 	 */
 	while (hi - lo > 1) {
 		size_t mid = lo + (hi - lo) / 2;
+		int ret = f(mid, args);
+		if (ret < 0)
+			return sz;
 
-		if (f(mid, args))
+		if (ret > 0)
 			hi = mid;
 		else
 			lo = mid;
diff --git a/reftable/basics.h b/reftable/basics.h
index 2672520e76..523ecd5307 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -22,8 +22,9 @@ uint32_t get_be24(uint8_t *in);
 void put_be16(uint8_t *out, uint16_t i);
 
 /*
- * find smallest index i in [0, sz) at which f(i) is true, assuming
- * that f is ascending. Return sz if f(i) is false for all indices.
+ * find smallest index i in [0, sz) at which `f(i) > 0`, assuming that f is
+ * ascending. Return sz if `f(i) == 0` for all indices. The search is aborted
+ * and `sz` is returned in case `f(i) < 0`.
  *
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
diff --git a/reftable/block.c b/reftable/block.c
index 6cd4c053df..ca80a05e21 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -387,11 +387,6 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	int err = 0;
 	size_t i;
 
-	if (args.error) {
-		err = REFTABLE_FORMAT_ERROR;
-		goto done;
-	}
-
 	/*
 	 * Perform a binary search over the block's restart points, which
 	 * avoids doing a linear scan over the whole block. Like this, we
@@ -405,6 +400,10 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	 * too many record.
 	 */
 	i = binsearch(br->restart_count, &restart_needle_less, &args);
+	if (args.error) {
+		err = REFTABLE_FORMAT_ERROR;
+		goto done;
+	}
 
 	/*
 	 * Now there are multiple cases:
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 6/7] reftable/record: extract function to decode key lengths
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2024-04-02 17:24   ` [PATCH v3 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
@ 2024-04-02 17:25   ` Patrick Steinhardt
  2024-04-02 17:25   ` [PATCH v3 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
  2024-04-02 17:49   ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Justin Tobler
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:25 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2614 bytes --]

We're about to refactor the binary search over restart points so that it
does not need to fully decode the record keys anymore. To do so we will
need to decode the record key lengths, which is non-trivial logic.

Extract the logic to decode these lengths from `refatble_decode_key()`
so that we can reuse it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/record.c | 34 +++++++++++++++++++++++++---------
 reftable/record.h |  6 ++++++
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/reftable/record.c b/reftable/record.c
index 23b497adab..5506f3e913 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -159,26 +159,42 @@ int reftable_encode_key(int *restart, struct string_view dest,
 	return start.len - dest.len;
 }
 
-int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
-			struct string_view in)
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra)
 {
-	int start_len = in.len;
-	uint64_t prefix_len = 0;
-	uint64_t suffix_len = 0;
+	size_t start_len = in.len;
 	int n;
 
-	n = get_var_int(&prefix_len, &in);
+	n = get_var_int(prefix_len, &in);
 	if (n < 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	n = get_var_int(&suffix_len, &in);
+	n = get_var_int(suffix_len, &in);
 	if (n <= 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	*extra = (uint8_t)(suffix_len & 0x7);
-	suffix_len >>= 3;
+	*extra = (uint8_t)(*suffix_len & 0x7);
+	*suffix_len >>= 3;
+
+	return start_len - in.len;
+}
+
+int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
+			struct string_view in)
+{
+	int start_len = in.len;
+	uint64_t prefix_len = 0;
+	uint64_t suffix_len = 0;
+	int n;
+
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, extra);
+	if (n < 0)
+		return -1;
+	string_view_consume(&in, n);
 
 	if (in.len < suffix_len ||
 	    prefix_len > last_key->len)
diff --git a/reftable/record.h b/reftable/record.h
index 826ee1c55c..d778133e6e 100644
--- a/reftable/record.h
+++ b/reftable/record.h
@@ -86,6 +86,12 @@ int reftable_encode_key(int *is_restart, struct string_view dest,
 			struct strbuf prev_key, struct strbuf key,
 			uint8_t extra);
 
+/* Decode a record's key lengths. */
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra);
+
 /*
  * Decode into `last_key` and `extra` from `in`. `last_key` is expected to
  * contain the decoded key of the preceding record, if any.
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 7/7] reftable/block: avoid decoding keys when searching restart points
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2024-04-02 17:25   ` [PATCH v3 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
@ 2024-04-02 17:25   ` Patrick Steinhardt
  2024-04-02 17:49   ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Justin Tobler
  7 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-02 17:25 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2061 bytes --]

When searching over restart points in a block we decode the key of each
of the records, which results in a memory allocation. This is quite
pointless though given that records it restart points will never use
prefix compression and thus store their keys verbatim in the block.

Refactor the code so that we can avoid decoding the keys, which saves us
some allocations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index ca80a05e21..8bb4e43cec 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -287,23 +287,32 @@ static int restart_needle_less(size_t idx, void *_args)
 		.buf = args->reader->block.data + off,
 		.len = args->reader->block_len - off,
 	};
-	struct strbuf kth_restart_key = STRBUF_INIT;
-	uint8_t unused_extra;
-	int result, n;
+	uint64_t prefix_len, suffix_len;
+	uint8_t extra;
+	int n;
 
 	/*
-	 * TODO: The restart key is verbatim in the block, so we can in theory
-	 * avoid decoding the key and thus save some allocations.
+	 * Records at restart points are stored without prefix compression, so
+	 * there is no need to fully decode the record key here. This removes
+	 * the need for allocating memory.
 	 */
-	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
-	if (n < 0) {
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, &extra);
+	if (n < 0 || prefix_len) {
 		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&args->needle, &kth_restart_key);
-	strbuf_release(&kth_restart_key);
-	return result < 0;
+	string_view_consume(&in, n);
+	if (suffix_len > in.len) {
+		args->error = 1;
+		return -1;
+	}
+
+	n = memcmp(args->needle.buf, in.buf,
+		   args->needle.len < suffix_len ? args->needle.len : suffix_len);
+	if (n)
+		return n < 0;
+	return args->needle.len < suffix_len;
 }
 
 void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/7] reftable/block: refactor binary search over restart points
  2024-04-02 17:15       ` Patrick Steinhardt
@ 2024-04-02 17:46         ` Justin Tobler
  2024-04-03  6:01           ` Patrick Steinhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Justin Tobler @ 2024-04-02 17:46 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/04/02 07:15PM, Patrick Steinhardt wrote:
> On Tue, Apr 02, 2024 at 11:42:29AM -0500, Justin Tobler wrote:
> > On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> > > When seeking a record in our block reader we perform a binary search
> > > over the block's restart points so that we don't have to do a linear
> > > scan over the whole block. The logic to do so is quite intricate though,
> > > which makes it hard to understand.
> > > 
> > > Improve documentation and rename some of the functions and variables so
> > > that the code becomes easier to understand overall. This refactoring
> > > should not result in any change in behaviour.
> > > 
> > > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > > ---
> > ...  
> > > -	i = binsearch(br->restart_count, &restart_key_less, &args);
> > > +	/*
> > > +	 * Perform a binary search over the block's restart points, which
> > > +	 * avoids doing a linear scan over the whole block. Like this, we
> > > +	 * identify the section of the block that should contain our key.
> > > +	 *
> > > +	 * Note that we explicitly search for the first restart point _greater_
> > > +	 * than the sought-after record, not _greater or equal_ to it. In case
> > > +	 * the sought-after record is located directly at the restart point we
> > > +	 * would otherwise start doing the linear search at the preceding
> > > +	 * restart point. While that works alright, we would end up scanning
> > > +	 * too many record.
> > > +	 */
> > > +	i = binsearch(br->restart_count, &restart_needle_less, &args);
> > > +
> > > +	/*
> > > +	 * Now there are multiple cases:
> > > +	 *
> > > +	 *   - `i == 0`: The wanted record must be contained before the first
> > > +	 *     restart point. We will thus start searching for the record in
> > > +	 *     that section after accounting for the header offset.
> > 
> > To clarify my understanding, does a restart_offset not exist at the
> > beginning of a block? I was originally under the impression that the
> > first reset point would be at the beginning of the block (or just after
> > the header). If this was the case, the first restart point being greater
> > would indicate that the wanted record is not contained within the block.
> 
> It wouldn't make much sense to include it as a restart point. A restart
> point is a point where the prefix compression will be reset such that
> the record at that point can be read without reading preceding records.
> This is always implicitly true for the first record in a block as it is
> never prefix-compressed. Consequently, writing a restart point for the
> first record would be a waste of disk space.

Thanks Patrick! Good to know :)

From Documentation/technical/reftable.txt:

> As the first ref block shares the first file block with the file
> header, all restart_offset in the first block are relative to the
> start of the file (position 0), and include the file header. This 
> forces the first restart_offset to be 28.

I'm assumming this is out-of-date.

-Justin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2024-04-02 17:25   ` [PATCH v3 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
@ 2024-04-02 17:49   ` Justin Tobler
  7 siblings, 0 replies; 45+ messages in thread
From: Justin Tobler @ 2024-04-02 17:49 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On 24/04/02 07:24PM, Patrick Steinhardt wrote:
> Hi,
> 
> this is the third version of my patch series that refactors the
> `binsearch()` mechanism in the reftable library. There's only a single
> change compared to v2, namely a rename of some arguments when calling
> `refname_needle_lesseq()`.
> 
> Thanks!
> 
> Patrick
> 
...

Thanks Patrick! This version looks good to me :)

-Justin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 4/7] reftable/block: refactor binary search over restart points
  2024-04-02 17:46         ` Justin Tobler
@ 2024-04-03  6:01           ` Patrick Steinhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:01 UTC (permalink / raw
  To: git

[-- Attachment #1: Type: text/plain, Size: 4817 bytes --]

On Tue, Apr 02, 2024 at 12:46:30PM -0500, Justin Tobler wrote:
> On 24/04/02 07:15PM, Patrick Steinhardt wrote:
> > On Tue, Apr 02, 2024 at 11:42:29AM -0500, Justin Tobler wrote:
> > > On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> > > > When seeking a record in our block reader we perform a binary search
> > > > over the block's restart points so that we don't have to do a linear
> > > > scan over the whole block. The logic to do so is quite intricate though,
> > > > which makes it hard to understand.
> > > > 
> > > > Improve documentation and rename some of the functions and variables so
> > > > that the code becomes easier to understand overall. This refactoring
> > > > should not result in any change in behaviour.
> > > > 
> > > > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > > > ---
> > > ...  
> > > > -	i = binsearch(br->restart_count, &restart_key_less, &args);
> > > > +	/*
> > > > +	 * Perform a binary search over the block's restart points, which
> > > > +	 * avoids doing a linear scan over the whole block. Like this, we
> > > > +	 * identify the section of the block that should contain our key.
> > > > +	 *
> > > > +	 * Note that we explicitly search for the first restart point _greater_
> > > > +	 * than the sought-after record, not _greater or equal_ to it. In case
> > > > +	 * the sought-after record is located directly at the restart point we
> > > > +	 * would otherwise start doing the linear search at the preceding
> > > > +	 * restart point. While that works alright, we would end up scanning
> > > > +	 * too many record.
> > > > +	 */
> > > > +	i = binsearch(br->restart_count, &restart_needle_less, &args);
> > > > +
> > > > +	/*
> > > > +	 * Now there are multiple cases:
> > > > +	 *
> > > > +	 *   - `i == 0`: The wanted record must be contained before the first
> > > > +	 *     restart point. We will thus start searching for the record in
> > > > +	 *     that section after accounting for the header offset.
> > > 
> > > To clarify my understanding, does a restart_offset not exist at the
> > > beginning of a block? I was originally under the impression that the
> > > first reset point would be at the beginning of the block (or just after
> > > the header). If this was the case, the first restart point being greater
> > > would indicate that the wanted record is not contained within the block.
> > 
> > It wouldn't make much sense to include it as a restart point. A restart
> > point is a point where the prefix compression will be reset such that
> > the record at that point can be read without reading preceding records.
> > This is always implicitly true for the first record in a block as it is
> > never prefix-compressed. Consequently, writing a restart point for the
> > first record would be a waste of disk space.
> 
> Thanks Patrick! Good to know :)
> 
> From Documentation/technical/reftable.txt:
> 
> > As the first ref block shares the first file block with the file
> > header, all restart_offset in the first block are relative to the
> > start of the file (position 0), and include the file header. This 
> > forces the first restart_offset to be 28.
> 
> I'm assumming this is out-of-date.

That paragraph actually made me re-check my own assumptions. Turns out
my claim is wrong. The function responsible for registering restarts is
`block_writer_register_restart()`, which gets a parameter `is_restart`
that is determined by `reftable_encode_key()`. And that function in turn
will set `restart = 1` whenever `prefix_len == 0`. And given that the
first record always has `prefix_len == 0`, it will thus have a restart
point, as well.

I'll update the comment like this:

diff --git a/reftable/block.c b/reftable/block.c
index 8bb4e43cec..298e8c56b9 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -417,9 +417,12 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	/*
 	 * Now there are multiple cases:
 	 *
-	 *   - `i == 0`: The wanted record must be contained before the first
-	 *     restart point. We will thus start searching for the record in
-	 *     that section after accounting for the header offset.
+	 *   - `i == 0`: The wanted record is smaller than the record found at
+	 *     the first restart point. As the first restart point is the first
+	 *     record in the block, our wanted record cannot be located in this
+	 *     block at all. We still need to position the iterator so that the
+	 *     next call to `block_iter_next()` will yield an end-of-iterator
+	 *     signal.
 	 *
 	 *   - `i == restart_count`: The wanted record was not found at any of
 	 *     the restart points. As there is no restart point at the end of

Thanks for making me challenge my own assumptions!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 0/7] reftable: improvements for the `binsearch()` mechanism
  2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
                   ` (8 preceding siblings ...)
  2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
@ 2024-04-03  6:03 ` Patrick Steinhardt
  2024-04-03  6:03   ` [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
                     ` (6 more replies)
  9 siblings, 7 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:03 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2807 bytes --]

Hi,

this is the fourth and hopefully final version of my patch series that
refactors the `binsearch()` mechanism in the reftable library.

There's only a single change compared to v3, which is a fix to a comment
that was misexplaining one of the cases when searching for the restart
points.

Thanks!

Patrick

Patrick Steinhardt (7):
  reftable/basics: fix return type of `binsearch()` to be `size_t`
  reftable/basics: improve `binsearch()` test
  reftable/refname: refactor binary search over refnames
  reftable/block: refactor binary search over restart points
  reftable/block: fix error handling when searching restart points
  reftable/record: extract function to decode key lengths
  reftable/block: avoid decoding keys when searching restart points

 reftable/basics.c      |   7 ++-
 reftable/basics.h      |   7 +--
 reftable/basics_test.c |  55 ++++++++++---------
 reftable/block.c       | 117 ++++++++++++++++++++++++++++++-----------
 reftable/record.c      |  34 ++++++++----
 reftable/record.h      |   6 +++
 reftable/refname.c     |  53 +++++++++----------
 7 files changed, 182 insertions(+), 97 deletions(-)

Range-diff against v3:
1:  baa07ef144 = 1:  baa07ef144 reftable/basics: fix return type of `binsearch()` to be `size_t`
2:  cbc2a107c1 = 2:  cbc2a107c1 reftable/basics: improve `binsearch()` test
3:  f5bf65e0dd = 3:  f5bf65e0dd reftable/refname: refactor binary search over refnames
4:  435cd0be94 ! 4:  6d4a79f3e2 reftable/block: refactor binary search over restart points
    @@ reftable/block.c: int block_reader_seek(struct block_reader *br, struct block_it
     +	/*
     +	 * Now there are multiple cases:
     +	 *
    -+	 *   - `i == 0`: The wanted record must be contained before the first
    -+	 *     restart point. We will thus start searching for the record in
    -+	 *     that section after accounting for the header offset.
    ++	 *   - `i == 0`: The wanted record is smaller than the record found at
    ++	 *     the first restart point. As the first restart point is the first
    ++	 *     record in the block, our wanted record cannot be located in this
    ++	 *     block at all. We still need to position the iterator so that the
    ++	 *     next call to `block_iter_next()` will yield an end-of-iterator
    ++	 *     signal.
     +	 *
     +	 *   - `i == restart_count`: The wanted record was not found at any of
     +	 *     the restart points. As there is no restart point at the end of
5:  8d8abfd290 = 5:  52c238ee9f reftable/block: fix error handling when searching restart points
6:  f87f7ad01a = 6:  ca41ad30f4 reftable/record: extract function to decode key lengths
7:  f53bf9e1cc = 7:  632f725dde reftable/block: avoid decoding keys when searching restart points
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t`
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
@ 2024-04-03  6:03   ` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:03 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 4578 bytes --]

The `binsearch()` function can be used to find the first element for
which a callback functions returns a truish value. But while the array
size is of type `size_t`, the function in fact returns an `int` that is
supposed to index into that array.

Fix the function signature to return a `size_t`. This conversion does
not change any semantics given that the function would only ever return
a value in the range `[0, sz]` anyway.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c      |  2 +-
 reftable/basics.h      |  2 +-
 reftable/basics_test.c |  6 +++---
 reftable/block.c       |  3 ++-
 reftable/refname.c     | 17 +++++++----------
 5 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 0785aff941..2c5f34b39e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -27,7 +27,7 @@ void put_be16(uint8_t *out, uint16_t i)
 	out[1] = (uint8_t)(i & 0xff);
 }
 
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 {
 	size_t lo = 0;
 	size_t hi = sz;
diff --git a/reftable/basics.h b/reftable/basics.h
index 91f3533efe..2672520e76 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -28,7 +28,7 @@ void put_be16(uint8_t *out, uint16_t i);
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
  */
-int binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
+size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args);
 
 /*
  * Frees a NULL terminated array of malloced strings. The array itself is also
diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index 1fcd229725..dc1c87c5df 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -34,15 +34,15 @@ static void test_binsearch(void)
 
 	int i = 0;
 	for (i = 1; i < 11; i++) {
-		int res;
+		size_t res;
+
 		args.key = i;
 		res = binsearch(sz, &binsearch_func, &args);
 
 		if (res < sz) {
 			EXPECT(args.key < arr[res]);
-			if (res > 0) {
+			if (res > 0)
 				EXPECT(args.key >= arr[res - 1]);
-			}
 		} else {
 			EXPECT(args.key == 10 || args.key == 11);
 		}
diff --git a/reftable/block.c b/reftable/block.c
index e2a2cee58d..422885bddb 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -382,7 +382,8 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
-	int err = 0, i;
+	int err = 0;
+	size_t i;
 
 	if (args.error) {
 		err = REFTABLE_FORMAT_ERROR;
diff --git a/reftable/refname.c b/reftable/refname.c
index 7570e4acf9..64eba1b886 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -33,10 +33,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->add,
 			.want = name,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
-		if (idx < mod->add_len && !strcmp(mod->add[idx], name)) {
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
-		}
 	}
 
 	if (mod->del_len > 0) {
@@ -44,10 +43,9 @@ static int modification_has_ref(struct modification *mod, const char *name)
 			.names = mod->del,
 			.want = name,
 		};
-		int idx = binsearch(mod->del_len, find_name, &arg);
-		if (idx < mod->del_len && !strcmp(mod->del[idx], name)) {
+		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
-		}
 	}
 
 	err = reftable_table_read_ref(&mod->tab, name, &ref);
@@ -77,7 +75,7 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			.names = mod->add,
 			.want = prefix,
 		};
-		int idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, find_name, &arg);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -96,11 +94,10 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 				.names = mod->del,
 				.want = ref.refname,
 			};
-			int idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, find_name, &arg);
 			if (idx < mod->del_len &&
-			    !strcmp(ref.refname, mod->del[idx])) {
+			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-			}
 		}
 
 		if (strncmp(ref.refname, prefix, strlen(prefix))) {
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 2/7] reftable/basics: improve `binsearch()` test
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
  2024-04-03  6:03   ` [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
@ 2024-04-03  6:04   ` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:04 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2541 bytes --]

The `binsearch()` test is somewhat weird in that it doesn't explicitly
spell out its expectations. Instead it does so in a rather ad-hoc way
with some hard-to-understand computations.

Refactor the test to spell out the needle as well as expected index for
all testcases. This refactoring highlights that the `binsearch_func()`
is written somewhat weirdly to find the first integer smaller than the
needle, not smaller or equal to it. Adjust the function accordingly.

While at it, rename the callback function to better convey its meaning.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics_test.c | 55 ++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/reftable/basics_test.c b/reftable/basics_test.c
index dc1c87c5df..997c4d9e01 100644
--- a/reftable/basics_test.c
+++ b/reftable/basics_test.c
@@ -12,40 +12,47 @@ license that can be found in the LICENSE file or at
 #include "test_framework.h"
 #include "reftable-tests.h"
 
-struct binsearch_args {
-	int key;
-	int *arr;
+struct integer_needle_lesseq_args {
+	int needle;
+	int *haystack;
 };
 
-static int binsearch_func(size_t i, void *void_args)
+static int integer_needle_lesseq(size_t i, void *_args)
 {
-	struct binsearch_args *args = void_args;
-
-	return args->key < args->arr[i];
+	struct integer_needle_lesseq_args *args = _args;
+	return args->needle <= args->haystack[i];
 }
 
 static void test_binsearch(void)
 {
-	int arr[] = { 2, 4, 6, 8, 10 };
-	size_t sz = ARRAY_SIZE(arr);
-	struct binsearch_args args = {
-		.arr = arr,
+	int haystack[] = { 2, 4, 6, 8, 10 };
+	struct {
+		int needle;
+		size_t expected_idx;
+	} testcases[] = {
+		{-9000, 0},
+		{-1, 0},
+		{0, 0},
+		{2, 0},
+		{3, 1},
+		{4, 1},
+		{7, 3},
+		{9, 4},
+		{10, 4},
+		{11, 5},
+		{9000, 5},
 	};
+	size_t i = 0;
 
-	int i = 0;
-	for (i = 1; i < 11; i++) {
-		size_t res;
-
-		args.key = i;
-		res = binsearch(sz, &binsearch_func, &args);
+	for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+		struct integer_needle_lesseq_args args = {
+			.haystack = haystack,
+			.needle = testcases[i].needle,
+		};
+		size_t idx;
 
-		if (res < sz) {
-			EXPECT(args.key < arr[res]);
-			if (res > 0)
-				EXPECT(args.key >= arr[res - 1]);
-		} else {
-			EXPECT(args.key == 10 || args.key == 11);
-		}
+		idx = binsearch(ARRAY_SIZE(haystack), &integer_needle_lesseq, &args);
+		EXPECT(idx == testcases[i].expected_idx);
 	}
 }
 
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 3/7] reftable/refname: refactor binary search over refnames
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
  2024-04-03  6:03   ` [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
@ 2024-04-03  6:04   ` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:04 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 3255 bytes --]

It is comparatively hard to understand how exactly the binary search
over refnames works given that the function and variable names are not
exactly easy to grasp. Rename them to make this more obvious. This
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/refname.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/reftable/refname.c b/reftable/refname.c
index 64eba1b886..bbfde15754 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -12,15 +12,15 @@
 #include "refname.h"
 #include "reftable-iterator.h"
 
-struct find_arg {
-	char **names;
-	const char *want;
+struct refname_needle_lesseq_args {
+	char **haystack;
+	const char *needle;
 };
 
-static int find_name(size_t k, void *arg)
+static int refname_needle_lesseq(size_t k, void *_args)
 {
-	struct find_arg *f_arg = arg;
-	return strcmp(f_arg->names[k], f_arg->want) >= 0;
+	struct refname_needle_lesseq_args *args = _args;
+	return strcmp(args->needle, args->haystack[k]) <= 0;
 }
 
 static int modification_has_ref(struct modification *mod, const char *name)
@@ -29,21 +29,21 @@ static int modification_has_ref(struct modification *mod, const char *name)
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = name,
+		struct refname_needle_lesseq_args args = {
+			.haystack = mod->add,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &args);
 		if (idx < mod->add_len && !strcmp(mod->add[idx], name))
 			return 0;
 	}
 
 	if (mod->del_len > 0) {
-		struct find_arg arg = {
-			.names = mod->del,
-			.want = name,
+		struct refname_needle_lesseq_args args = {
+			.haystack = mod->del,
+			.needle = name,
 		};
-		size_t idx = binsearch(mod->del_len, find_name, &arg);
+		size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &args);
 		if (idx < mod->del_len && !strcmp(mod->del[idx], name))
 			return 1;
 	}
@@ -71,11 +71,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 	int err = 0;
 
 	if (mod->add_len > 0) {
-		struct find_arg arg = {
-			.names = mod->add,
-			.want = prefix,
+		struct refname_needle_lesseq_args args = {
+			.haystack = mod->add,
+			.needle = prefix,
 		};
-		size_t idx = binsearch(mod->add_len, find_name, &arg);
+		size_t idx = binsearch(mod->add_len, refname_needle_lesseq, &args);
 		if (idx < mod->add_len &&
 		    !strncmp(prefix, mod->add[idx], strlen(prefix)))
 			goto done;
@@ -90,11 +90,11 @@ static int modification_has_ref_with_prefix(struct modification *mod,
 			goto done;
 
 		if (mod->del_len > 0) {
-			struct find_arg arg = {
-				.names = mod->del,
-				.want = ref.refname,
+			struct refname_needle_lesseq_args args = {
+				.haystack = mod->del,
+				.needle = ref.refname,
 			};
-			size_t idx = binsearch(mod->del_len, find_name, &arg);
+			size_t idx = binsearch(mod->del_len, refname_needle_lesseq, &args);
 			if (idx < mod->del_len &&
 			    !strcmp(ref.refname, mod->del[idx]))
 				continue;
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 4/7] reftable/block: refactor binary search over restart points
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2024-04-03  6:04   ` [PATCH v4 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
@ 2024-04-03  6:04   ` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:04 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 6002 bytes --]

When seeking a record in our block reader we perform a binary search
over the block's restart points so that we don't have to do a linear
scan over the whole block. The logic to do so is quite intricate though,
which makes it hard to understand.

Improve documentation and rename some of the functions and variables so
that the code becomes easier to understand overall. This refactoring
should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 100 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 27 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 422885bddb..57e3dbf658 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -273,34 +273,36 @@ void block_reader_start(struct block_reader *br, struct block_iter *it)
 	it->next_off = br->header_off + 4;
 }
 
-struct restart_find_args {
+struct restart_needle_less_args {
 	int error;
-	struct strbuf key;
-	struct block_reader *r;
+	struct strbuf needle;
+	struct block_reader *reader;
 };
 
-static int restart_key_less(size_t idx, void *args)
+static int restart_needle_less(size_t idx, void *_args)
 {
-	struct restart_find_args *a = args;
-	uint32_t off = block_reader_restart_offset(a->r, idx);
+	struct restart_needle_less_args *args = _args;
+	uint32_t off = block_reader_restart_offset(args->reader, idx);
 	struct string_view in = {
-		.buf = a->r->block.data + off,
-		.len = a->r->block_len - off,
+		.buf = args->reader->block.data + off,
+		.len = args->reader->block_len - off,
 	};
-
-	/* the restart key is verbatim in the block, so this could avoid the
-	   alloc for decoding the key */
-	struct strbuf rkey = STRBUF_INIT;
+	struct strbuf kth_restart_key = STRBUF_INIT;
 	uint8_t unused_extra;
-	int n = reftable_decode_key(&rkey, &unused_extra, in);
-	int result;
+	int result, n;
+
+	/*
+	 * TODO: The restart key is verbatim in the block, so we can in theory
+	 * avoid decoding the key and thus save some allocations.
+	 */
+	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
 	if (n < 0) {
-		a->error = 1;
+		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&a->key, &rkey);
-	strbuf_release(&rkey);
+	result = strbuf_cmp(&args->needle, &kth_restart_key);
+	strbuf_release(&kth_restart_key);
 	return result < 0;
 }
 
@@ -376,9 +378,9 @@ void block_iter_close(struct block_iter *it)
 int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		      struct strbuf *want)
 {
-	struct restart_find_args args = {
-		.key = *want,
-		.r = br,
+	struct restart_needle_less_args args = {
+		.needle = *want,
+		.reader = br,
 	};
 	struct block_iter next = BLOCK_ITER_INIT;
 	struct reftable_record rec;
@@ -390,7 +392,38 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		goto done;
 	}
 
-	i = binsearch(br->restart_count, &restart_key_less, &args);
+	/*
+	 * Perform a binary search over the block's restart points, which
+	 * avoids doing a linear scan over the whole block. Like this, we
+	 * identify the section of the block that should contain our key.
+	 *
+	 * Note that we explicitly search for the first restart point _greater_
+	 * than the sought-after record, not _greater or equal_ to it. In case
+	 * the sought-after record is located directly at the restart point we
+	 * would otherwise start doing the linear search at the preceding
+	 * restart point. While that works alright, we would end up scanning
+	 * too many record.
+	 */
+	i = binsearch(br->restart_count, &restart_needle_less, &args);
+
+	/*
+	 * Now there are multiple cases:
+	 *
+	 *   - `i == 0`: The wanted record is smaller than the record found at
+	 *     the first restart point. As the first restart point is the first
+	 *     record in the block, our wanted record cannot be located in this
+	 *     block at all. We still need to position the iterator so that the
+	 *     next call to `block_iter_next()` will yield an end-of-iterator
+	 *     signal.
+	 *
+	 *   - `i == restart_count`: The wanted record was not found at any of
+	 *     the restart points. As there is no restart point at the end of
+	 *     the section the record may thus be contained in the last block.
+	 *
+	 *   - `i > 0`: The wanted record must be contained in the section
+	 *     before the found restart point. We thus do a linear search
+	 *     starting from the preceding restart point.
+	 */
 	if (i > 0)
 		it->next_off = block_reader_restart_offset(br, i - 1);
 	else
@@ -399,21 +432,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 
 	reftable_record_init(&rec, block_reader_type(br));
 
-	/* We're looking for the last entry less/equal than the wanted key, so
-	   we have to go one entry too far and then back up.
-	*/
+	/*
+	 * We're looking for the last entry less than the wanted key so that
+	 * the next call to `block_reader_next()` would yield the wanted
+	 * record. We thus don't want to position our reader at the sought
+	 * after record, but one before. To do so, we have to go one entry too
+	 * far and then back up.
+	 */
 	while (1) {
 		block_iter_copy_from(&next, it);
 		err = block_iter_next(&next, &rec);
 		if (err < 0)
 			goto done;
-
-		reftable_record_key(&rec, &it->last_key);
-		if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
+		if (err > 0) {
 			err = 0;
 			goto done;
 		}
 
+		/*
+		 * Check whether the current key is greater or equal to the
+		 * sought-after key. In case it is greater we know that the
+		 * record does not exist in the block and can thus abort early.
+		 * In case it is equal to the sought-after key we have found
+		 * the desired record.
+		 */
+		reftable_record_key(&rec, &it->last_key);
+		if (strbuf_cmp(&it->last_key, want) >= 0)
+			goto done;
+
 		block_iter_copy_from(it, &next);
 	}
 
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 5/7] reftable/block: fix error handling when searching restart points
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2024-04-03  6:04   ` [PATCH v4 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
@ 2024-04-03  6:04   ` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:04 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

When doing the binary search over restart points in a block we need to
decode the record keys. This decoding step can result in an error when
the block is corrupted, which we indicate to the caller of the binary
search by setting `args.error = 1`. But the only caller that exists
mishandles this because it in fact performs the error check before
calling `binsearch()`.

Fix this bug by checking for errors at the right point in time.
Furthermore, refactor `binsearch()` so that it aborts the search in case
the callback function returns a negative value so that we don't
needlessly continue to search the block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c | 5 ++++-
 reftable/basics.h | 5 +++--
 reftable/block.c  | 9 ++++-----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index 2c5f34b39e..fea711db7e 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -39,8 +39,11 @@ size_t binsearch(size_t sz, int (*f)(size_t k, void *args), void *args)
 	 */
 	while (hi - lo > 1) {
 		size_t mid = lo + (hi - lo) / 2;
+		int ret = f(mid, args);
+		if (ret < 0)
+			return sz;
 
-		if (f(mid, args))
+		if (ret > 0)
 			hi = mid;
 		else
 			lo = mid;
diff --git a/reftable/basics.h b/reftable/basics.h
index 2672520e76..523ecd5307 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -22,8 +22,9 @@ uint32_t get_be24(uint8_t *in);
 void put_be16(uint8_t *out, uint16_t i);
 
 /*
- * find smallest index i in [0, sz) at which f(i) is true, assuming
- * that f is ascending. Return sz if f(i) is false for all indices.
+ * find smallest index i in [0, sz) at which `f(i) > 0`, assuming that f is
+ * ascending. Return sz if `f(i) == 0` for all indices. The search is aborted
+ * and `sz` is returned in case `f(i) < 0`.
  *
  * Contrary to bsearch(3), this returns something useful if the argument is not
  * found.
diff --git a/reftable/block.c b/reftable/block.c
index 57e3dbf658..ca217636ae 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -387,11 +387,6 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	int err = 0;
 	size_t i;
 
-	if (args.error) {
-		err = REFTABLE_FORMAT_ERROR;
-		goto done;
-	}
-
 	/*
 	 * Perform a binary search over the block's restart points, which
 	 * avoids doing a linear scan over the whole block. Like this, we
@@ -405,6 +400,10 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	 * too many record.
 	 */
 	i = binsearch(br->restart_count, &restart_needle_less, &args);
+	if (args.error) {
+		err = REFTABLE_FORMAT_ERROR;
+		goto done;
+	}
 
 	/*
 	 * Now there are multiple cases:
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 6/7] reftable/record: extract function to decode key lengths
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2024-04-03  6:04   ` [PATCH v4 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
@ 2024-04-03  6:04   ` Patrick Steinhardt
  2024-04-03  6:04   ` [PATCH v4 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:04 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2614 bytes --]

We're about to refactor the binary search over restart points so that it
does not need to fully decode the record keys anymore. To do so we will
need to decode the record key lengths, which is non-trivial logic.

Extract the logic to decode these lengths from `refatble_decode_key()`
so that we can reuse it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/record.c | 34 +++++++++++++++++++++++++---------
 reftable/record.h |  6 ++++++
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/reftable/record.c b/reftable/record.c
index 23b497adab..5506f3e913 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -159,26 +159,42 @@ int reftable_encode_key(int *restart, struct string_view dest,
 	return start.len - dest.len;
 }
 
-int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
-			struct string_view in)
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra)
 {
-	int start_len = in.len;
-	uint64_t prefix_len = 0;
-	uint64_t suffix_len = 0;
+	size_t start_len = in.len;
 	int n;
 
-	n = get_var_int(&prefix_len, &in);
+	n = get_var_int(prefix_len, &in);
 	if (n < 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	n = get_var_int(&suffix_len, &in);
+	n = get_var_int(suffix_len, &in);
 	if (n <= 0)
 		return -1;
 	string_view_consume(&in, n);
 
-	*extra = (uint8_t)(suffix_len & 0x7);
-	suffix_len >>= 3;
+	*extra = (uint8_t)(*suffix_len & 0x7);
+	*suffix_len >>= 3;
+
+	return start_len - in.len;
+}
+
+int reftable_decode_key(struct strbuf *last_key, uint8_t *extra,
+			struct string_view in)
+{
+	int start_len = in.len;
+	uint64_t prefix_len = 0;
+	uint64_t suffix_len = 0;
+	int n;
+
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, extra);
+	if (n < 0)
+		return -1;
+	string_view_consume(&in, n);
 
 	if (in.len < suffix_len ||
 	    prefix_len > last_key->len)
diff --git a/reftable/record.h b/reftable/record.h
index 826ee1c55c..d778133e6e 100644
--- a/reftable/record.h
+++ b/reftable/record.h
@@ -86,6 +86,12 @@ int reftable_encode_key(int *is_restart, struct string_view dest,
 			struct strbuf prev_key, struct strbuf key,
 			uint8_t extra);
 
+/* Decode a record's key lengths. */
+int reftable_decode_keylen(struct string_view in,
+			   uint64_t *prefix_len,
+			   uint64_t *suffix_len,
+			   uint8_t *extra);
+
 /*
  * Decode into `last_key` and `extra` from `in`. `last_key` is expected to
  * contain the decoded key of the preceding record, if any.
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 7/7] reftable/block: avoid decoding keys when searching restart points
  2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2024-04-03  6:04   ` [PATCH v4 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
@ 2024-04-03  6:04   ` Patrick Steinhardt
  6 siblings, 0 replies; 45+ messages in thread
From: Patrick Steinhardt @ 2024-04-03  6:04 UTC (permalink / raw
  To: git; +Cc: Justin Tobler

[-- Attachment #1: Type: text/plain, Size: 2061 bytes --]

When searching over restart points in a block we decode the key of each
of the records, which results in a memory allocation. This is quite
pointless though given that records it restart points will never use
prefix compression and thus store their keys verbatim in the block.

Refactor the code so that we can avoid decoding the keys, which saves us
some allocations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index ca217636ae..298e8c56b9 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -287,23 +287,32 @@ static int restart_needle_less(size_t idx, void *_args)
 		.buf = args->reader->block.data + off,
 		.len = args->reader->block_len - off,
 	};
-	struct strbuf kth_restart_key = STRBUF_INIT;
-	uint8_t unused_extra;
-	int result, n;
+	uint64_t prefix_len, suffix_len;
+	uint8_t extra;
+	int n;
 
 	/*
-	 * TODO: The restart key is verbatim in the block, so we can in theory
-	 * avoid decoding the key and thus save some allocations.
+	 * Records at restart points are stored without prefix compression, so
+	 * there is no need to fully decode the record key here. This removes
+	 * the need for allocating memory.
 	 */
-	n = reftable_decode_key(&kth_restart_key, &unused_extra, in);
-	if (n < 0) {
+	n = reftable_decode_keylen(in, &prefix_len, &suffix_len, &extra);
+	if (n < 0 || prefix_len) {
 		args->error = 1;
 		return -1;
 	}
 
-	result = strbuf_cmp(&args->needle, &kth_restart_key);
-	strbuf_release(&kth_restart_key);
-	return result < 0;
+	string_view_consume(&in, n);
+	if (suffix_len > in.len) {
+		args->error = 1;
+		return -1;
+	}
+
+	n = memcmp(args->needle.buf, in.buf,
+		   args->needle.len < suffix_len ? args->needle.len : suffix_len);
+	if (n)
+		return n < 0;
+	return args->needle.len < suffix_len;
 }
 
 void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2024-04-03  6:04 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-03-22 18:46   ` Justin Tobler
2024-03-25 10:07     ` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-03-22 18:55   ` Justin Tobler
2024-03-25 10:07     ` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-03-25 10:10   ` [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-03-25 10:10   ` [PATCH v2 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-03-25 10:10   ` [PATCH v2 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-02 16:27     ` Justin Tobler
2024-04-02 17:15       ` Patrick Steinhardt
2024-03-25 10:10   ` [PATCH v2 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-02 16:42     ` Justin Tobler
2024-04-02 17:15       ` Patrick Steinhardt
2024-04-02 17:46         ` Justin Tobler
2024-04-03  6:01           ` Patrick Steinhardt
2024-03-25 10:10   ` [PATCH v2 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-03-25 10:10   ` [PATCH v2 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-03-25 10:11   ` [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-04-02 16:47     ` Justin Tobler
2024-04-02 17:15       ` Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-04-02 17:24   ` [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-04-02 17:24   ` [PATCH v3 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-04-02 17:24   ` [PATCH v3 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-02 17:24   ` [PATCH v3 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-02 17:24   ` [PATCH v3 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-04-02 17:25   ` [PATCH v3 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-04-02 17:25   ` [PATCH v3 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-04-02 17:49   ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Justin Tobler
2024-04-03  6:03 ` [PATCH v4 " Patrick Steinhardt
2024-04-03  6:03   ` [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-04-03  6:04   ` [PATCH v4 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-04-03  6:04   ` [PATCH v4 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-03  6:04   ` [PATCH v4 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-03  6:04   ` [PATCH v4 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-04-03  6:04   ` [PATCH v4 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-04-03  6:04   ` [PATCH v4 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).