On Mon, May 08, 2023 at 06:00:08PM -0400, Taylor Blau wrote:
> When iterating through the `packed-refs` file in order to answer a query
> like:
> 
>     $ git for-each-ref --exclude=refs/__hidden__
> 
> it would be useful to avoid walking over all of the entries in
> `refs/__hidden__/*` when possible, since we know that the ref-filter
> code is going to throw them away anyways.
> 
> In certain circumstances, doing so is possible. The algorithm for doing
> so is as follows:
> 
>   - For each excluded pattern, find the first record that matches it,
>     and the first pattern that *doesn't* match it (i.e. the location
>     you'd next want to consider when excluding that pattern).
> 
>   - Sort the patterns by their starting location within the
>     `packed-refs` file.
> 
>   - Construct a skip list of regions by combining adjacent and
>     overlapping regions from the previous step.
> 
>   - When iterating through the `packed-refs` file, if `iter->pos` is
>     ever contained in one of the regions from the previous steps,
>     advance `iter->pos` past the end of that region, and continue
>     enumeration.
> 
> Note that this optimization is only possible when none of the excluded
> pattern(s) have special meta-characters in them. To see why this is the
> case, consider the exclusion pattern "refs/foo[a]". In general, in order
> to find the location of the first record that matches this pattern, we
> could only consider up to the first meta-character, "refs/foo". But this
> doesn't work, since the excluded region we'd come up with would include
> "refs/foobar", even though it is not excluded.

Is this generally true though? A naive implementation would iterate
through all references and find the first reference that matches the
exclusion regular exepression. From thereon we continue to iterate until
we find the first entry that doesn't match. This may cause us to end up
with a suboptimal skip list, but the skip list would still be valid.

As I said, this implementation would be naive as we're now forced to
iterate through all references starting at the beginning. I assume that
your implementation will instead use a binary search to locate the first
entry that matches the exclusion pattern and the last pattern. But the
way this paragraph is formulated makes it sound like this is a general
fact, even though it is a fact that derives from the implementation.

I of course don't propose to change the algorithm here, but instead to
clarify where this restriction actually comes from and why the tradeoff
makes sense.

[snip]
> @@ -929,6 +972,108 @@ static struct ref_iterator_vtable packed_ref_iterator_vtable = {
>  	.abort = packed_ref_iterator_abort
>  };
>  
> +static int skip_list_entry_cmp(const void *va, const void *vb)
> +{
> +	const struct skip_list_entry *a = va;
> +	const struct skip_list_entry *b = vb;
> +
> +	if (a->start < b->start)
> +		return -1;
> +	if (a->start > b->start)
> +		return 1;
> +	return 0;
> +}
> +
> +static int has_glob_special(const char *str)
> +{
> +	const char *p;
> +	for (p = str; *p; p++) {
> +		if (is_glob_special(*p))
> +			return 1;
> +	}
> +	return 0;
> +}
> +
> +static const char *ptr_max(const char *x, const char *y)
> +{
> +	if (x > y)
> +		return x;
> +	return y;
> +}
> +
> +static void populate_excluded_skip_list(struct packed_ref_iterator *iter,
> +					struct snapshot *snapshot,
> +					const char **excluded_patterns)
> +{
> +	size_t i, j;
> +	const char **pattern;
> +
> +	if (!excluded_patterns)
> +		return;
> +
> +	for (pattern = excluded_patterns; *pattern; pattern++) {
> +		struct skip_list_entry *e;
> +
> +		/*
> +		 * We can't feed any excludes with globs in them to the
> +		 * refs machinery.  It only understands prefix matching.
> +		 * We likewise can't even feed the string leading up to
> +		 * the first meta-character, as something like "foo[a]"
> +		 * should not exclude "foobar" (but the prefix "foo"
> +		 * would match that and mark it for exclusion).
> +		 */
> +		if (has_glob_special(*pattern))
> +			continue;
> +
> +		ALLOC_GROW(iter->skip, iter->skip_nr + 1, iter->skip_alloc);
> +
> +		e = &iter->skip[iter->skip_nr++];
> +		e->start = find_reference_location(snapshot, *pattern, 0);
> +		e->end = find_reference_location_end(snapshot, *pattern, 0);

One thing that makes this hard to reason about is that we don't
explicitly handle the case where the pattern doesn't match at all. So
you require a bunch of knowledge about what exactly the functions
`find_reference_location()` and `find_reference_location_end()` do in
that case in order to determine whether we will end up doing the right
thing.

Explicitly handling this would give us some benefits:

- It makes the code more obvious.

- We can avoid an extra skip list entry for every non-matching
  pattern.

- We wouldn't have to perform binary search over the snapshot twice.

Might be I'm missing something though.

> +	}
> +
> +	if (!iter->skip_nr) {
> +		/*
> +		 * Every entry in exclude_patterns has a meta-character,
> +		 * nothing to do here.
> +		 */
> +		return;
> +	}
> +
> +	QSORT(iter->skip, iter->skip_nr, skip_list_entry_cmp);
> +
> +	/*
> +	 * As an optimization, merge adjacent entries in the skip list
> +	 * to jump forwards as far as possible when entering a skipped
> +	 * region.
> +	 *
> +	 * For example, if we have two skipped regions:
> +	 *
> +	 *	[[A, B], [B, C]]
> +	 *
> +	 * we want to combine that into a single entry jumping from A to
> +	 * C.
> +	 */
> +	for (i = 1, j = 1; i < iter->skip_nr; i++) {
> +		struct skip_list_entry *ours = &iter->skip[i];
> +		struct skip_list_entry *prev = &iter->skip[i - 1];
> +
> +		if (ours->start == ours->end) {
> +			/* ignore empty regions (no matching entries) */
> +			continue;
> +		} else if (prev->end >= ours->start) {
> +			/* overlapping regions extend the previous one */
> +			prev->end = ptr_max(prev->end, ours->end);
> +		} else {
> +			/* otherwise, insert a new region */
> +			iter->skip[j++] = *ours;
> +		}
> +	}

Mh. Does this do the right thing in case we have multiple consecutive
overlapping skip list entries? We always end up adjusting the immediate
predecessor as we use `i - 1` to find `prev`. Shouldn't we instead start
with `j = 0` and assign `prev = &iter->skip[j]`?

[snip]
> diff --git a/t/t1419-exclude-refs.sh b/t/t1419-exclude-refs.sh
> new file mode 100755
> index 0000000000..da5265a5a8
> --- /dev/null
> +++ b/t/t1419-exclude-refs.sh
> @@ -0,0 +1,101 @@
> +#!/bin/sh
> +
> +test_description='test exclude_patterns functionality in main ref store'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +TEST_PASSES_SANITIZE_LEAK=true
> +. ./test-lib.sh
> +
> +for_each_ref__exclude () {
> +	test-tool ref-store main for-each-ref--exclude "$@" >actual.raw
> +	cut -d ' ' -f 2 actual.raw
> +}
> +
> +for_each_ref () {
> +	git for-each-ref --format='%(refname)' "$@"
> +}
> +
> +test_expect_success 'setup' '
> +	test_commit --no-tag base &&
> +	base="$(git rev-parse HEAD)" &&
> +
> +	for name in foo bar baz quux
> +	do
> +		for i in 1 2 3
> +		do
> +			echo "create refs/heads/$name/$i $base" || return 1
> +		done || return 1
> +	done >in &&
> +	echo "delete refs/heads/main" >>in &&
> +
> +	git update-ref --stdin <in &&
> +	git pack-refs --all
> +'
> +
> +test_expect_success 'for_each_ref__exclude(refs/heads/foo/)' '
> +	# region in middle
> +	for_each_ref__exclude refs/heads refs/heads/foo >actual &&
> +	for_each_ref refs/heads/bar refs/heads/baz refs/heads/quux >expect &&
> +
> +	test_cmp expect actual
> +'

Nit: it might be a bit more readable if we put the comment into the test
description instead of having an opaque description that mentions ref
names that don't have much of a meaning without reading the test itself.

Patrick