Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Changed path filter hash fix and version bump
@ 2023-05-22 21:48 Jonathan Tan
  2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
                   ` (8 more replies)
  0 siblings, 9 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, me

Following the conversation in [1], here are patches to fix the murmur3
hash function used in creating (and interpreting) changed path filters,
and also to bump the version number to 2.

This is I think the simplest way to do this (invalidating all existing
changed path filters). The resource-consuming part of creating a changed
path filter is in computing the changed paths (thus, reading trees and
calculating changes), and to check if a changed path filter could be
reused, one would need to compute the changed paths anyway in order to
determine if any of them have high-bit strings, so I did not pursue this
further. Server operators might be able to reuse changed path filters
if, for example, they have a more efficient way to determine that no
paths in a repo have the high bit set, but I think that this is out of
scope for the Git project.

In patch 2, I couldn't figure out how to make Bash pass high-bit strings
as a CLI argument for some reason, so I hardcoded the string I wanted
in the test helper instead. If anyone knows how to pass such strings,
please let me know.

[1] https://lore.kernel.org/git/20230511224101.972442-1-jonathantanmy@google.com/

Jonathan Tan (2):
  t4216: test wrong bloom filter version rejection
  commit-graph: fix murmur3, bump filter ver. to 2

 bloom.c               | 14 +++++++-------
 bloom.h               |  9 ++++++---
 commit-graph.c        |  4 ++--
 t/helper/test-bloom.c |  7 +++++++
 t/t0095-bloom.sh      |  8 ++++++++
 t/t4216-log-bloom.sh  | 36 +++++++++++++++++++++++++++++++++---
 6 files changed, 63 insertions(+), 15 deletions(-)

-- 
2.40.1.698.g37aff9b760-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 1/2] t4216: test wrong bloom filter version rejection
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
@ 2023-05-22 21:48 ` Jonathan Tan
  2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, me

Add a test that checks that Git does not make use of changed path
filters that have an unrecognized version.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t4216-log-bloom.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..f14cc1c1f1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -85,6 +85,36 @@ test_bloom_filters_not_used () {
 	test_cmp log_wo_bloom log_w_bloom
 }
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+test_expect_success 'incompatible bloom filter versions are not used' '
+	cp .git/objects/info/commit-graph old-commit-graph &&
+	test_when_finished "mv old-commit-graph .git/objects/info/commit-graph" &&
+
+	BDAT_OFFSET=$(get_bdat_offset) &&
+
+	# Write an arbitrary number to the least significant byte of the
+	# version field in the BDAT chunk
+	cat old-commit-graph >new-commit-graph &&
+	printf "\aa" |
+		dd of=new-commit-graph bs=1 count=1 \
+			seek=$((BDAT_OFFSET + 3)) conv=notrunc &&
+	mv new-commit-graph .git/objects/info/commit-graph &&
+	test_bloom_filters_not_used "-- A" &&
+
+	# But the correct version number works
+	cat old-commit-graph >new-commit-graph &&
+	printf "\01" |
+		dd of=new-commit-graph bs=1 count=1 \
+			seek=$((BDAT_OFFSET + 3)) conv=notrunc &&
+	mv new-commit-graph .git/objects/info/commit-graph &&
+	test_bloom_filters_used "-- A"
+'
+
 for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5 file5_renamed file_to_be_deleted
 do
 	for option in "" \
-- 
2.40.1.698.g37aff9b760-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
  2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
@ 2023-05-22 21:48 ` Jonathan Tan
  2023-05-23 13:00   ` Derrick Stolee
  2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-05-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, me

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, fix this bug. And because changed path filters on disk might
no longer be compatible, teach Git to write "2" as the version when
writing changed path filters (instead of "1" currently), and only accept
"2" as the version when reading them (instead of "1" currently).

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter. So this
patch does not include any mechanism to "salvage" changed path filters
from repositories.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in the test was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 bloom.c               | 14 +++++++-------
 bloom.h               |  9 ++++++---
 commit-graph.c        |  4 ++--
 t/helper/test-bloom.c |  7 +++++++
 t/t0095-bloom.sh      |  8 ++++++++
 t/t4216-log-bloom.sh  |  8 ++++----
 6 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index aef6b5fea2..fec243b2f1 100644
--- a/bloom.c
+++ b/bloom.c
@@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 
 	uint32_t k;
 	for (i = 0; i < len4; i++) {
-		uint32_t byte1 = (uint32_t)data[4*i];
-		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
-		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
-		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
 		k = byte1 | byte2 | byte3 | byte4;
 		k *= c1;
 		k = rotate_left(k, r1);
@@ -99,13 +99,13 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 
 	switch (len & (sizeof(uint32_t) - 1)) {
 	case 3:
-		k1 ^= ((uint32_t)tail[2]) << 16;
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
 		/*-fallthrough*/
 	case 2:
-		k1 ^= ((uint32_t)tail[1]) << 8;
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
 		/*-fallthrough*/
 	case 1:
-		k1 ^= ((uint32_t)tail[0]) << 0;
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
 		k1 *= c1;
 		k1 = rotate_left(k1, r1);
 		k1 *= c2;
diff --git a/bloom.h b/bloom.h
index adde6dfe21..8526fa948c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * We currently only support version = 2 which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -38,8 +40,9 @@ struct bloom_filter_settings {
 	uint32_t max_changed_paths;
 };
 
+#define BLOOM_HASH_VERSION 2
 #define DEFAULT_BLOOM_MAX_CHANGES 512
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
+#define DEFAULT_BLOOM_FILTER_SETTINGS { BLOOM_HASH_VERSION, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index 843bdb458d..2eb9b781f4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
+	if (hash_version != BLOOM_HASH_VERSION)
 		return 0;
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
@@ -2402,7 +2402,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index aabe31d724..5624e4d2db 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -68,6 +69,12 @@ int cmd__bloom(int argc, const char **argv)
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
 	if (!strcmp(argv[1], "generate_filter")) {
 		struct bloom_filter filter;
 		int i = 2;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f14cc1c1f1..7a193aa143 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7) read_generation_data
+	options: bloom(2,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
@@ -108,7 +108,7 @@ test_expect_success 'incompatible bloom filter versions are not used' '
 
 	# But the correct version number works
 	cat old-commit-graph >new-commit-graph &&
-	printf "\01" |
+	printf "\02" |
 		dd of=new-commit-graph bs=1 count=1 \
 			seek=$((BDAT_OFFSET + 3)) conv=notrunc &&
 	mv new-commit-graph .git/objects/info/commit-graph &&
@@ -209,10 +209,10 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
+	grep "{\"hash_version\":2,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
+	grep "{\"hash_version\":2,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
 '
 
 test_max_changed_paths () {
-- 
2.40.1.698.g37aff9b760-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 0/2] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
  2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
  2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
@ 2023-05-23  4:42 ` Junio C Hamano
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-05-23  4:42 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, me

Jonathan Tan <jonathantanmy@google.com> writes:

> Following the conversation in [1], here are patches to fix the murmur3
> hash function used in creating (and interpreting) changed path filters,
> and also to bump the version number to 2.

Wonderful.  Thanks for a quick update.  Will take a look when I come
back to the keyboard (I'm on half-vacation right now).

>
> This is I think the simplest way to do this (invalidating all existing
> changed path filters). The resource-consuming part of creating a changed
> path filter is in computing the changed paths (thus, reading trees and
> calculating changes), and to check if a changed path filter could be
> reused, one would need to compute the changed paths anyway in order to
> determine if any of them have high-bit strings, so I did not pursue this
> further. Server operators might be able to reuse changed path filters
> if, for example, they have a more efficient way to determine that no
> paths in a repo have the high bit set, but I think that this is out of
> scope for the Git project.
>
> In patch 2, I couldn't figure out how to make Bash pass high-bit strings
> as a CLI argument for some reason, so I hardcoded the string I wanted
> in the test helper instead. If anyone knows how to pass such strings,
> please let me know.
>
> [1] https://lore.kernel.org/git/20230511224101.972442-1-jonathantanmy@google.com/
>
> Jonathan Tan (2):
>   t4216: test wrong bloom filter version rejection
>   commit-graph: fix murmur3, bump filter ver. to 2
>
>  bloom.c               | 14 +++++++-------
>  bloom.h               |  9 ++++++---
>  commit-graph.c        |  4 ++--
>  t/helper/test-bloom.c |  7 +++++++
>  t/t0095-bloom.sh      |  8 ++++++++
>  t/t4216-log-bloom.sh  | 36 +++++++++++++++++++++++++++++++++---
>  6 files changed, 63 insertions(+), 15 deletions(-)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
@ 2023-05-23 13:00   ` Derrick Stolee
  2023-05-23 23:00     ` Jonathan Tan
  2023-05-23 23:51     ` Junio C Hamano
  0 siblings, 2 replies; 116+ messages in thread
From: Derrick Stolee @ 2023-05-23 13:00 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: me

On 5/22/2023 5:48 PM, Jonathan Tan wrote:
> The murmur3 implementation in bloom.c has a bug when converting series
> of 4 bytes into network-order integers when char is signed (which is
> controllable by a compiler option, and the default signedness of char is
> platform-specific). When a string contains characters with the high bit
> set, this bug causes results that, although internally consistent within
> Git, does not accord with other implementations of murmur3 and even with
> Git binaries that were compiled with different signedness of char. This
> bug affects both how Git writes changed path filters to disk and how Git
> interprets changed path filters on disk.
> 
> Therefore, fix this bug. And because changed path filters on disk might
> no longer be compatible, teach Git to write "2" as the version when
> writing changed path filters (instead of "1" currently), and only accept
> "2" as the version when reading them (instead of "1" currently).

I appreciate that you discovered and are presenting a way out of this
problem, however the current approach does not preserve compatibility
enough.

> @@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
>  
>  	uint32_t k;
>  	for (i = 0; i < len4; i++) {
> -		uint32_t byte1 = (uint32_t)data[4*i];
> -		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
> -		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
> -		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
> +		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
> +		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
> +		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
> +		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
>  		k = byte1 | byte2 | byte3 | byte4;
>  		k *= c1;
>  		k = rotate_left(k, r1);

By changing this algorithm directly (instead of making an "unsigned" version,
or renaming this one to the "maybe signed" version), you are making it
impossible for us to ship a version that can read version 1 Bloom filters,
so all read-only history operations will immediately slow down (because they
will ignore v1 chunks, better than incorrectly parsing v1 chunks).

> @@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
>  	g->chunk_bloom_data = chunk_start;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (hash_version != 1)
> +	if (hash_version != BLOOM_HASH_VERSION)
>  		return 0;
  
Here's where we would ignore v1 filters, instead of continuing to read them
(with all the risks involved).

In order for this to be something we can ship safely to environments that depend
on changed-path Bloom filters, we need to be able to parse v1 filters. It would
be even better if we didn't write v2 filters by default, but instead hid it
behind a config option that is off by default for at least one major release.

I notice that you didn't update the commit-graph format docs, which seems like
a valuable place to describe the new version number, as well as any plans to
completely deprecate v1. For instance, describing the v1 implementation as
having inconsistent application of murmur3 is a valuable thing to have, but
then describe the plans for deprecating it as an unsafe format.

Here is a potential plan to consider:

 1. v2.42.0 includes writing v2 format, off by default.
 2. v2.43.0 writes v2 format by default.
 3. v2.44.0 no longer parses v1 format (ignored without error).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-23 13:00   ` Derrick Stolee
@ 2023-05-23 23:00     ` Jonathan Tan
  2023-05-23 23:51     ` Junio C Hamano
  1 sibling, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-23 23:00 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, git, me

Derrick Stolee <derrickstolee@github.com> writes:
> I notice that you didn't update the commit-graph format docs,

Ah, thanks for the reminder.

> Here is a potential plan to consider:
> 
>  1. v2.42.0 includes writing v2 format, off by default.
>  2. v2.43.0 writes v2 format by default.
>  3. v2.44.0 no longer parses v1 format (ignored without error).

First of all, thanks for your comments on the migration process - that
is indeed the most complicated part of this.

The code change to support 2 versions seems not too hard (duplicate
murmur3_seeded() and modify one so that we have one version 1 and
one version 2, and teach fill_bloom_key() to call the appropriate one
based on the struct bloom_filter_settings) but this requires both the
code author and reviewer(s) to check that we don't have cases in which
we read or write one version when we're supposed to do it with the
other. And the benefit of doing this seems to just be giving server
administrators an opportunity to perform the migration at a more relaxed
pace, which I think there are other ways to accomplish if we really
wanted to do this, so I wanted to avoid having 2 versions in the Git
codebase.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-23 13:00   ` Derrick Stolee
  2023-05-23 23:00     ` Jonathan Tan
@ 2023-05-23 23:51     ` Junio C Hamano
  2023-05-24 21:26       ` Jonathan Tan
  1 sibling, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-05-23 23:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, git, me

Derrick Stolee <derrickstolee@github.com> writes:

> I appreciate that you discovered and are presenting a way out of this
> problem, however the current approach does not preserve compatibility
> enough.
> ...
> By changing this algorithm directly (instead of making an "unsigned" version,
> or renaming this one to the "maybe signed" version), you are making it
> impossible for us to ship a version that can read version 1 Bloom filters,
> so all read-only history operations will immediately slow down (because they
> will ignore v1 chunks, better than incorrectly parsing v1 chunks).
>
> Here's where we would ignore v1 filters, instead of continuing to read them
> (with all the risks involved).

I do not know the "all the risks involved" comment.  Is the risk
something we can mitigate by still reading v1 data but be careful
about when not to apply the filters?

I may be misremembering the original discussion, but wasn't the
conclusion that v1 data is salvageable in the sense that it can
still reliably say that, given a pathname without bytes with
high-bit set, it cannot possibly belong to the set of changed paths,
even though, because the filter is contaminated with "signed" data,
its false-positive rate may be higher than using "unsigned" version?
And based on that observation, we can still read v1 data but only
use the Bloom filters when the queried paths have no byte with
high-bit set.

Also if we are operating in such an environment then would it be
possible to first compute as if we were going to generate v2 data,
but write it as v1 after reading all the path and realizing there
are no problematic paths?  IOW, even if the version of Git is
capable of writing and reading v2, it does not have to use v2,
right?  To put it the other way around, we will have to and can keep
supporting data that is labeled as v1, no?

> In order for this to be something we can ship safely to environments that depend
> on changed-path Bloom filters, we need to be able to parse v1 filters. It would
> be even better if we didn't write v2 filters by default, but instead hid it
> behind a config option that is off by default for at least one major release.

Is the concern that we will double the chunk size because both v1
and v2 will be written?  Or is it that we will stop writing v1 if we
start writing v2 and switching too early will mean the repositories
will become slower for older implementations that haven't died out?

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-23 23:51     ` Junio C Hamano
@ 2023-05-24 21:26       ` Jonathan Tan
  2023-05-26 13:19         ` Derrick Stolee
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-05-24 21:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Derrick Stolee, git, me

Junio C Hamano <gitster@pobox.com> writes:
> I may be misremembering the original discussion, but wasn't the
> conclusion that v1 data is salvageable in the sense that it can
> still reliably say that, given a pathname without bytes with
> high-bit set, it cannot possibly belong to the set of changed paths,
> even though, because the filter is contaminated with "signed" data,
> its false-positive rate may be higher than using "unsigned" version?
> And based on that observation, we can still read v1 data but only
> use the Bloom filters when the queried paths have no byte with
> high-bit set.

There are at least 3 ways of salvaging the data that we've discussed:

- Enumerating all of a repo's paths and if none of them have a high bit,
  retain the existing filters.
- Walking all of a repo's trees (so that we know which tree corresponds
  to which commit) and if for a commit, all its trees have no high bit,
  retain the filter for that tree (otherwise recompute it).
- Keep using a version 1 filter but only when the sought-for path has no
  high bit (as you describe here).

(The first 2 is my interpretation of what Taylor described [1].)

I'm not sure if we want to keep version 1 filters around at all -
this would work with Git as long as it is not compiled with different
signedness of char, but would not work with other implementations of
Git (unless they replicate the hashing bug). There is also the issue of
how we're going to indicate that in a commit graph file, some filters
are version 1 and some filters are version 2 (unless the plan is to
completely rewrite the filters in this case, but then we'll run into
the issue that computing these filters en-masse is expensive, as Taylor
describes also in [1]).

> Also if we are operating in such an environment then would it be
> possible to first compute as if we were going to generate v2 data,
> but write it as v1 after reading all the path and realizing there
> are no problematic paths?  

I think in this case, we would want to write it as v2 anyway, because
there's no way to distinguish a v1 that has high bits and is written
incorrectly versus a v1 that happens to have no high bits and therefore
is identical under v2.

> IOW, even if the version of Git is
> capable of writing and reading v2, it does not have to use v2,
> right?  To put it the other way around, we will have to and can keep
> supporting data that is labeled as v1, no?

I think this is the main point - whether we want to continue supporting
data labeled as v1. I personally think that we should migrate to v2
as quickly as possible. But if the consensus is that we should support
both, at least for a few releases of Git, I'll go with that.

[1] https://lore.kernel.org/git/ZF116EDcmAy7XEbC@nand.local/

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-24 21:26       ` Jonathan Tan
@ 2023-05-26 13:19         ` Derrick Stolee
  2023-05-30 17:26           ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Derrick Stolee @ 2023-05-26 13:19 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, me

On 5/24/2023 5:26 PM, Jonathan Tan wrote:
> Junio C Hamano <gitster@pobox.com> writes:

>> IOW, even if the version of Git is
>> capable of writing and reading v2, it does not have to use v2,
>> right?  To put it the other way around, we will have to and can keep
>> supporting data that is labeled as v1, no?
> 
> I think this is the main point - whether we want to continue supporting
> data labeled as v1. I personally think that we should migrate to v2
> as quickly as possible. But if the consensus is that we should support
> both, at least for a few releases of Git, I'll go with that.

I agree on migrating quickly as possible, within basic safety guidelines.

Shipping a Git change that suddenly is unable to use on-disk data that
it has previously relied upon is not a safe change. And that is the
absolute minimum amount of safety required. The other side is to not
make a Git change that suddenly changes the on-disk format without a
switch to disable it.

Think about it this way: if there was a bug in the code, could we
safely roll it back? If we are immediately writing v2 filters after
the deployment, then rolling it back will cause the previous version
to not recognize those filters, leading to a delayed recovery.

I'd be willing to modify my suggested steps:

>>> 1. v2.42.0 includes writing v2 format, off by default.
>>> 2. v2.43.0 writes v2 format by default.
>>> 3. v2.44.0 no longer parses v1 format (ignored without error).

to something simpler:

 1. v2.42.0 writes v2 format by default, but can be disabled by config.
 2. v2.43.0 no longer parses or writes v1 format.

With this, we could proactively set the config value that disables the
v2 format in our production environment, then slowly re-enable that
config after the binaries have deployed. This allows us to limit the
blast radius if something goes wrong, which is really important.

Further, I'm describing an environment where we control all of the Git
versions that are interacting with the repositories. Other environments
don't have that luxury, such as typical client users.

Even the three-version plan is an accelerated deprecation plan, based
on previous examples in Git.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-26 13:19         ` Derrick Stolee
@ 2023-05-30 17:26           ` Jonathan Tan
  0 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-30 17:26 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, Junio C Hamano, git, me

Derrick Stolee <derrickstolee@github.com> writes:
> I'd be willing to modify my suggested steps:
> 
> >>> 1. v2.42.0 includes writing v2 format, off by default.
> >>> 2. v2.43.0 writes v2 format by default.
> >>> 3. v2.44.0 no longer parses v1 format (ignored without error).
> 
> to something simpler:
> 
>  1. v2.42.0 writes v2 format by default, but can be disabled by config.
>  2. v2.43.0 no longer parses or writes v1 format.
> 
> With this, we could proactively set the config value that disables the
> v2 format in our production environment, then slowly re-enable that
> config after the binaries have deployed. This allows us to limit the
> blast radius if something goes wrong, which is really important.
> 
> Further, I'm describing an environment where we control all of the Git
> versions that are interacting with the repositories. Other environments
> don't have that luxury, such as typical client users.
> 
> Even the three-version plan is an accelerated deprecation plan, based
> on previous examples in Git.
> 
> Thanks,
> -Stolee

OK, let me take a look and see what this (having at least a version
of Git that supports both versions of hash functions) would look like.
If we're going to have this, we might as well roll it out as safely
as possible, so I'll aim for your original step 1 of 3 (write v2, off
by default).

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v2 0/3] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (2 preceding siblings ...)
  2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-05-31 23:12 ` Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (3 more replies)
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
                   ` (4 subsequent siblings)
  8 siblings, 4 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Here's a new version. With this, Git can function with both version
1 (incorrect murmur3) and version 2 (correct murmur3) changed path
filters, but not at the same time: the user can set a config variable to
choose which one, and Git will ignore existing changed path filters of
the wrong version (and always write the version that the config variable
says).

In patch 1, the test assumes that char is signed. I'm not sure if it's
worth asserting on the contents of the filter, since it depends on
whether char is signed, but I've included it anyway (since it's easy
to remove).

Jonathan Tan (3):
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt | 16 +++++--
 bloom.c                              | 65 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 29 ++++++++++---
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 ++-
 repository.h                         |  2 +-
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 65 ++++++++++++++++++++++++++++
 10 files changed, 192 insertions(+), 18 deletions(-)

Range-diff against v1:
1:  3a5d53d3c0 < -:  ---------- t4216: test wrong bloom filter version rejection
2:  5a91f9682b < -:  ---------- commit-graph: fix murmur3, bump filter ver. to 2
-:  ---------- > 1:  175dc912fe t4216: test changed path filters with high bit paths
-:  ---------- > 2:  4a7553f3fb repo-settings: introduce commitgraph.changedPathsVersion
-:  ---------- > 3:  f5c3f6080a commit-graph: new filter ver. that fixes murmur3
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v2 1/3] t4216: test changed path filters with high bit paths
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
@ 2023-05-31 23:12   ` Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t4216-log-bloom.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..2ec5b5b5e7 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,38 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+get_first_changed_path_filter () {
+	BDAT_OFFSET=$(get_bdat_offset) &&
+	perl -0777 -ne \
+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
+		.git/objects/info/commit-graph
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\xc2\xa2")
+
+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 1 changed-path' '
+	(cd highbit1 &&
+		printf "52a9" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 1 changed-path used when version 1 requested' '
+	(cd highbit1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-05-31 23:12   ` Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2023-06-03  1:01   ` [PATCH v2 0/3] Changed path filter hash fix and version bump Junio C Hamano
  3 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version is
to be accepted when read (and which version to write), a config variable
is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is 0, 1, or 2.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config/commitgraph.txt | 16 +++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..eaa10bf232 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,16 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
+	changedPathsVersion=0 if false.
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	match the version set in this config variable will be ignored.
++
+Defaults to 1.
++
+If 0, git will write version 1 Bloom filters when instructed to write.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index c11b59f28b..bd448047f1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -399,7 +399,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version == 1) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 914026f5d8..b56731f51a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -18,7 +18,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 3dbd3f0e2e..6cbe02681b 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int readChangedPaths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     readChangedPaths ? 1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index e8c67ffe16..1f1c32a6dd 100644
--- a/repository.h
+++ b/repository.h
@@ -32,7 +32,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int gc_cruft_packs;
 	int fetch_write_commit_graph;
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-05-31 23:12   ` Jonathan Tan
  2023-06-03  1:01   ` [PATCH v2 0/3] Changed path filter hash fix and version bump Junio C Hamano
  3 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config/commitgraph.txt |  2 +-
 bloom.c                              | 65 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 29 ++++++++++---
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 31 +++++++++++++
 7 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index eaa10bf232..c64ee4f459 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -14,7 +14,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
 	match the version set in this config variable will be ignored.
 +
 Defaults to 1.
diff --git a/bloom.c b/bloom.c
index d0730525da..915d8e5a31 100644
--- a/bloom.c
+++ b/bloom.c
@@ -65,7 +65,64 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	const uint32_t hash0 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
+	const uint32_t hash1 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index adde6dfe21..0c33ae282c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -75,7 +77,7 @@ struct bloom_key {
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index bd448047f1..36e6d09e74 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -302,15 +302,21 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_data {
+	struct commit_graph *g;
+	int commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_data *d = data;
+	struct commit_graph *g = d->g;
 	uint32_t hash_version;
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
+	if (hash_version != d->commit_graph_changed_paths_version)
 		return 0;
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
@@ -399,11 +405,16 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_changed_paths_version == 1) {
+	if (s->commit_graph_changed_paths_version == 1
+	    || s->commit_graph_changed_paths_version == 2) {
+		struct graph_read_bloom_data_data data = {
+			.g = graph,
+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &data);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2302,6 +2313,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < 0
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2331,7 +2350,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6c900ca668..34b8dd9164 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -48,6 +48,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -62,7 +63,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 2ec5b5b5e7..764c6dee0f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -438,4 +438,35 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
 		test_bloom_filters_used "-- $CENT")
 '
 
+test_expect_success 'version 1 changed-path not used when version 2 requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 2 changed-path' '
+	(cd highbit2 &&
+		printf "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 2 changed-path used when version 2 requested' '
+	(cd highbit2 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path not used when version 1 requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v2 0/3] Changed path filter hash fix and version bump
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
                     ` (2 preceding siblings ...)
  2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-06-03  1:01   ` Junio C Hamano
  2023-06-03  2:24     ` Junio C Hamano
  3 siblings, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-06-03  1:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> Here's a new version. With this, Git can function with both version
> 1 (incorrect murmur3) and version 2 (correct murmur3) changed path
> filters, but not at the same time: the user can set a config variable to
> choose which one, and Git will ignore existing changed path filters of
> the wrong version (and always write the version that the config variable
> says).

Hmph.  On a system with unsigned char, we should be able to keep
using version 1 without losing correctness, I suspect, but probably
they are in the minority we do not have to care about?  I can see
the desire to simplify the migration plan (i.e. essentially have no
migration---this will give us just a flag day per repository), but
I'll let others to comment.

> In patch 1, the test assumes that char is signed. I'm not sure if it's
> worth asserting on the contents of the filter, since it depends on
> whether char is signed, but I've included it anyway (since it's easy
> to remove).

So, on a system with unsigned char, would these tests fail?  Do we
need a prereq to skip them?

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v2 0/3] Changed path filter hash fix and version bump
  2023-06-03  1:01   ` [PATCH v2 0/3] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-06-03  2:24     ` Junio C Hamano
  2023-06-07 16:30       ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-06-03  2:24 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> Here's a new version. With this, Git can function with both version
>> 1 (incorrect murmur3) and version 2 (correct murmur3) changed path
>> filters, but not at the same time: the user can set a config variable to
>> choose which one, and Git will ignore existing changed path filters of
>> the wrong version (and always write the version that the config variable
>> says).

Seems to break t4216 when merged to 'seen' to replace the previous
round.  Could you take a look?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v2 0/3] Changed path filter hash fix and version bump
  2023-06-03  2:24     ` Junio C Hamano
@ 2023-06-07 16:30       ` Jonathan Tan
  2023-06-07 21:37         ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-07 16:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Jonathan Tan <jonathantanmy@google.com> writes:
> >
> >> Here's a new version. With this, Git can function with both version
> >> 1 (incorrect murmur3) and version 2 (correct murmur3) changed path
> >> filters, but not at the same time: the user can set a config variable to
> >> choose which one, and Git will ignore existing changed path filters of
> >> the wrong version (and always write the version that the config variable
> >> says).
> 
> Seems to break t4216 when merged to 'seen' to replace the previous
> round.  Could you take a look?

OK - I see that it fails CI when I upload the merge to GitHub (although
it passes locally). I'll take a look.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v2 0/3] Changed path filter hash fix and version bump
  2023-06-07 16:30       ` Jonathan Tan
@ 2023-06-07 21:37         ` Jonathan Tan
  0 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-07 21:37 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, Derrick Stolee

Jonathan Tan <jonathantanmy@google.com> writes:
> Junio C Hamano <gitster@pobox.com> writes:
> > Seems to break t4216 when merged to 'seen' to replace the previous
> > round.  Could you take a look?
> 
> OK - I see that it fails CI when I upload the merge to GitHub (although
> it passes locally). I'll take a look.

Hmm...so when the test writes a file with a high bit filename, the
filename written is different. Here's the output from my machine
(Linux):

        ++ git -C highbit1/ commit -m c1                                                                                                                          
        [main (root-commit) 808e481] c1                                                                                                                           
         Author: A U Thor <author@example.com>                                                                                                                    
         1 file changed, 1 insertion(+)                                                                                                                           
         create mode 100644 "\302\242"                                                                                                                            
        ++ case "$tag" in                                                                                                                                         
        ++ git -C highbit1/ tag c1                                                                                                                                
        ++ touch $'\302\242\302\242'                                                                                                                              
        ++ ls                                                                                                                                                     
         A        expect   file5_renamed   limits        log_wo_bloom   trace2-auto.txt  ''$'\302\242\302\242'                                                    
         actual   file4    highbit1        log_w_bloom   trace.perf     trace2.txt                                                                                
        ++ git -C highbit1 commit-graph write --reachable --changed-paths                                                                                         
        bloom calculating -62,-94 version 1                                                                                                                       
        ok 141 - set up repo with high bit path, version 1 changed-path       

and here's the output from the CI on GitHub on linux-gcc-default:

        [main (root-commit) 3f37a43] c1
         Author: A U Thor <author@example.com>
         1 file changed, 1 insertion(+)
         create mode 100644 "\\xc2\\xa2"
        + git -C highbit1/ tag c1
        + touch \xc2\xa2\xc2\xa2
        + ls
        A
        \xc2\xa2\xc2\xa2
        actual
        expect
        file4
        file5_renamed
        highbit1
        limits
        log_w_bloom
        log_wo_bloom
        trace.perf
        trace2-auto.txt
        trace2.txt
        + git -C highbit1 commit-graph write --reachable --changed-paths
        bloom calculating 92,120 version 1

        ok 141 - set up repo with high bit path, version 1 changed-path

These were run with some extra code, here for completeness:

        diff --git a/bloom.c b/bloom.c
        index dea883d8d6..ccc3e0ce80 100644
        --- a/bloom.c
        +++ b/bloom.c
        @@ -192,6 +192,8 @@ void fill_bloom_key(const char *data,
                        ? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
                const uint32_t hash1 = (settings->hash_version == 2
                        ? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
        +       if (len > 0)
        +               fprintf(stderr, "bloom calculating %d,%d version %d\n", data[0], data[1], settings->hash_version);
 
                key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
                for (i = 0; i < settings->num_hashes; i++)
        diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
        index 764c6dee0f..1121e7d4cd 100755
        --- a/t/t4216-log-bloom.sh
        +++ b/t/t4216-log-bloom.sh
        @@ -423,6 +423,8 @@ CENT=$(printf "\xc2\xa2")
         test_expect_success 'set up repo with high bit path, version 1 changed-path' '
                git init highbit1 &&
                test_commit -C highbit1 c1 "$CENT" &&
        +       touch $CENT$CENT &&
        +       ls &&
                git -C highbit1 commit-graph write --reachable --changed-paths
         '

Notice the different filename after "create mode", and the different
"ls" output. I'll investigate some more, but if anyone offhand knows
what's going on (and/or knows how to write a high-bit filename portably,
even across  Linux), please let me know.

An alternative is to drop the tests from these patches - so, leave them
in during the review period and reviewers would see that the tests pass
the CI jobs for Windows and Mac OS X pass, and then before we merge it,
delete the tests from the patches. This also solves needing to prevent
unsigned char platforms from running the version 1 tests - there's no
prereq for that yet and we would have to investigate making one.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v3 0/4] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (3 preceding siblings ...)
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
@ 2023-06-08 19:21 ` Jonathan Tan
  2023-06-08 19:21   ` [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
                     ` (4 more replies)
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
                   ` (3 subsequent siblings)
  8 siblings, 5 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-08 19:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano

Here's an updated version with changes only in the tests:
 - resilience against unsigned-by-default (a "skip" will be shown)
 - for systems that have different quoting behavior, the tests
   will be skipped (see patch 2 for more information)

Some of the test changes may seem a bit of a hack, so if you have
a better way of doing things, please let me know.

I've also included a change to the file format specification. To
reviewers, if you generally agree that we need a version 2 but are still
unsure about how we should migrate, consider saying so and then perhaps
we can merge the first patch while the rest remain under review. This
will give other projects, like JGit, more certainty as to the direction
that the Git project wants to take.

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     | 16 ++++-
 Documentation/gitformat-commit-graph.txt |  9 ++-
 bloom.c                                  | 65 +++++++++++++++++-
 bloom.h                                  |  8 ++-
 commit-graph.c                           | 29 ++++++--
 oss-fuzz/fuzz-commit-graph.c             |  2 +-
 repo-settings.c                          |  6 +-
 repository.h                             |  2 +-
 t/helper/test-bloom.c                    |  9 ++-
 t/t0095-bloom.sh                         |  8 +++
 t/t4216-log-bloom.sh                     | 86 ++++++++++++++++++++++++
 11 files changed, 219 insertions(+), 21 deletions(-)

Range-diff against v2:
-:  ---------- > 1:  d4b63945f6 gitformat-commit-graph: describe version 2 of BDAT
1:  c587eb3470 ! 2:  aa4535776e t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +# chosen to be the same under all Unicode normalization forms
     +CENT=$(printf "\xc2\xa2")
     +
    -+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
    ++# Some systems (in particular, Linux on the CI running on GitHub at the time of
    ++# writing) store into CENT a literal backslash, then "x", and so on (instead of
    ++# the high-bit characters needed). In these systems, do not run the following
    ++# tests.
    ++if test "$(printf $CENT | perl -0777 -ne 'no utf8; print ord($_)')" = "194"
    ++then
    ++	test_set_prereq HIGH_BIT
    ++fi
    ++
    ++test_expect_success HIGH_BIT 'set up repo with high bit path, version 1 changed-path' '
     +	git init highbit1 &&
     +	test_commit -C highbit1 c1 "$CENT" &&
     +	git -C highbit1 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success 'check value of version 1 changed-path' '
    ++test_expect_success HIGH_BIT 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
    -+		get_first_changed_path_filter >actual &&
    -+		test_cmp expect actual)
    ++		get_first_changed_path_filter >actual)
    ++'
    ++
    ++# expect will not match actual if int is unsigned by default. Write the test
    ++# in this way, so that a user running this test script can still see if the two
    ++# files match. (It will appear as an ordinary success if they match, and a skip
    ++# if not.)
    ++if test_cmp highbit1/expect highbit1/actual
    ++then
    ++	test_set_prereq SIGNED_INT_BY_DEFAULT
    ++fi
    ++test_expect_success SIGNED_INT_BY_DEFAULT 'check value of version 1 changed-path' '
    ++	# Only the prereq matters for this test.
    ++	true
     +'
     +
    -+test_expect_success 'version 1 changed-path used when version 1 requested' '
    ++test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 requested' '
     +	(cd highbit1 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
2:  d0e5dd20dc = 3:  d6982268a4 repo-settings: introduce commitgraph.changedPathsVersion
3:  eb19f8a35b ! 4:  e879483c42 commit-graph: new filter ver. that fixes murmur3
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
    +@@ t/t4216-log-bloom.sh: test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 request
      		test_bloom_filters_used "-- $CENT")
      '
      
    -+test_expect_success 'version 1 changed-path not used when version 2 requested' '
    ++test_expect_success HIGH_BIT 'version 1 changed-path not used when version 2 requested' '
     +	(cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion 2 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    -+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
    ++test_expect_success HIGH_BIT 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
     +	test_commit -C highbit2 c2 "$CENT" &&
     +	git -C highbit2 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success 'check value of version 2 changed-path' '
    ++test_expect_success HIGH_BIT 'check value of version 2 changed-path' '
     +	(cd highbit2 &&
     +		printf "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
     +
    -+test_expect_success 'version 2 changed-path used when version 2 requested' '
    ++test_expect_success HIGH_BIT 'version 2 changed-path used when version 2 requested' '
     +	(cd highbit2 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
     +
    -+test_expect_success 'version 2 changed-path not used when version 1 requested' '
    ++test_expect_success HIGH_BIT 'version 2 changed-path not used when version 1 requested' '
     +	(cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
@ 2023-06-08 19:21   ` Jonathan Tan
  2023-06-08 19:52     ` Ramsay Jones
  2023-06-08 19:21   ` [PATCH v3 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-08 19:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano

The code change to Git to support version 2 will be done in subsequent
commits.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/gitformat-commit-graph.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
index 31cad585e2..9dab222830 100644
--- a/Documentation/gitformat-commit-graph.txt
+++ b/Documentation/gitformat-commit-graph.txt
@@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
 
 ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     * It starts with header consisting of three unsigned 32-bit integers:
-      - Version of the hash algorithm being used. We currently only support
-	value 1 which corresponds to the 32-bit version of the murmur3 hash
+      - Version of the hash algorithm being used. We currently support
+	value 2 which corresponds to the 32-bit version of the murmur3 hash
 	implemented exactly as described in
 	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
 	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
 	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
-	in Probabilistic Verification"
+	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
+	when int is signed and the repository has path names that have characters >=
+	0x80; Git supports reading and writing them, but this ability will be removed
+	in a future version of Git.
       - The number of times a path is hashed and hence the number of bit positions
 	      that cumulatively determine whether a file is present in the commit.
       - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v3 2/4] t4216: test changed path filters with high bit paths
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
  2023-06-08 19:21   ` [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-06-08 19:21   ` Jonathan Tan
  2023-06-08 19:21   ` [PATCH v3 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-08 19:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4216-log-bloom.sh | 55 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..f68df24bd5 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,59 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+get_first_changed_path_filter () {
+	BDAT_OFFSET=$(get_bdat_offset) &&
+	perl -0777 -ne \
+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
+		.git/objects/info/commit-graph
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\xc2\xa2")
+
+# Some systems (in particular, Linux on the CI running on GitHub at the time of
+# writing) store into CENT a literal backslash, then "x", and so on (instead of
+# the high-bit characters needed). In these systems, do not run the following
+# tests.
+if test "$(printf $CENT | perl -0777 -ne 'no utf8; print ord($_)')" = "194"
+then
+	test_set_prereq HIGH_BIT
+fi
+
+test_expect_success HIGH_BIT 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success HIGH_BIT 'setup check value of version 1 changed-path' '
+	(cd highbit1 &&
+		printf "52a9" >expect &&
+		get_first_changed_path_filter >actual)
+'
+
+# expect will not match actual if int is unsigned by default. Write the test
+# in this way, so that a user running this test script can still see if the two
+# files match. (It will appear as an ordinary success if they match, and a skip
+# if not.)
+if test_cmp highbit1/expect highbit1/actual
+then
+	test_set_prereq SIGNED_INT_BY_DEFAULT
+fi
+test_expect_success SIGNED_INT_BY_DEFAULT 'check value of version 1 changed-path' '
+	# Only the prereq matters for this test.
+	true
+'
+
+test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 requested' '
+	(cd highbit1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v3 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
  2023-06-08 19:21   ` [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
  2023-06-08 19:21   ` [PATCH v3 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-06-08 19:21   ` Jonathan Tan
  2023-06-08 19:21   ` [PATCH v3 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2023-06-08 19:50   ` [PATCH v3 0/4] Changed path filter hash fix and version bump Ramsay Jones
  4 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-08 19:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version is
to be accepted when read (and which version to write), a config variable
is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is 0, 1, or 2.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt | 16 +++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..eaa10bf232 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,16 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
+	changedPathsVersion=0 if false.
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	match the version set in this config variable will be ignored.
++
+Defaults to 1.
++
+If 0, git will write version 1 Bloom filters when instructed to write.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index c11b59f28b..bd448047f1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -399,7 +399,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version == 1) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 914026f5d8..b56731f51a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -18,7 +18,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 3dbd3f0e2e..6cbe02681b 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int readChangedPaths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     readChangedPaths ? 1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index e8c67ffe16..1f1c32a6dd 100644
--- a/repository.h
+++ b/repository.h
@@ -32,7 +32,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int gc_cruft_packs;
 	int fetch_write_commit_graph;
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v3 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
                     ` (2 preceding siblings ...)
  2023-06-08 19:21   ` [PATCH v3 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-06-08 19:21   ` Jonathan Tan
  2023-06-08 19:50   ` [PATCH v3 0/4] Changed path filter hash fix and version bump Ramsay Jones
  4 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-08 19:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  2 +-
 bloom.c                              | 65 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 29 ++++++++++---
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 31 +++++++++++++
 7 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index eaa10bf232..c64ee4f459 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -14,7 +14,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
 	match the version set in this config variable will be ignored.
 +
 Defaults to 1.
diff --git a/bloom.c b/bloom.c
index d0730525da..915d8e5a31 100644
--- a/bloom.c
+++ b/bloom.c
@@ -65,7 +65,64 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	const uint32_t hash0 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
+	const uint32_t hash1 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index adde6dfe21..0c33ae282c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -75,7 +77,7 @@ struct bloom_key {
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index bd448047f1..36e6d09e74 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -302,15 +302,21 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_data {
+	struct commit_graph *g;
+	int commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_data *d = data;
+	struct commit_graph *g = d->g;
 	uint32_t hash_version;
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
+	if (hash_version != d->commit_graph_changed_paths_version)
 		return 0;
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
@@ -399,11 +405,16 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_changed_paths_version == 1) {
+	if (s->commit_graph_changed_paths_version == 1
+	    || s->commit_graph_changed_paths_version == 2) {
+		struct graph_read_bloom_data_data data = {
+			.g = graph,
+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &data);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2302,6 +2313,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < 0
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2331,7 +2350,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6c900ca668..34b8dd9164 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -48,6 +48,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -62,7 +63,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f68df24bd5..4c8f039eb4 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -459,4 +459,35 @@ test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 request
 		test_bloom_filters_used "-- $CENT")
 '
 
+test_expect_success HIGH_BIT 'version 1 changed-path not used when version 2 requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success HIGH_BIT 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success HIGH_BIT 'check value of version 2 changed-path' '
+	(cd highbit2 &&
+		printf "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success HIGH_BIT 'version 2 changed-path used when version 2 requested' '
+	(cd highbit2 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success HIGH_BIT 'version 2 changed-path not used when version 1 requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Changed path filter hash fix and version bump
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
                     ` (3 preceding siblings ...)
  2023-06-08 19:21   ` [PATCH v3 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-06-08 19:50   ` Ramsay Jones
  2023-06-09  0:08     ` Jonathan Tan
  2023-06-12 21:31     ` Junio C Hamano
  4 siblings, 2 replies; 116+ messages in thread
From: Ramsay Jones @ 2023-06-08 19:50 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: Junio C Hamano



On 08/06/2023 20:21, Jonathan Tan wrote:
> Here's an updated version with changes only in the tests:
>  - resilience against unsigned-by-default (a "skip" will be shown)
>  - for systems that have different quoting behavior, the tests
>    will be skipped (see patch 2 for more information)
> 
> Some of the test changes may seem a bit of a hack, so if you have
> a better way of doing things, please let me know.
> 
> I've also included a change to the file format specification. To
> reviewers, if you generally agree that we need a version 2 but are still
> unsure about how we should migrate, consider saying so and then perhaps
> we can merge the first patch while the rest remain under review. This
> will give other projects, like JGit, more certainty as to the direction
> that the Git project wants to take.
> 
> Jonathan Tan (4):
>   gitformat-commit-graph: describe version 2 of BDAT
>   t4216: test changed path filters with high bit paths
>   repo-settings: introduce commitgraph.changedPathsVersion
>   commit-graph: new filter ver. that fixes murmur3
> 
>  Documentation/config/commitgraph.txt     | 16 ++++-
>  Documentation/gitformat-commit-graph.txt |  9 ++-
>  bloom.c                                  | 65 +++++++++++++++++-
>  bloom.h                                  |  8 ++-
>  commit-graph.c                           | 29 ++++++--
>  oss-fuzz/fuzz-commit-graph.c             |  2 +-
>  repo-settings.c                          |  6 +-
>  repository.h                             |  2 +-
>  t/helper/test-bloom.c                    |  9 ++-
>  t/t0095-bloom.sh                         |  8 +++
>  t/t4216-log-bloom.sh                     | 86 ++++++++++++++++++++++++
>  11 files changed, 219 insertions(+), 21 deletions(-)
> 
> Range-diff against v2:
> -:  ---------- > 1:  d4b63945f6 gitformat-commit-graph: describe version 2 of BDAT
> 1:  c587eb3470 ! 2:  aa4535776e t4216: test changed path filters with high bit paths
>     @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
>      +# chosen to be the same under all Unicode normalization forms
>      +CENT=$(printf "\xc2\xa2")

I think the only change you need to make here (because /usr/bin/sh
on Ubuntu is usually 'dash' not 'bash') is to use octal rather than
hexadecimal. ie: CENT=$(printf "\302\242")

HTH

ATB,
Ramsay Jones


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-08 19:21   ` [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-06-08 19:52     ` Ramsay Jones
  2023-06-12 21:26       ` Junio C Hamano
  0 siblings, 1 reply; 116+ messages in thread
From: Ramsay Jones @ 2023-06-08 19:52 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: Junio C Hamano



On 08/06/2023 20:21, Jonathan Tan wrote:
> The code change to Git to support version 2 will be done in subsequent
> commits.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/gitformat-commit-graph.txt | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
> index 31cad585e2..9dab222830 100644
> --- a/Documentation/gitformat-commit-graph.txt
> +++ b/Documentation/gitformat-commit-graph.txt
> @@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
>  
>  ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>      * It starts with header consisting of three unsigned 32-bit integers:
> -      - Version of the hash algorithm being used. We currently only support
> -	value 1 which corresponds to the 32-bit version of the murmur3 hash
> +      - Version of the hash algorithm being used. We currently support
> +	value 2 which corresponds to the 32-bit version of the murmur3 hash
>  	implemented exactly as described in
>  	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
>  	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
>  	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> -	in Probabilistic Verification"
> +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
> +	when int is signed and the repository has path names that have characters >=

s/int is signed/char is signed/ ?

ATB,
Ramsay Jones

> +	0x80; Git supports reading and writing them, but this ability will be removed
> +	in a future version of Git.
>        - The number of times a path is hashed and hence the number of bit positions
>  	      that cumulatively determine whether a file is present in the commit.
>        - The minimum number of bits 'b' per entry in the Bloom filter. If the filter

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Changed path filter hash fix and version bump
  2023-06-08 19:50   ` [PATCH v3 0/4] Changed path filter hash fix and version bump Ramsay Jones
@ 2023-06-09  0:08     ` Jonathan Tan
  2023-06-12 21:31     ` Junio C Hamano
  1 sibling, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-09  0:08 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Jonathan Tan, git, Junio C Hamano

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:
> >      +CENT=$(printf "\xc2\xa2")
> 
> I think the only change you need to make here (because /usr/bin/sh
> on Ubuntu is usually 'dash' not 'bash') is to use octal rather than
> hexadecimal. ie: CENT=$(printf "\302\242")
> 
> HTH
> 
> ATB,
> Ramsay Jones

Thanks! Yes, it works. I've checked that it passes CI [1] and will wait
for other reviews before sending out a new version to the list.

[1] https://github.com/jonathantanmy/git/actions/runs/5216513514

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-08 19:52     ` Ramsay Jones
@ 2023-06-12 21:26       ` Junio C Hamano
  0 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-06-12 21:26 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Jonathan Tan, git

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

>>  	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
>> -	in Probabilistic Verification"
>> +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
>> +	when int is signed and the repository has path names that have characters >=
>
> s/int is signed/char is signed/ ?

Indeed.  Thanks for sharp eyes.  Will tweak while queuing.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Changed path filter hash fix and version bump
  2023-06-08 19:50   ` [PATCH v3 0/4] Changed path filter hash fix and version bump Ramsay Jones
  2023-06-09  0:08     ` Jonathan Tan
@ 2023-06-12 21:31     ` Junio C Hamano
  2023-06-13 17:16       ` Jonathan Tan
  1 sibling, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-06-12 21:31 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Jonathan Tan, git

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

> I think the only change you need to make here (because /usr/bin/sh
> on Ubuntu is usually 'dash' not 'bash') is to use octal rather than
> hexadecimal. ie: CENT=$(printf "\302\242")

Perhaps an addition to Documentation/CodingGuidelines is in order?

 Documentation/CodingGuidelines | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
index 9d5c27807a..78bc60665d 100644
--- c/Documentation/CodingGuidelines
+++ w/Documentation/CodingGuidelines
@@ -188,6 +188,15 @@ For shell scripts specifically (not exhaustive):
    hopefully nobody starts using "local" before they are reimplemented
    in C ;-)
 
+ - In 'printf' format string, do not use hexadecimals, as they are not
+   portable.  Write 
+
+     CENT=$(printf "\302\242")
+
+   not
+
+     CENT=$(printf "\xc2\xa2")
+
 
 For C programs:
 



^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Changed path filter hash fix and version bump
  2023-06-12 21:31     ` Junio C Hamano
@ 2023-06-13 17:16       ` Jonathan Tan
  2023-06-13 17:29         ` [PATCH] CodingGuidelines: use octal escapes, not hex Jonathan Tan
  2023-06-13 19:16         ` [PATCH v3 0/4] Changed path filter hash fix and version bump Junio C Hamano
  0 siblings, 2 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Ramsay Jones, git

Junio C Hamano <gitster@pobox.com> writes:
> Perhaps an addition to Documentation/CodingGuidelines is in order?
> 
>  Documentation/CodingGuidelines | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
> index 9d5c27807a..78bc60665d 100644
> --- c/Documentation/CodingGuidelines
> +++ w/Documentation/CodingGuidelines
> @@ -188,6 +188,15 @@ For shell scripts specifically (not exhaustive):
>     hopefully nobody starts using "local" before they are reimplemented
>     in C ;-)
>  
> + - In 'printf' format string, do not use hexadecimals, as they are not
> +   portable.  Write 
> +
> +     CENT=$(printf "\302\242")
> +
> +   not
> +
> +     CENT=$(printf "\xc2\xa2")

I've checked with "dash" and this applies to any quoted string, not just
when passed to printf. I'll prepare a patch describing this.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH] CodingGuidelines: use octal escapes, not hex
  2023-06-13 17:16       ` Jonathan Tan
@ 2023-06-13 17:29         ` Jonathan Tan
  2023-06-13 18:16           ` Eric Sunshine
  2023-06-13 19:16         ` [PATCH v3 0/4] Changed path filter hash fix and version bump Junio C Hamano
  1 sibling, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:29 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Hexadecimal escapes in shell scripts are not portable across shells (in
particular, "dash" does not support them). Write in the CodingGuidelines
document that we should be using octal escapes instead.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/CodingGuidelines | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
index 003393ed16..1c54abd7c5 100644
--- a/Documentation/CodingGuidelines
+++ b/Documentation/CodingGuidelines
@@ -188,6 +188,9 @@ For shell scripts specifically (not exhaustive):
    hopefully nobody starts using "local" before they are reimplemented
    in C ;-)
 
+ - Use octal escape sequences (e.g. "\302\242"), not hexadecimal (e.g.
+   "\xc2\xa2"), as the latter is not portable.
+
 
 For C programs:
 
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (4 preceding siblings ...)
  2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
@ 2023-06-13 17:39 ` Jonathan Tan
  2023-06-13 17:39   ` [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
                     ` (4 more replies)
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
                   ` (2 subsequent siblings)
  8 siblings, 5 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Ramsay Jones, Junio C Hamano

Thanks Ramsay for spotting the errors and mentioning that I can use
octal escapes. Here's an update taking into account their comments.

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     | 16 ++++-
 Documentation/gitformat-commit-graph.txt |  9 ++-
 bloom.c                                  | 65 +++++++++++++++++++-
 bloom.h                                  |  8 ++-
 commit-graph.c                           | 29 +++++++--
 oss-fuzz/fuzz-commit-graph.c             |  2 +-
 repo-settings.c                          |  6 +-
 repository.h                             |  2 +-
 t/helper/test-bloom.c                    |  9 ++-
 t/t0095-bloom.sh                         |  8 +++
 t/t4216-log-bloom.sh                     | 77 ++++++++++++++++++++++++
 11 files changed, 210 insertions(+), 21 deletions(-)

Range-diff against v3:
1:  d4b63945f6 ! 1:  a3b52af4c9 gitformat-commit-graph: describe version 2 of BDAT
    @@ Documentation/gitformat-commit-graph.txt: All multi-byte numbers are in network
      	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
     -	in Probabilistic Verification"
     +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
    -+	when int is signed and the repository has path names that have characters >=
    ++	when char is signed and the repository has path names that have characters >=
     +	0x80; Git supports reading and writing them, but this ability will be removed
     +	in a future version of Git.
            - The number of times a path is hashed and hence the number of bit positions
2:  aa4535776e ! 2:  f095e2b486 t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +}
     +
     +# chosen to be the same under all Unicode normalization forms
    -+CENT=$(printf "\xc2\xa2")
    ++CENT=$(printf "\302\242")
     +
    -+# Some systems (in particular, Linux on the CI running on GitHub at the time of
    -+# writing) store into CENT a literal backslash, then "x", and so on (instead of
    -+# the high-bit characters needed). In these systems, do not run the following
    -+# tests.
    -+if test "$(printf $CENT | perl -0777 -ne 'no utf8; print ord($_)')" = "194"
    -+then
    -+	test_set_prereq HIGH_BIT
    -+fi
    -+
    -+test_expect_success HIGH_BIT 'set up repo with high bit path, version 1 changed-path' '
    ++test_expect_success 'set up repo with high bit path, version 1 changed-path' '
     +	git init highbit1 &&
     +	test_commit -C highbit1 c1 "$CENT" &&
     +	git -C highbit1 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success HIGH_BIT 'setup check value of version 1 changed-path' '
    ++test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
     +		get_first_changed_path_filter >actual)
     +'
     +
    -+# expect will not match actual if int is unsigned by default. Write the test
    ++# expect will not match actual if char is unsigned by default. Write the test
     +# in this way, so that a user running this test script can still see if the two
     +# files match. (It will appear as an ordinary success if they match, and a skip
     +# if not.)
     +if test_cmp highbit1/expect highbit1/actual
     +then
    -+	test_set_prereq SIGNED_INT_BY_DEFAULT
    ++	test_set_prereq SIGNED_CHAR_BY_DEFAULT
     +fi
    -+test_expect_success SIGNED_INT_BY_DEFAULT 'check value of version 1 changed-path' '
    ++test_expect_success SIGNED_CHAR_BY_DEFAULT 'check value of version 1 changed-path' '
     +	# Only the prereq matters for this test.
     +	true
     +'
     +
    -+test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 requested' '
    ++test_expect_success 'version 1 changed-path used when version 1 requested' '
     +	(cd highbit1 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
3:  d6982268a4 = 3:  6adfa53daf repo-settings: introduce commitgraph.changedPathsVersion
4:  e879483c42 ! 4:  5c65bf8a22 commit-graph: new filter ver. that fixes murmur3
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 request
    +@@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
      '
      
    -+test_expect_success HIGH_BIT 'version 1 changed-path not used when version 2 requested' '
    ++test_expect_success 'version 1 changed-path not used when version 2 requested' '
     +	(cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion 2 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    -+test_expect_success HIGH_BIT 'set up repo with high bit path, version 2 changed-path' '
    ++test_expect_success 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
     +	test_commit -C highbit2 c2 "$CENT" &&
     +	git -C highbit2 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success HIGH_BIT 'check value of version 2 changed-path' '
    ++test_expect_success 'check value of version 2 changed-path' '
     +	(cd highbit2 &&
     +		printf "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
     +
    -+test_expect_success HIGH_BIT 'version 2 changed-path used when version 2 requested' '
    ++test_expect_success 'version 2 changed-path used when version 2 requested' '
     +	(cd highbit2 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
     +
    -+test_expect_success HIGH_BIT 'version 2 changed-path not used when version 1 requested' '
    ++test_expect_success 'version 2 changed-path not used when version 1 requested' '
     +	(cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
@ 2023-06-13 17:39   ` Jonathan Tan
  2023-06-13 21:58     ` Junio C Hamano
  2023-06-13 17:39   ` [PATCH v4 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Ramsay Jones, Junio C Hamano

The code change to Git to support version 2 will be done in subsequent
commits.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/gitformat-commit-graph.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
index 31cad585e2..112e6d36a6 100644
--- a/Documentation/gitformat-commit-graph.txt
+++ b/Documentation/gitformat-commit-graph.txt
@@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
 
 ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     * It starts with header consisting of three unsigned 32-bit integers:
-      - Version of the hash algorithm being used. We currently only support
-	value 1 which corresponds to the 32-bit version of the murmur3 hash
+      - Version of the hash algorithm being used. We currently support
+	value 2 which corresponds to the 32-bit version of the murmur3 hash
 	implemented exactly as described in
 	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
 	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
 	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
-	in Probabilistic Verification"
+	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
+	when char is signed and the repository has path names that have characters >=
+	0x80; Git supports reading and writing them, but this ability will be removed
+	in a future version of Git.
       - The number of times a path is hashed and hence the number of bit positions
 	      that cumulatively determine whether a file is present in the commit.
       - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v4 2/4] t4216: test changed path filters with high bit paths
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
  2023-06-13 17:39   ` [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-06-13 17:39   ` Jonathan Tan
  2023-06-13 17:39   ` [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Ramsay Jones, Junio C Hamano

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4216-log-bloom.sh | 46 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..baa0c48897 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,50 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+get_first_changed_path_filter () {
+	BDAT_OFFSET=$(get_bdat_offset) &&
+	perl -0777 -ne \
+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
+		.git/objects/info/commit-graph
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\302\242")
+
+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'setup check value of version 1 changed-path' '
+	(cd highbit1 &&
+		printf "52a9" >expect &&
+		get_first_changed_path_filter >actual)
+'
+
+# expect will not match actual if char is unsigned by default. Write the test
+# in this way, so that a user running this test script can still see if the two
+# files match. (It will appear as an ordinary success if they match, and a skip
+# if not.)
+if test_cmp highbit1/expect highbit1/actual
+then
+	test_set_prereq SIGNED_CHAR_BY_DEFAULT
+fi
+test_expect_success SIGNED_CHAR_BY_DEFAULT 'check value of version 1 changed-path' '
+	# Only the prereq matters for this test.
+	true
+'
+
+test_expect_success 'version 1 changed-path used when version 1 requested' '
+	(cd highbit1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
  2023-06-13 17:39   ` [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
  2023-06-13 17:39   ` [PATCH v4 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-06-13 17:39   ` Jonathan Tan
  2023-06-20 13:28     ` Derrick Stolee
  2023-06-21 12:14     ` Taylor Blau
  2023-06-13 17:39   ` [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2023-06-13 19:21   ` [PATCH v4 0/4] Changed path filter hash fix and version bump Junio C Hamano
  4 siblings, 2 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Ramsay Jones, Junio C Hamano

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version is
to be accepted when read (and which version to write), a config variable
is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is 0, 1, or 2.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt | 16 +++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..eaa10bf232 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,16 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
+	changedPathsVersion=0 if false.
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	match the version set in this config variable will be ignored.
++
+Defaults to 1.
++
+If 0, git will write version 1 Bloom filters when instructed to write.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index c11b59f28b..bd448047f1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -399,7 +399,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version == 1) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 914026f5d8..b56731f51a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -18,7 +18,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 3dbd3f0e2e..6cbe02681b 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int readChangedPaths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     readChangedPaths ? 1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index e8c67ffe16..1f1c32a6dd 100644
--- a/repository.h
+++ b/repository.h
@@ -32,7 +32,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int gc_cruft_packs;
 	int fetch_write_commit_graph;
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
                     ` (2 preceding siblings ...)
  2023-06-13 17:39   ` [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-06-13 17:39   ` Jonathan Tan
  2023-06-20 13:39     ` Derrick Stolee
  2023-06-13 19:21   ` [PATCH v4 0/4] Changed path filter hash fix and version bump Junio C Hamano
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 17:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Ramsay Jones, Junio C Hamano

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  2 +-
 bloom.c                              | 65 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 29 ++++++++++---
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 31 +++++++++++++
 7 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index eaa10bf232..c64ee4f459 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -14,7 +14,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
 	match the version set in this config variable will be ignored.
 +
 Defaults to 1.
diff --git a/bloom.c b/bloom.c
index d0730525da..915d8e5a31 100644
--- a/bloom.c
+++ b/bloom.c
@@ -65,7 +65,64 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	const uint32_t hash0 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
+	const uint32_t hash1 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index adde6dfe21..0c33ae282c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -75,7 +77,7 @@ struct bloom_key {
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index bd448047f1..36e6d09e74 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -302,15 +302,21 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_data {
+	struct commit_graph *g;
+	int commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_data *d = data;
+	struct commit_graph *g = d->g;
 	uint32_t hash_version;
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
+	if (hash_version != d->commit_graph_changed_paths_version)
 		return 0;
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
@@ -399,11 +405,16 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_changed_paths_version == 1) {
+	if (s->commit_graph_changed_paths_version == 1
+	    || s->commit_graph_changed_paths_version == 2) {
+		struct graph_read_bloom_data_data data = {
+			.g = graph,
+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &data);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2302,6 +2313,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < 0
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2331,7 +2350,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6c900ca668..34b8dd9164 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -48,6 +48,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -62,7 +63,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index baa0c48897..5cb35c3a3e 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -450,4 +450,35 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
 		test_bloom_filters_used "-- $CENT")
 '
 
+test_expect_success 'version 1 changed-path not used when version 2 requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 2 changed-path' '
+	(cd highbit2 &&
+		printf "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 2 changed-path used when version 2 requested' '
+	(cd highbit2 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path not used when version 1 requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH] CodingGuidelines: use octal escapes, not hex
  2023-06-13 17:29         ` [PATCH] CodingGuidelines: use octal escapes, not hex Jonathan Tan
@ 2023-06-13 18:16           ` Eric Sunshine
  2023-06-13 18:43             ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Eric Sunshine @ 2023-06-13 18:16 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Tue, Jun 13, 2023 at 1:44 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> Hexadecimal escapes in shell scripts are not portable across shells (in
> particular, "dash" does not support them). Write in the CodingGuidelines
> document that we should be using octal escapes instead.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
> @@ -188,6 +188,9 @@ For shell scripts specifically (not exhaustive):
> + - Use octal escape sequences (e.g. "\302\242"), not hexadecimal (e.g.
> +   "\xc2\xa2"), as the latter is not portable.

The shell itself doesn't interpret these sequences, so this
description feels too generic. Perhaps it would make more sense to
cite specific tools for which octal sequences are needed for
portability reasons, such as `printf`, `sed`, `tr`, etc.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] CodingGuidelines: use octal escapes, not hex
  2023-06-13 18:16           ` Eric Sunshine
@ 2023-06-13 18:43             ` Jonathan Tan
  2023-06-13 19:15               ` Eric Sunshine
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-13 18:43 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jonathan Tan, git, gitster

Eric Sunshine <sunshine@sunshineco.com> writes:
> On Tue, Jun 13, 2023 at 1:44 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> > Hexadecimal escapes in shell scripts are not portable across shells (in
> > particular, "dash" does not support them). Write in the CodingGuidelines
> > document that we should be using octal escapes instead.
> >
> > Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> > ---
> > diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
> > @@ -188,6 +188,9 @@ For shell scripts specifically (not exhaustive):
> > + - Use octal escape sequences (e.g. "\302\242"), not hexadecimal (e.g.
> > +   "\xc2\xa2"), as the latter is not portable.
> 
> The shell itself doesn't interpret these sequences, so this
> description feels too generic. Perhaps it would make more sense to
> cite specific tools for which octal sequences are needed for
> portability reasons, such as `printf`, `sed`, `tr`, etc.

Ah...good point. I checked with "echo" in "dash" and assumed that it
was "dash" that was interpreting the escapes, but indeed it is the
"echo" (and "printf") builtins in "dash" that are actually interpreting
them. What do you think of the following in the commit message:

  Hexadecimal escapes in shell scripts are not portable across shell builtins (in
  particular, the "printf" of "dash" does not support them). Write in the CodingGuidelines
  document that we should be using octal escapes instead.

and in the CodingGuidelines doc:

+ - Use octal escape sequences (e.g. "\302\242"), not hexadecimal (e.g.
+   "\xc2\xa2"), as the latter is not portable across some shell builtins like printf.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] CodingGuidelines: use octal escapes, not hex
  2023-06-13 18:43             ` Jonathan Tan
@ 2023-06-13 19:15               ` Eric Sunshine
  2023-06-13 19:29                 ` Junio C Hamano
  0 siblings, 1 reply; 116+ messages in thread
From: Eric Sunshine @ 2023-06-13 19:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Tue, Jun 13, 2023 at 2:43 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> > On Tue, Jun 13, 2023 at 1:44 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> > > + - Use octal escape sequences (e.g. "\302\242"), not hexadecimal (e.g.
> > > +   "\xc2\xa2"), as the latter is not portable.
> >
> > The shell itself doesn't interpret these sequences, so this
> > description feels too generic. Perhaps it would make more sense to
> > cite specific tools for which octal sequences are needed for
> > portability reasons, such as `printf`, `sed`, `tr`, etc.
>
> Ah...good point. I checked with "echo" in "dash" and assumed that it
> was "dash" that was interpreting the escapes, but indeed it is the
> "echo" (and "printf") builtins in "dash" that are actually interpreting
> them. What do you think of the following in the commit message:
>
>   Hexadecimal escapes in shell scripts are not portable across shell builtins (in
>   particular, the "printf" of "dash" does not support them). Write in the CodingGuidelines
>   document that we should be using octal escapes instead.
>
> and in the CodingGuidelines doc:
>
> + - Use octal escape sequences (e.g. "\302\242"), not hexadecimal (e.g.
> +   "\xc2\xa2"), as the latter is not portable across some shell builtins like printf.

The portability concern is not specific to a certain shell or whether
a command is builtin, so it feels misleading to single out "dash" and
builtins. The same portability problems can crop up, as well, with
older (non-"dash") shells, and with commands which may or may not be
builtins (such as "echo" which, historically, was not always a
builtin), and non-builtins commands, such as "sed" and "tr".

So, for the commit message, perhaps simply:

    Extend the shell-scripting section of CodingGuidelines to suggest
    octal escape sequences (e.g. "\302\242") over hexadecimal
    (e.g. "\xc2\xa2") since the latter can be a source of portability
    problems.

As for the change to CodingGuidelines, this would probably be sufficient:

    Use octal escape sequences (e.g. "\302\242"), not hexadecimal
    (e.g. "\xc2\xa2"), since the latter is not portable across some
    commands, such as `printf`, `sed`, `tr`, etc.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Changed path filter hash fix and version bump
  2023-06-13 17:16       ` Jonathan Tan
  2023-06-13 17:29         ` [PATCH] CodingGuidelines: use octal escapes, not hex Jonathan Tan
@ 2023-06-13 19:16         ` Junio C Hamano
  1 sibling, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-06-13 19:16 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Ramsay Jones, git

Jonathan Tan <jonathantanmy@google.com> writes:

>> + - In 'printf' format string, do not use hexadecimals, as they are not
>> +   portable.  Write 
>> +
>> +     CENT=$(printf "\302\242")
>> +
>> +   not
>> +
>> +     CENT=$(printf "\xc2\xa2")
>
> I've checked with "dash" and this applies to any quoted string, not just
> when passed to printf. I'll prepare a patch describing this.

What do you mean by "any quoted string"?

I think built-in 'echo' of dash takes "\302\242" and emits the
cent-sign (but /bin/echo may not), but I do not think it is "any
quoted string".  To wit:

    dash$ echo '\302\242'
    ¢

The echo built into dash shows the cent-sign.  The example does not
let us tell if is the shell (i.e. the 'echo' command sees the
cent-sign in its argv[1]) or if it is the command (i.e. the 'echo'
sees 8-byte string "bs three zero two bs two four two" in its
argv[1], and shows that as the cent-sign), though.

    dash$ /bin/echo '\302\242'
    \302\242

This shows that shell is not doing anything fancy.  /bin/echo gets
the 8-byte string in its argv[1] and emits that as-is.

    dash$ /bin/echo "\302\242"
    \302\242

Again this shows the same; I added this example to demonstrate that
it is _not_ like the shell interprets the backslashed octal strings
depending on how they are quoted.

    dash$ printf "%s\n" '\302\242'
    \302\242
    dash$ printf "%s\n" "\302\242"
    \302\242

And these demonstrates that argv[2] given to printf in these cases
are 8-byte string "bs three zero two bs two four two" and the shell
is not doing anything fancy.

So, I would suggest not saying "any quoted string".  In addition,
even though dash's built-in echo that recognizes "\0num" seems to be
conformant to what POSIX specifies (cf. [*1*]), GNU requires "-e" in
order to do so in both standalone 'echo' binary and one built into
'bash', so we cannot rely on this POSIX behaviour when writing a
portable script.  Hence, I would recommend us to focus on giving a
piece of advice on use of printf in this part of the documentation.


[References]

*1* https://pubs.opengroup.org/onlinepubs/9699919799/utilities/echo.html
*2* https://www.gnu.org/software/coreutils/manual/html_node/echo-invocation.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
                     ` (3 preceding siblings ...)
  2023-06-13 17:39   ` [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-06-13 19:21   ` Junio C Hamano
  2023-06-20 13:43     ` Derrick Stolee
  4 siblings, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-06-13 19:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Ramsay Jones, Derrick Stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks Ramsay for spotting the errors and mentioning that I can use
> octal escapes. Here's an update taking into account their comments.

The changes look good.  Will queue.

Stolee, you had comments on an earlier round---how does this one
look?

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] CodingGuidelines: use octal escapes, not hex
  2023-06-13 19:15               ` Eric Sunshine
@ 2023-06-13 19:29                 ` Junio C Hamano
  0 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-06-13 19:29 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jonathan Tan, git

Eric Sunshine <sunshine@sunshineco.com> writes:

> So, for the commit message, perhaps simply:
>
>     Extend the shell-scripting section of CodingGuidelines to suggest
>     octal escape sequences (e.g. "\302\242") over hexadecimal
>     (e.g. "\xc2\xa2") since the latter can be a source of portability
>     problems.
>
> As for the change to CodingGuidelines, this would probably be sufficient:
>
>     Use octal escape sequences (e.g. "\302\242"), not hexadecimal
>     (e.g. "\xc2\xa2"), since the latter is not portable across some
>     commands, such as `printf`, `sed`, `tr`, etc.

I'd prefer singling out `printf`, actually, and not talking about
"across some commands".

As I said in a separate message, we certainly do *not* want to rely
on `echo` interpreting bs-escaped octal sequences without '-e', even
though it may be expected on a POSIX systems, because it is not
portable across systems our users commonly encounter.

And `printf` has been what we chose to turn bs-escaped octal
sequence into binary.  I'd prefer not having to even worry about
`sed`, `tr`, etc. behaving differently and not allowing to expect
these other commands to be usable for turning bs-escaped octal
sequence into binary would be one way to achieve that goal.

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-13 17:39   ` [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-06-13 21:58     ` Junio C Hamano
  2023-06-20 13:22       ` Derrick Stolee
  2023-06-21 12:08       ` Taylor Blau
  0 siblings, 2 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-06-13 21:58 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Ramsay Jones

Jonathan Tan <jonathantanmy@google.com> writes:

> The code change to Git to support version 2 will be done in subsequent
> commits.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/gitformat-commit-graph.txt | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
> index 31cad585e2..112e6d36a6 100644
> --- a/Documentation/gitformat-commit-graph.txt
> +++ b/Documentation/gitformat-commit-graph.txt
> @@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
>  
>  ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>      * It starts with header consisting of three unsigned 32-bit integers:
> -      - Version of the hash algorithm being used. We currently only support
> -	value 1 which corresponds to the 32-bit version of the murmur3 hash
> +      - Version of the hash algorithm being used. We currently support
> +	value 2 which corresponds to the 32-bit version of the murmur3 hash
>  	implemented exactly as described in
>  	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
>  	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
>  	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> -	in Probabilistic Verification"
> +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears

"bloom" -> "Bloom", probably, as the name comes from the name of its
inventor (just like we spell "Boolean", not "boolean").

> +	when char is signed and the repository has path names that have characters >=
> +	0x80; Git supports reading and writing them, but this ability will be removed
> +	in a future version of Git.

Makes sense.

I wonder if we want to mention what the undesired misbehaviour the
"bug" causes and what we do to avoid getting affected by the bug
here.  If we can say something like "When querying for a pathname
with a byte with high-bit set, the buggy filter may produce false
negative, making the filter unusable, but asking for a pathname
without such a byte produces no false negatives (even though we may
get false positives).  When Git reads version 1 filter data, it
refrains from using it for processing paths with high-bit set to
avoid triggering the bug", then it would be ideal.  Or "When the
repository has even a single pathname with high-bit set anywhere in
its history, version 1 Bloom can give false negative when querying
any paths and becomes unusable.  You can use $THIS configuration
variable to disable use of Bloom filter data in such a case" would
also be fine.  The point is to give actionable piece of information
to the readers.

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-13 21:58     ` Junio C Hamano
@ 2023-06-20 13:22       ` Derrick Stolee
  2023-06-21 12:08       ` Taylor Blau
  1 sibling, 0 replies; 116+ messages in thread
From: Derrick Stolee @ 2023-06-20 13:22 UTC (permalink / raw)
  To: Junio C Hamano, Jonathan Tan; +Cc: git, Ramsay Jones

On 6/13/2023 5:58 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:

>> +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
> 
> "bloom" -> "Bloom", probably, as the name comes from the name of its
> inventor (just like we spell "Boolean", not "boolean").

The ultimate recognition comes from when the term named after you
becomes lower-case ("boolean" is sometimes in this category, but
definitely "abelian" is an example).

In this case, you are right that we should capitalize Bloom.

>> +	when char is signed and the repository has path names that have characters >=
>> +	0x80; Git supports reading and writing them, but this ability will be removed
>> +	in a future version of Git.
> 
> Makes sense.

I also like how you organized this: "We support version 2. 1 is
still around but not for long."
 
> I wonder if we want to mention what the undesired misbehaviour the
> "bug" causes and what we do to avoid getting affected by the bug
> here.  If we can say something like "When querying for a pathname
> with a byte with high-bit set, the buggy filter may produce false
> negative, making the filter unusable, but asking for a pathname
> without such a byte produces no false negatives (even though we may
> get false positives).  When Git reads version 1 filter data, it
> refrains from using it for processing paths with high-bit set to
> avoid triggering the bug", then it would be ideal.  Or "When the
> repository has even a single pathname with high-bit set anywhere in
> its history, version 1 Bloom can give false negative when querying
> any paths and becomes unusable.  You can use $THIS configuration
> variable to disable use of Bloom filter data in such a case" would
> also be fine.  The point is to give actionable piece of information
> to the readers.

This is definitely helpful, but if someone is having issues we
would say "try version 2 and see if it still happens" and not
over-index on the underlying reason.

That's to say, I'm OK with the shorter description of the problem.
Feel free to expand if you're interested, though.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-06-13 17:39   ` [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-06-20 13:28     ` Derrick Stolee
  2023-06-21 12:14     ` Taylor Blau
  1 sibling, 0 replies; 116+ messages in thread
From: Derrick Stolee @ 2023-06-20 13:28 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: Ramsay Jones, Junio C Hamano

On 6/13/2023 1:39 PM, Jonathan Tan wrote:
> A subsequent commit will introduce another version of the changed-path
> filter in the commit graph file. In order to control which version is
> to be accepted when read (and which version to write), a config variable
> is needed.
> 
> Therefore, introduce this config variable. For forwards compatibility,
> teach Git to not read commit graphs when the config variable
> is set to an unsupported version. Because we teach Git this,
> commitgraph.readChangedPaths is now redundant, so deprecate it and
> define its behavior in terms of the config variable we introduce.

I'm late to the party, but I support this change. The safety valve
of "I want to turn this off if something goes wrong" is long overdue
for deletion (it would help someone in this high-bit situation).

Having a new replacement is a good way to preserve the safety valve
behavior while also promoting the new versioning scheme.
 
> This commit does not change the behavior of writing (Git writes changed
> path filters when explicitly instructed regardless of any config
> variable), but a subsequent commit will restrict Git such that it will
> only write when commitgraph.changedPathsVersion is 0, 1, or 2.

>  commitGraph.readChangedPaths::
> -	If true, then git will use the changed-path Bloom filters in the
> -	commit-graph file (if it exists, and they are present). Defaults to
> -	true. See linkgit:git-commit-graph[1] for more information.
> +	Deprecated. Equivalent to changedPathsVersion=1 if true, and
> +	changedPathsVersion=0 if false.

This defaulted to true before...

> +commitGraph.changedPathsVersion::
> +	Specifies the version of the changed-path Bloom filters that Git will read and
> +	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
> +	match the version set in this config variable will be ignored.
> ++
> +Defaults to 1.

So this version defaults to 1. Good.

> ++
> +If 0, git will write version 1 Bloom filters when instructed to write.

> -	if (s->commit_graph_read_changed_paths) {
> +	if (s->commit_graph_changed_paths_version == 1) {
>  		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
>  			   &graph->chunk_bloom_indexes);
>  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,


>  	/* Commit graph config or default, does not cascade (simple) */
>  	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
>  	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
> -	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
> +	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
> +	repo_cfg_int(r, "commitgraph.changedpathsversion",
> +		     &r->settings.commit_graph_changed_paths_version,
> +		     readChangedPaths ? 1 : 0);

And here we default to 'true' and '1', respectively. This allows 
'commitGraph.readChangedPaths=false' to override
'commitGraph.changedPathsVersion=1'. Should that implication be
documented?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-06-13 17:39   ` [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-06-20 13:39     ` Derrick Stolee
  2023-06-20 18:37       ` Junio C Hamano
  0 siblings, 1 reply; 116+ messages in thread
From: Derrick Stolee @ 2023-06-20 13:39 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: Ramsay Jones, Junio C Hamano

On 6/13/2023 1:39 PM, Jonathan Tan wrote:

>  commitGraph.changedPathsVersion::
>  	Specifies the version of the changed-path Bloom filters that Git will read and
> -	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
> +	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
>  	match the version set in this config variable will be ignored.
>  +
>  Defaults to 1.

Is this a good place to document the planned modification of this default in
future versions of Git?

> +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
>  {
>  	const uint32_t c1 = 0xcc9e2d51;
>  	const uint32_t c2 = 0x1b873593;
> @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
>  	int i;
>  	const uint32_t seed0 = 0x293ae76f;
>  	const uint32_t seed1 = 0x7e646e2c;
> -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> +	const uint32_t hash0 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
> +	const uint32_t hash1 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);

This is the critical step. I've not seen this ternary trick within
a function name choice, but it makes for a compact check. Nice.

However, I think the 'settings->hash_version' is the wrong place
to look for the condition. We should be getting this value from the
commit-graph we are reading. (More on this later.)

>  struct bloom_filter_settings {
>  	/*
>  	 * The version of the hashing technique being used.
> -	 * We currently only support version = 1 which is
> +	 * The newest version is 2, which is
>  	 * the seeded murmur3 hashing technique implemented
> -	 * in bloom.c.
> +	 * in bloom.c. Bloom filters of version 1 were created
> +	 * with prior versions of Git, which had a bug in the
> +	 * implementation of the hash function.

...

> +struct graph_read_bloom_data_data {
> +	struct commit_graph *g;
> +	int commit_graph_changed_paths_version;
> +};
> +
>  static int graph_read_bloom_data(const unsigned char *chunk_start,
>  				  size_t chunk_size, void *data)
>  {
> -	struct commit_graph *g = data;
> +	struct graph_read_bloom_data_data *d = data;
> +	struct commit_graph *g = d->g;
>  	uint32_t hash_version;
>  	g->chunk_bloom_data = chunk_start;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (hash_version != 1)
> +	if (hash_version != d->commit_graph_changed_paths_version)
>  		return 0;

This makes it appear like we cannot read a commit-graph that has
a Bloom filter version that doesn't match the configured version.

This seems incorrect. If we want to configure to _write_ v2, we
should still be able to _read_ v1 concurrently until those v2
filters are written.

This check should be:

	if (hash_version <= 0 || hash_version > 2)
		return 0;

and then 

	g->filter_hash_version = hash_version;

to store this hash version somewhere in the graph. This way, we
can read any commit-graph file and will not suffer performance
problems in the time between setting the config value and writing
the new commit-graph file.

> -	if (s->commit_graph_changed_paths_version == 1) {
> +	if (s->commit_graph_changed_paths_version == 1
> +	    || s->commit_graph_changed_paths_version == 2) {

Perhaps this could just be

	if (s->commit_graph_changed_paths_version) {

to say "not zero means we still read the filters". Though, since
this config _should_ mean "which version do we _write_?" it might
be good to go back on the "unifying the two config options".

> +		struct graph_read_bloom_data_data data = {
> +			.g = graph,
> +			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
> +		};
>  		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
>  			   &graph->chunk_bloom_indexes);
>  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
> -			   graph_read_bloom_data, graph);
> +			   graph_read_bloom_data, &data);
>  	}

Much of this block will not be necessary as we don't need to
send the repo settings into the read_chunk() method.

> +	if (r->settings.commit_graph_changed_paths_version < 0
> +	    || r->settings.commit_graph_changed_paths_version > 2) {
> +		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
> +			r->settings.commit_graph_changed_paths_version);
> +		return 0;

I see the "< 0" means we aren't considering the case of disabling 
writes with the zero value. We should exit early if we see a zero
here, for extra safety.

> +	}
> +	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
> +		? 2 : 1;

Once we've checked that this value is not zero, we can do this:

	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;


>  		/* We have changed-paths already. Keep them in the next graph */
> -		if (g && g->chunk_bloom_data) {
> +		if (g && g->bloom_filter_settings) {
>  			ctx->changed_paths = 1;
>  			ctx->bloom_settings = g->bloom_filter_settings;
>  		}


> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -450,4 +450,35 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
>  		test_bloom_filters_used "-- $CENT")
>  '
>  
> +test_expect_success 'version 1 changed-path not used when version 2 requested' '
> +	(cd highbit1 &&
> +		git config --add commitgraph.changedPathsVersion 2 &&
> +		test_bloom_filters_not_used "-- $CENT")
> +'

I think this test is the wrong behavior. We should be able to use version 1
when version 2 is requested.

Instead, start with version 1, and _upgrade_ to version 2 and then check
which version exists in the file.

We should only _not_ use the filters if the version is 0 (or
commitGraph.readChangedPaths=false). I think this might be a good enough reason
to 

> +
> +test_expect_success 'set up repo with high bit path, version 2 changed-path' '
> +	git init highbit2 &&
> +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
> +	test_commit -C highbit2 c2 "$CENT" &&
> +	git -C highbit2 commit-graph write --reachable --changed-paths
> +'
> +
> +test_expect_success 'check value of version 2 changed-path' '
> +	(cd highbit2 &&
> +		printf "c01f" >expect &&
> +		get_first_changed_path_filter >actual &&
> +		test_cmp expect actual)
> +'
> +
> +test_expect_success 'version 2 changed-path used when version 2 requested' '
> +	(cd highbit2 &&
> +		test_bloom_filters_used "-- $CENT")
> +'
> +
> +test_expect_success 'version 2 changed-path not used when version 1 requested' '
> +	(cd highbit2 &&
> +		git config --add commitgraph.changedPathsVersion 1 &&
> +		test_bloom_filters_not_used "-- $CENT")
> +'

Again, this is also the wrong situation.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-13 19:21   ` [PATCH v4 0/4] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-06-20 13:43     ` Derrick Stolee
  2023-06-20 21:56       ` Jonathan Tan
  2023-06-21 12:19       ` Taylor Blau
  0 siblings, 2 replies; 116+ messages in thread
From: Derrick Stolee @ 2023-06-20 13:43 UTC (permalink / raw)
  To: Junio C Hamano, Jonathan Tan; +Cc: git, Ramsay Jones

On 6/13/2023 3:21 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>> Thanks Ramsay for spotting the errors and mentioning that I can use
>> octal escapes. Here's an update taking into account their comments.
> 
> The changes look good.  Will queue.
> 
> Stolee, you had comments on an earlier round---how does this one
> look?

I'm sorry I'm so late to this. I've been meaning to get to it, but
it's been a crazy couple of weeks.

This version is not ready. The backwards compatibility story is
incomplete.

When commitGraph.changedPathsVersion is set, it does not allow
reading a previous filter version, leaving us in a poor performance
state until the commit-graph file can be rewritten.

While I was reviewing, it seemed reasonable to deprecate
commitGraph.readChangedPaths, but this use of "also restrict writes
to this version" didn't make sense to me at the time. Instead, it
would be good to have this clarity between the config options:

 commitGraph.readChangedPaths: should we read and use the filters
 that exist on disk? Defaults to 'true'.

 commitGraph.changedPathsVersion: Which version should we _write_
 when writing a new commit-graph? Defaults to '1' but will default
 to '2' in the next major verion, then '1' will no longer be an
 accepted value in the version after that.

The tricky part is that during the commit-graph write, you will
need to check the existing filter value to see if it matches. If
not, the filters will need to be recomputed from scratch. This
will change patch 4 a bit, but it's the right thing to do.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-06-20 13:39     ` Derrick Stolee
@ 2023-06-20 18:37       ` Junio C Hamano
  0 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-06-20 18:37 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, git, Ramsay Jones

Derrick Stolee <derrickstolee@github.com> writes:

> However, I think the 'settings->hash_version' is the wrong place
> to look for the condition. We should be getting this value from the
> commit-graph we are reading. (More on this later.)
> ...
>> -	if (hash_version != 1)
>> +	if (hash_version != d->commit_graph_changed_paths_version)
>>  		return 0;
>
> This makes it appear like we cannot read a commit-graph that has
> a Bloom filter version that doesn't match the configured version.
>
> This seems incorrect. If we want to configure to _write_ v2, we
> should still be able to _read_ v1 concurrently until those v2
> filters are written.

Good eyes.  Thanks for carefully reading.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-20 13:43     ` Derrick Stolee
@ 2023-06-20 21:56       ` Jonathan Tan
  2023-06-21 12:19       ` Taylor Blau
  1 sibling, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-06-20 21:56 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, Junio C Hamano, git, Ramsay Jones

Derrick Stolee <derrickstolee@github.com> writes:
> When commitGraph.changedPathsVersion is set, it does not allow
> reading a previous filter version, leaving us in a poor performance
> state until the commit-graph file can be rewritten.
> 
> While I was reviewing, it seemed reasonable to deprecate
> commitGraph.readChangedPaths, but this use of "also restrict writes
> to this version" didn't make sense to me at the time. Instead, it
> would be good to have this clarity between the config options:
> 
>  commitGraph.readChangedPaths: should we read and use the filters
>  that exist on disk? Defaults to 'true'.
> 
>  commitGraph.changedPathsVersion: Which version should we _write_
>  when writing a new commit-graph? Defaults to '1' but will default
>  to '2' in the next major verion, then '1' will no longer be an
>  accepted value in the version after that.
> 
> The tricky part is that during the commit-graph write, you will
> need to check the existing filter value to see if it matches. If
> not, the filters will need to be recomputed from scratch. This
> will change patch 4 a bit, but it's the right thing to do.
> 
> Thanks,
> -Stolee

OK - this sounds reasonable. I'll take a look.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-13 21:58     ` Junio C Hamano
  2023-06-20 13:22       ` Derrick Stolee
@ 2023-06-21 12:08       ` Taylor Blau
  2023-06-22 22:26         ` Jonathan Tan
  1 sibling, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-06-21 12:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Ramsay Jones

On Tue, Jun 13, 2023 at 02:58:24PM -0700, Junio C Hamano wrote:
> "bloom" -> "Bloom", probably, as the name comes from the name of its
> inventor (just like we spell "Boolean", not "boolean").

Indeed.

> > +	when char is signed and the repository has path names that have characters >=
> > +	0x80; Git supports reading and writing them, but this ability will be removed
> > +	in a future version of Git.
>
> Makes sense.
>
> I wonder if we want to mention what the undesired misbehaviour the
> "bug" causes and what we do to avoid getting affected by the bug
> here.  If we can say something like "When querying for a pathname
> with a byte with high-bit set, the buggy filter may produce false
> negative, making the filter unusable, but asking for a pathname
> without such a byte produces no false negatives (even though we may
> get false positives).  When Git reads version 1 filter data, it
> refrains from using it for processing paths with high-bit set to
> avoid triggering the bug", then it would be ideal.

Your description of the bug matches my understanding of the issue, that
a corrupt filter would produce false negatives and thus be unusable.

I skimmed through the rest of the series, and couldn't find a spot where
we do the latter, i.e. still use v1 filters as long as we don't have any
characters in the path with high-order bits set.

I think this would be as simple as modifying the Bloom filter query
function to return "maybe" before even trying to hash a path with at
least one character with its high-bit set.

Apologies if this functionality is implemented and I just missed it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-06-13 17:39   ` [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
  2023-06-20 13:28     ` Derrick Stolee
@ 2023-06-21 12:14     ` Taylor Blau
  1 sibling, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-06-21 12:14 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Ramsay Jones, Junio C Hamano

On Tue, Jun 13, 2023 at 10:39:57AM -0700, Jonathan Tan wrote:
> diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
> index 30604e4a4c..eaa10bf232 100644
> --- a/Documentation/config/commitgraph.txt
> +++ b/Documentation/config/commitgraph.txt
> @@ -9,6 +9,16 @@ commitGraph.maxNewFilters::
>  	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
>
>  commitGraph.readChangedPaths::
> -	If true, then git will use the changed-path Bloom filters in the
> -	commit-graph file (if it exists, and they are present). Defaults to
> -	true. See linkgit:git-commit-graph[1] for more information.
> +	Deprecated. Equivalent to changedPathsVersion=1 if true, and
> +	changedPathsVersion=0 if false.
> +
> +commitGraph.changedPathsVersion::
> +	Specifies the version of the changed-path Bloom filters that Git will read and
> +	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
> +	match the version set in this config variable will be ignored.
> ++
> +Defaults to 1.
> ++
> +If 0, git will write version 1 Bloom filters when instructed to write.
> ++
> +See linkgit:git-commit-graph[1] for more information.

Hmm. I'm a little confused: we should still be able to use the old
broken filters if (and only if) the paths we're querying don't have any
bytes with their high-order bit set, no?

That should be true with the caveat that querying such a path would need
to result in our querying function returning "maybe" instead of
"definitely not" to protect against the false-negatives described
earlier.

As I read this, it seems to imply that as soon as this change lands that
we'll stop reading old Bloom filters altogether. Is that the case?

If so, I wonder if we can do this with without needing this
configuration setting at all (by writing the newest version of Bloom
filters possible, and working around the existing ones with the
aforementioned workaround).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-20 13:43     ` Derrick Stolee
  2023-06-20 21:56       ` Jonathan Tan
@ 2023-06-21 12:19       ` Taylor Blau
  2023-06-21 17:53         ` Derrick Stolee
  1 sibling, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-06-21 12:19 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, Jonathan Tan, git, Ramsay Jones

On Tue, Jun 20, 2023 at 09:43:46AM -0400, Derrick Stolee wrote:
> This version is not ready. The backwards compatibility story is
> incomplete.

I'm also late to the party, but I agree with Stolee here, having come to
the same conclusion about needing to support reading older (corrupt) Bloom
filters when possible (i.e. when paths contain no bytes which have their
high-order bits set), and assuming the filter contains all paths
otherwise.

>  commitGraph.changedPathsVersion: Which version should we _write_
>  when writing a new commit-graph? Defaults to '1' but will default
>  to '2' in the next major verion, then '1' will no longer be an
>  accepted value in the version after that.

I am not sure if there's a situation where we'd ever want to not write
the newer versions when starting a new commit-graph (or chain) from
scratch.

I think that follows from what you and I are both suggesting w.r.t
backwards compatibility. If that's the case, I think that we could in
theory drop this configuration setting altogether.

Or, at the very least, we should be able to change it change only what
version we *write*, not read. I think this is what you are suggesting
above, but I am not 100% sure, so apologies if I'm just repeating what
you've already suggested.

> The tricky part is that during the commit-graph write, you will
> need to check the existing filter value to see if it matches. If
> not, the filters will need to be recomputed from scratch. This
> will change patch 4 a bit, but it's the right thing to do.

Yup, good suggestion.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-21 12:19       ` Taylor Blau
@ 2023-06-21 17:53         ` Derrick Stolee
  2023-06-22 22:27           ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Derrick Stolee @ 2023-06-21 17:53 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, Jonathan Tan, git, Ramsay Jones

On 6/21/2023 8:19 AM, Taylor Blau wrote:
> On Tue, Jun 20, 2023 at 09:43:46AM -0400, Derrick Stolee wrote:

>>  commitGraph.changedPathsVersion: Which version should we _write_
>>  when writing a new commit-graph? Defaults to '1' but will default
>>  to '2' in the next major verion, then '1' will no longer be an
>>  accepted value in the version after that.
> 
> I am not sure if there's a situation where we'd ever want to not write
> the newer versions when starting a new commit-graph (or chain) from
> scratch.

I'd rather have the choice to start writing the new filter mode be
made by config rather than a change to the Git binary. Makes for a
more gradual rollout to be sure there aren't issues with the new
version.

So please keep the configuration value, but have it indicate the
mode used when writing filters.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-21 12:08       ` Taylor Blau
@ 2023-06-22 22:26         ` Jonathan Tan
  2023-06-23 13:05           ` Derrick Stolee
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-22 22:26 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, Junio C Hamano, git, Ramsay Jones

Taylor Blau <me@ttaylorr.com> writes:
> > I wonder if we want to mention what the undesired misbehaviour the
> > "bug" causes and what we do to avoid getting affected by the bug
> > here.  If we can say something like "When querying for a pathname
> > with a byte with high-bit set, the buggy filter may produce false
> > negative, making the filter unusable, but asking for a pathname
> > without such a byte produces no false negatives (even though we may
> > get false positives).  When Git reads version 1 filter data, it
> > refrains from using it for processing paths with high-bit set to
> > avoid triggering the bug", then it would be ideal.
> 
> Your description of the bug matches my understanding of the issue, that
> a corrupt filter would produce false negatives and thus be unusable.
> 
> I skimmed through the rest of the series, and couldn't find a spot where
> we do the latter, i.e. still use v1 filters as long as we don't have any
> characters in the path with high-order bits set.
> 
> I think this would be as simple as modifying the Bloom filter query
> function to return "maybe" before even trying to hash a path with at
> least one character with its high-bit set.
> 
> Apologies if this functionality is implemented and I just missed it.
> 
> Thanks,
> Taylor

Thanks for the suggestion - yeah, this might work.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-21 17:53         ` Derrick Stolee
@ 2023-06-22 22:27           ` Jonathan Tan
  2023-06-23 13:18             ` Derrick Stolee
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-06-22 22:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jonathan Tan, Taylor Blau, Junio C Hamano, git, Ramsay Jones

Derrick Stolee <derrickstolee@github.com> writes:
> On 6/21/2023 8:19 AM, Taylor Blau wrote:
> > On Tue, Jun 20, 2023 at 09:43:46AM -0400, Derrick Stolee wrote:
> 
> >>  commitGraph.changedPathsVersion: Which version should we _write_
> >>  when writing a new commit-graph? Defaults to '1' but will default
> >>  to '2' in the next major verion, then '1' will no longer be an
> >>  accepted value in the version after that.
> > 
> > I am not sure if there's a situation where we'd ever want to not write
> > the newer versions when starting a new commit-graph (or chain) from
> > scratch.
> 
> I'd rather have the choice to start writing the new filter mode be
> made by config rather than a change to the Git binary. Makes for a
> more gradual rollout to be sure there aren't issues with the new
> version.
> 
> So please keep the configuration value, but have it indicate the
> mode used when writing filters.
> 
> Thanks,
> -Stolee

It looks like we can't avoid writing both versions (we need to write
version 1 so that we can reuse existing Bloom filters when writing, if
the repo has version 1 Bloom filters) so a config that tells us which to
write sounds doable. I'll take a look.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-06-22 22:26         ` Jonathan Tan
@ 2023-06-23 13:05           ` Derrick Stolee
  0 siblings, 0 replies; 116+ messages in thread
From: Derrick Stolee @ 2023-06-23 13:05 UTC (permalink / raw)
  To: Jonathan Tan, Taylor Blau; +Cc: Junio C Hamano, git, Ramsay Jones

On 6/22/2023 6:26 PM, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>>> I wonder if we want to mention what the undesired misbehaviour the
>>> "bug" causes and what we do to avoid getting affected by the bug
>>> here.  If we can say something like "When querying for a pathname
>>> with a byte with high-bit set, the buggy filter may produce false
>>> negative, making the filter unusable, but asking for a pathname
>>> without such a byte produces no false negatives (even though we may
>>> get false positives).  When Git reads version 1 filter data, it
>>> refrains from using it for processing paths with high-bit set to
>>> avoid triggering the bug", then it would be ideal.
>>
>> Your description of the bug matches my understanding of the issue, that
>> a corrupt filter would produce false negatives and thus be unusable.
>>
>> I skimmed through the rest of the series, and couldn't find a spot where
>> we do the latter, i.e. still use v1 filters as long as we don't have any
>> characters in the path with high-order bits set.
>>
>> I think this would be as simple as modifying the Bloom filter query
>> function to return "maybe" before even trying to hash a path with at
>> least one character with its high-bit set.
>>
>> Apologies if this functionality is implemented and I just missed it.
>>
>> Thanks,
>> Taylor
> 
> Thanks for the suggestion - yeah, this might work.

If I understand the situation correctly, the high bits can make the
hashes "not very random" but they are still effective at identifying
the "maybe" case consistently for the inputs it is given (it would not
present a "no" when it should not, but it might say "maybe" more often
than it should). The behavior is only incorrect if the same commit-graph
file is used with two different Git versions that were compiled with
different signed-ness.

If that is the case, then ignoring the Bloom filters when you see a
high bit would change the performance implication from "probably
slower" to "definitely slower" but not affect the correctness in a
system that doesn't have competing Git versions with different
compiler semantics.

That is to say, doing this extra work doesn't seem to be critical to
making this change. The ROI seems too low.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v4 0/4] Changed path filter hash fix and version bump
  2023-06-22 22:27           ` Jonathan Tan
@ 2023-06-23 13:18             ` Derrick Stolee
  0 siblings, 0 replies; 116+ messages in thread
From: Derrick Stolee @ 2023-06-23 13:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Taylor Blau, Junio C Hamano, git, Ramsay Jones

On 6/22/2023 6:27 PM, Jonathan Tan wrote:

> It looks like we can't avoid writing both versions (we need to write
> version 1 so that we can reuse existing Bloom filters when writing, if
> the repo has version 1 Bloom filters) so a config that tells us which to
> write sounds doable. I'll take a look.

I don't fully understand what you're saying here.

We need to be able to write both versions (not simultaneously, but
toggled via config) so we can roll out this change carefully instead
of suddenly due to the Git executable changing.

But we don't need to be able to write version 1 just so we can reuse
version 1 filters. In fact, we should be able to upgrade to version 2
if the config points at that, but we should _not_ re-use the filters
in that case.

This does present an interesting challenge for the upgrade: we have
the commitGraph.maxNewFilters option to limit the amount of new filters
we write at a given time. When shifting to version 2, we will start
from scratch, so that could have some effect. I will consider how to
handle this, perhaps by raising the number temporarily.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v5 0/4] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (5 preceding siblings ...)
  2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
@ 2023-07-13 21:42 ` Jonathan Tan
  2023-07-13 21:42   ` [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
                     ` (4 more replies)
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
  8 siblings, 5 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-13 21:42 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano, Taylor Blau

Sorry it took me a while to get back to this. Looking at the existing
code, Bloom filters are passed around a lot without context, especially
when writing - they are generated into a commit slab and then when it is
time to write them to disk, they are taken from that commit slab. And
rather than annotating where they are passed around, I thought it better
to stick to the single-version approach in version 4 (per Git invocation
and per repo, only one version), which also sidesteps what happens if
there so happens to be multiple commit graphs each with their own Bloom
filter version (not possible to be generated by Git but possible with
a hex editor) and what happens if we want to write a different version
than what is currently stored in the commit slab. But with an auto-
detection of that version, I think we have what we need; in regular
operation, Git will run with whatever the version on disk is, and when
it is time to migrate, the user can explicitly specify the version.

I did not implement the mitigation of not using the Bloom filters when
a high-bit path is sought because, as Stolee says, this is useful only
when mixing Git implementations and will slow down operations (without
any increase in correctness) in the absence of such a mix [1]. But I can
implement this if need be.

[1] https://lore.kernel.org/git/e57b2272-b269-b705-3d42-d32e0b410f03@github.com/

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     |  19 +++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  |  65 ++++++++++++-
 bloom.h                                  |   8 +-
 commit-graph.c                           |  33 +++++--
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 t/helper/test-bloom.c                    |   9 +-
 t/t0095-bloom.sh                         |   8 ++
 t/t4216-log-bloom.sh                     | 117 +++++++++++++++++++++++
 11 files changed, 256 insertions(+), 22 deletions(-)

Range-diff against v4:
1:  a5955cda3d ! 1:  52e281eef0 gitformat-commit-graph: describe version 2 of BDAT
    @@ Documentation/gitformat-commit-graph.txt: All multi-byte numbers are in network
      	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
      	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
     -	in Probabilistic Verification"
    -+	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
    ++	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
     +	when char is signed and the repository has path names that have characters >=
     +	0x80; Git supports reading and writing them, but this ability will be removed
     +	in a future version of Git.
2:  68732120f9 ! 2:  94a4c7af38 t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
    -+		get_first_changed_path_filter >actual)
    ++		get_first_changed_path_filter >actual &&
    ++		test_cmp expect actual)
     +'
     +
     +# expect will not match actual if char is unsigned by default. Write the test
3:  44cbcc6a69 ! 3:  131095666d repo-settings: introduce commitgraph.changedPathsVersion
    @@ Commit message
         repo-settings: introduce commitgraph.changedPathsVersion
     
         A subsequent commit will introduce another version of the changed-path
    -    filter in the commit graph file. In order to control which version is
    -    to be accepted when read (and which version to write), a config variable
    -    is needed.
    +    filter in the commit graph file. In order to control which version to
    +    write (and read), a config variable is needed.
     
         Therefore, introduce this config variable. For forwards compatibility,
         teach Git to not read commit graphs when the config variable
    @@ Commit message
         This commit does not change the behavior of writing (Git writes changed
         path filters when explicitly instructed regardless of any config
         variable), but a subsequent commit will restrict Git such that it will
    -    only write when commitgraph.changedPathsVersion is 0, 1, or 2.
    +    only write when commitgraph.changedPathsVersion is a recognized value.
     
         Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
         Signed-off-by: Junio C Hamano <gitster@pobox.com>
    @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
     -	If true, then git will use the changed-path Bloom filters in the
     -	commit-graph file (if it exists, and they are present). Defaults to
     -	true. See linkgit:git-commit-graph[1] for more information.
    -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
    ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
     +	changedPathsVersion=0 if false.
     +
     +commitGraph.changedPathsVersion::
     +	Specifies the version of the changed-path Bloom filters that Git will read and
    -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
    ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
     +	match the version set in this config variable will be ignored.
     ++
    -+Defaults to 1.
    ++Defaults to -1.
    +++
    ++If -1, Git will use the version of the changed-path Bloom filters in the
    ++repository, defaulting to 1 if there are none.
     ++
     +If 0, git will write version 1 Bloom filters when instructed to write.
     ++
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
     -	if (s->commit_graph_read_changed_paths) {
    -+	if (s->commit_graph_changed_paths_version == 1) {
    ++	if (s->commit_graph_changed_paths_version != 0) {
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
     +	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
     +	repo_cfg_int(r, "commitgraph.changedpathsversion",
     +		     &r->settings.commit_graph_changed_paths_version,
    -+		     readChangedPaths ? 1 : 0);
    ++		     readChangedPaths ? -1 : 0);
      	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
      	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
      
4:  6dee3bfa70 ! 4:  47ba89c565 commit-graph: new filter ver. that fixes murmur3
    @@ Commit message
         So this patch does not include any mechanism to "salvage" changed path
         filters from repositories. There is also no "mixed" mode - for each
         invocation of Git, reading and writing changed path filters are done
    -    with the same version number.
    +    with the same version number; this version number may be explicitly
    +    stated (typically if the user knows which version they need) or
    +    automatically determined from the version of the existing changed path
    +    filters in the repository.
     
         There is a change in write_commit_graph(). graph_read_bloom_data()
         makes it possible for chunk_bloom_data to be non-NULL but
    @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths::
      
      commitGraph.changedPathsVersion::
      	Specifies the version of the changed-path Bloom filters that Git will read and
    --	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
    +-	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    ++	write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not
      	match the version set in this config variable will be ignored.
      +
    - Defaults to 1.
    + Defaults to -1.
     
      ## bloom.c ##
     @@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      
     +struct graph_read_bloom_data_data {
     +	struct commit_graph *g;
    -+	int commit_graph_changed_paths_version;
    ++	int *commit_graph_changed_paths_version;
     +};
     +
      static int graph_read_bloom_data(const unsigned char *chunk_start,
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      	hash_version = get_be32(chunk_start);
      
     -	if (hash_version != 1)
    -+	if (hash_version != d->commit_graph_changed_paths_version)
    - 		return 0;
    +-		return 0;
    ++	if (*d->commit_graph_changed_paths_version == -1) {
    ++		*d->commit_graph_changed_paths_version = hash_version;
    ++	} else if (hash_version != *d->commit_graph_changed_paths_version) {
    ++ 		return 0;
    ++	}
      
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    + 	g->bloom_filter_settings->hash_version = hash_version;
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
    - 			graph->read_generation_data = 1;
      	}
      
    --	if (s->commit_graph_changed_paths_version == 1) {
    -+	if (s->commit_graph_changed_paths_version == 1
    -+	    || s->commit_graph_changed_paths_version == 2) {
    + 	if (s->commit_graph_changed_paths_version != 0) {
     +		struct graph_read_bloom_data_data data = {
     +			.g = graph,
    -+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
    ++			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     +		};
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
      	ctx->num_generation_data_overflows = 0;
      
    -+	if (r->settings.commit_graph_changed_paths_version < 0
    ++	if (r->settings.commit_graph_changed_paths_version < -1
     +	    || r->settings.commit_graph_changed_paths_version > 2) {
     +		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
     +			r->settings.commit_graph_changed_paths_version);
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    +@@ t/t4216-log-bloom.sh: get_bdat_offset () {
    + 		.git/objects/info/commit-graph
    + }
    + 
    ++get_changed_path_filter_version () {
    ++	BDAT_OFFSET=$(get_bdat_offset) &&
    ++	perl -0777 -ne \
    ++		'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \
    ++		.git/objects/info/commit-graph
    ++}
    ++
    + get_first_changed_path_filter () {
    + 	BDAT_OFFSET=$(get_bdat_offset) &&
    + 	perl -0777 -ne \
    +@@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' '
    + 	git -C highbit1 commit-graph write --reachable --changed-paths
    + '
    + 
    +-test_expect_success 'setup check value of version 1 changed-path' '
    ++test_expect_success 'check value of version 1 changed-path' '
    + 	(cd highbit1 &&
    + 		printf "52a9" >expect &&
    + 		get_first_changed_path_filter >actual &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
      '
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    ++test_expect_success 'version 1 changed-path used when autodetect requested' '
    ++	(cd highbit1 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		test_bloom_filters_used "-- $CENT")
    ++'
    ++
    ++test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
    ++	test_commit -C highbit1 c1double "$CENT$CENT" &&
    ++	git -C highbit1 commit-graph write --reachable --changed-paths &&
    ++	(cd highbit1 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		printf "00000001" >expect &&
    ++		get_changed_path_filter_version >actual &&
    ++		test_cmp expect actual)
    ++'
    ++
     +test_expect_success 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
    ++
    ++test_expect_success 'version 2 changed-path used when autodetect requested' '
    ++	(cd highbit2 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		test_bloom_filters_used "-- $CENT")
    ++'
    ++
    ++test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
    ++	test_commit -C highbit2 c2double "$CENT$CENT" &&
    ++	git -C highbit2 commit-graph write --reachable --changed-paths &&
    ++	(cd highbit2 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		printf "00000002" >expect &&
    ++		get_changed_path_filter_version >actual &&
    ++		test_cmp expect actual)
    ++'
     +
      test_done
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
@ 2023-07-13 21:42   ` Jonathan Tan
  2023-07-19 17:25     ` Taylor Blau
  2023-07-13 21:42   ` [PATCH v5 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-13 21:42 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano, Taylor Blau

The code change to Git to support version 2 will be done in subsequent
commits.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/gitformat-commit-graph.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
index 31cad585e2..3e906e8030 100644
--- a/Documentation/gitformat-commit-graph.txt
+++ b/Documentation/gitformat-commit-graph.txt
@@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
 
 ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     * It starts with header consisting of three unsigned 32-bit integers:
-      - Version of the hash algorithm being used. We currently only support
-	value 1 which corresponds to the 32-bit version of the murmur3 hash
+      - Version of the hash algorithm being used. We currently support
+	value 2 which corresponds to the 32-bit version of the murmur3 hash
 	implemented exactly as described in
 	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
 	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
 	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
-	in Probabilistic Verification"
+	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
+	when char is signed and the repository has path names that have characters >=
+	0x80; Git supports reading and writing them, but this ability will be removed
+	in a future version of Git.
       - The number of times a path is hashed and hence the number of bit positions
 	      that cumulatively determine whether a file is present in the commit.
       - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v5 2/4] t4216: test changed path filters with high bit paths
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
  2023-07-13 21:42   ` [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-07-13 21:42   ` Jonathan Tan
  2023-07-13 22:50     ` Junio C Hamano
  2023-07-13 21:42   ` [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-13 21:42 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano, Taylor Blau

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4216-log-bloom.sh | 47 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..0cf208fdf5 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,51 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+get_first_changed_path_filter () {
+	BDAT_OFFSET=$(get_bdat_offset) &&
+	perl -0777 -ne \
+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
+		.git/objects/info/commit-graph
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\302\242")
+
+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'setup check value of version 1 changed-path' '
+	(cd highbit1 &&
+		printf "52a9" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+# expect will not match actual if char is unsigned by default. Write the test
+# in this way, so that a user running this test script can still see if the two
+# files match. (It will appear as an ordinary success if they match, and a skip
+# if not.)
+if test_cmp highbit1/expect highbit1/actual
+then
+	test_set_prereq SIGNED_CHAR_BY_DEFAULT
+fi
+test_expect_success SIGNED_CHAR_BY_DEFAULT 'check value of version 1 changed-path' '
+	# Only the prereq matters for this test.
+	true
+'
+
+test_expect_success 'version 1 changed-path used when version 1 requested' '
+	(cd highbit1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
  2023-07-13 21:42   ` [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
  2023-07-13 21:42   ` [PATCH v5 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-07-13 21:42   ` Jonathan Tan
  2023-07-19 18:10     ` Taylor Blau
  2023-07-13 21:42   ` [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2023-07-13 22:16   ` [PATCH v5 0/4] Changed path filter hash fix and version bump Junio C Hamano
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-13 21:42 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano, Taylor Blau

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version to
write (and read), a config variable is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is a recognized value.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt | 19 ++++++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..07f3799e05 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,19 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
+	changedPathsVersion=0 if false.
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
+	match the version set in this config variable will be ignored.
++
+Defaults to -1.
++
+If -1, Git will use the version of the changed-path Bloom filters in the
+repository, defaulting to 1 if there are none.
++
+If 0, git will write version 1 Bloom filters when instructed to write.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index c11b59f28b..9b72319450 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -399,7 +399,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version != 0) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 914026f5d8..b56731f51a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -18,7 +18,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 3dbd3f0e2e..e3b6565ffc 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int readChangedPaths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     readChangedPaths ? -1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index e8c67ffe16..1f1c32a6dd 100644
--- a/repository.h
+++ b/repository.h
@@ -32,7 +32,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int gc_cruft_packs;
 	int fetch_write_commit_graph;
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
                     ` (2 preceding siblings ...)
  2023-07-13 21:42   ` [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-07-13 21:42   ` Jonathan Tan
  2023-07-19 18:24     ` Taylor Blau
  2023-07-13 22:16   ` [PATCH v5 0/4] Changed path filter hash fix and version bump Junio C Hamano
  4 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-13 21:42 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano, Taylor Blau

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number; this version number may be explicitly
stated (typically if the user knows which version they need) or
automatically determined from the version of the existing changed path
filters in the repository.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  2 +-
 bloom.c                              | 65 +++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 31 ++++++++++--
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 72 +++++++++++++++++++++++++++-
 7 files changed, 181 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 07f3799e05..b93ccfba8f 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -14,7 +14,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
+	write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not
 	match the version set in this config variable will be ignored.
 +
 Defaults to -1.
diff --git a/bloom.c b/bloom.c
index d0730525da..915d8e5a31 100644
--- a/bloom.c
+++ b/bloom.c
@@ -65,7 +65,64 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	const uint32_t hash0 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
+	const uint32_t hash1 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index adde6dfe21..0c33ae282c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -75,7 +77,7 @@ struct bloom_key {
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index 9b72319450..c50107eed5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -302,16 +302,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_data {
+	struct commit_graph *g;
+	int *commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_data *d = data;
+	struct commit_graph *g = d->g;
 	uint32_t hash_version;
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
-		return 0;
+	if (*d->commit_graph_changed_paths_version == -1) {
+		*d->commit_graph_changed_paths_version = hash_version;
+	} else if (hash_version != *d->commit_graph_changed_paths_version) {
+ 		return 0;
+	}
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
 	g->bloom_filter_settings->hash_version = hash_version;
@@ -400,10 +409,14 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}
 
 	if (s->commit_graph_changed_paths_version != 0) {
+		struct graph_read_bloom_data_data data = {
+			.g = graph,
+			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &data);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2302,6 +2315,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < -1
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2331,7 +2352,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6c900ca668..34b8dd9164 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -48,6 +48,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -62,7 +63,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 0cf208fdf5..6ff26e5af5 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -410,6 +410,13 @@ get_bdat_offset () {
 		.git/objects/info/commit-graph
 }
 
+get_changed_path_filter_version () {
+	BDAT_OFFSET=$(get_bdat_offset) &&
+	perl -0777 -ne \
+		'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \
+		.git/objects/info/commit-graph
+}
+
 get_first_changed_path_filter () {
 	BDAT_OFFSET=$(get_bdat_offset) &&
 	perl -0777 -ne \
@@ -426,7 +433,7 @@ test_expect_success 'set up repo with high bit path, version 1 changed-path' '
 	git -C highbit1 commit-graph write --reachable --changed-paths
 '
 
-test_expect_success 'setup check value of version 1 changed-path' '
+test_expect_success 'check value of version 1 changed-path' '
 	(cd highbit1 &&
 		printf "52a9" >expect &&
 		get_first_changed_path_filter >actual &&
@@ -451,4 +458,67 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
 		test_bloom_filters_used "-- $CENT")
 '
 
+test_expect_success 'version 1 changed-path not used when version 2 requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'version 1 changed-path used when autodetect requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
+	test_commit -C highbit1 c1double "$CENT$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths &&
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		printf "00000001" >expect &&
+		get_changed_path_filter_version >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 2 changed-path' '
+	(cd highbit2 &&
+		printf "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 2 changed-path used when version 2 requested' '
+	(cd highbit2 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path not used when version 1 requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path used when autodetect requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
+	test_commit -C highbit2 c2double "$CENT$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths &&
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		printf "00000002" >expect &&
+		get_changed_path_filter_version >actual &&
+		test_cmp expect actual)
+'
+
 test_done
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 0/4] Changed path filter hash fix and version bump
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
                     ` (3 preceding siblings ...)
  2023-07-13 21:42   ` [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-07-13 22:16   ` Junio C Hamano
  2023-07-13 22:59     ` Junio C Hamano
  2023-07-14 18:48     ` Jonathan Tan
  4 siblings, 2 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-07-13 22:16 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Taylor Blau

Jonathan Tan <jonathantanmy@google.com> writes:

> I did not implement the mitigation of not using the Bloom filters when
> a high-bit path is sought because, as Stolee says, this is useful only
> when mixing Git implementations and will slow down operations (without
> any increase in correctness) in the absence of such a mix [1].

Sensible, I guess.

>     @@ Commit message
>          This commit does not change the behavior of writing (Git writes changed
>          path filters when explicitly instructed regardless of any config
>          variable), but a subsequent commit will restrict Git such that it will
>     -    only write when commitgraph.changedPathsVersion is 0, 1, or 2.
>     +    only write when commitgraph.changedPathsVersion is a recognized value.

This is nicer.

>          Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>          Signed-off-by: Junio C Hamano <gitster@pobox.com>
>     @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
>      -	If true, then git will use the changed-path Bloom filters in the
>      -	commit-graph file (if it exists, and they are present). Defaults to
>      -	true. See linkgit:git-commit-graph[1] for more information.
>     -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
>     ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
>      +	changedPathsVersion=0 if false.

I forgot to comment on this part earlier, but does the context make
it clear enough that these `changedPathsVersion` references are
about `commitGraph.changedPathsVersion` configuration variable
without fully spelled out?  They sit next to each other right now,
so it may not be too bad.  If they appeared across more distance,
I would be worried, though.

>      +commitGraph.changedPathsVersion::
>      +	Specifies the version of the changed-path Bloom filters that Git will read and
>     -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
>     ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
>      +	match the version set in this config variable will be ignored.

So, any time the user configures this to a different value, we will
start to ignore the existing changed-path-filters data in the
repository, and when we are told to write commit-graph, we will
construct changed-path-filters data using the new version?

>      ++
>     -+Defaults to 1.
>     ++Defaults to -1.
>     +++
>     ++If -1, Git will use the version of the changed-path Bloom filters in the
>     ++repository, defaulting to 1 if there are none.

OK, that was misleading.  The configuration can say "-1" and it does
not mean "I'll ignore anything other than version -1"---it means
"I'll read anything".  The earlier statement should be toned down so
that we do not surprise readers, perhaps

    When set to a positive integer value, any changed-path Bloom
    filters on disk whose version is different from the value are
    ignored.

to signal that 0 and negative are special.  Then the readers can
anticipate that special cases are described next.

    When set to -1, then ...
    When set to 0, then ...
    Defaults to -1.
    
When set to the special value -1, what version will we write?

>      +If 0, git will write version 1 Bloom filters when instructed to write.

And we will only read 0 and refuse to read 1?  Or we will read both
0 and 1?

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 2/4] t4216: test changed path filters with high bit paths
  2023-07-13 21:42   ` [PATCH v5 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-07-13 22:50     ` Junio C Hamano
  2023-07-19 17:27       ` Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-07-13 22:50 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Taylor Blau

Jonathan Tan <jonathantanmy@google.com> writes:

> Subsequent commits will teach Git another version of changed path
> filter that has different behavior with paths that contain at least
> one character with its high bit set, so test the existing behavior as
> a baseline.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  t/t4216-log-bloom.sh | 47 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
>
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index fa9d32facf..0cf208fdf5 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -404,4 +404,51 @@ test_expect_success 'Bloom generation backfills empty commits' '
>  	)
>  '
>  
> +get_bdat_offset () {
> +	perl -0777 -ne \
> +		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
> +		.git/objects/info/commit-graph
> +}

Hopefully the 8-byte anchoring pattern is unique enough at least for
the purpose of this test script.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 0/4] Changed path filter hash fix and version bump
  2023-07-13 22:16   ` [PATCH v5 0/4] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-07-13 22:59     ` Junio C Hamano
  2023-07-14 18:48     ` Jonathan Tan
  1 sibling, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-07-13 22:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Taylor Blau

Junio C Hamano <gitster@pobox.com> writes:

>>      +If 0, git will write version 1 Bloom filters when instructed to write.
>
> And we will only read 0 and refuse to read 1?  Or we will read both
> 0 and 1?

Answering to myself (only this part).  As setting the "version"
variable to 0 is equivalent to setting "read" variable to "false",
we will refuse to read anything.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 0/4] Changed path filter hash fix and version bump
  2023-07-13 22:16   ` [PATCH v5 0/4] Changed path filter hash fix and version bump Junio C Hamano
  2023-07-13 22:59     ` Junio C Hamano
@ 2023-07-14 18:48     ` Jonathan Tan
  1 sibling, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-14 18:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Derrick Stolee, Taylor Blau

Junio C Hamano <gitster@pobox.com> writes:
> >          Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> >          Signed-off-by: Junio C Hamano <gitster@pobox.com>
> >     @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
> >      -	If true, then git will use the changed-path Bloom filters in the
> >      -	commit-graph file (if it exists, and they are present). Defaults to
> >      -	true. See linkgit:git-commit-graph[1] for more information.
> >     -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
> >     ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
> >      +	changedPathsVersion=0 if false.
> 
> I forgot to comment on this part earlier, but does the context make
> it clear enough that these `changedPathsVersion` references are
> about `commitGraph.changedPathsVersion` configuration variable
> without fully spelled out?  They sit next to each other right now,
> so it may not be too bad.  If they appeared across more distance,
> I would be worried, though.

Ah, probably better to spell it out. I'll change it.

> >      +commitGraph.changedPathsVersion::
> >      +	Specifies the version of the changed-path Bloom filters that Git will read and
> >     -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
> >     ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
> >      +	match the version set in this config variable will be ignored.
> 
> So, any time the user configures this to a different value, we will
> start to ignore the existing changed-path-filters data in the
> repository, and when we are told to write commit-graph, we will
> construct changed-path-filters data using the new version?

Yes.

> >     -+Defaults to 1.
> >     ++Defaults to -1.
> >     +++
> >     ++If -1, Git will use the version of the changed-path Bloom filters in the
> >     ++repository, defaulting to 1 if there are none.
> 
> OK, that was misleading.  The configuration can say "-1" and it does
> not mean "I'll ignore anything other than version -1"---it means
> "I'll read anything".  The earlier statement should be toned down so
> that we do not surprise readers, perhaps

Ah, good point. Will do.

>     When set to a positive integer value, any changed-path Bloom
>     filters on disk whose version is different from the value are
>     ignored.
> 
> to signal that 0 and negative are special.  Then the readers can
> anticipate that special cases are described next.
> 
>     When set to -1, then ...
>     When set to 0, then ...
>     Defaults to -1.
>     
> When set to the special value -1, what version will we write?
> 
> >      +If 0, git will write version 1 Bloom filters when instructed to write.
> 
> And we will only read 0 and refuse to read 1?  Or we will read both
> 0 and 1?
> 
> Thanks.

Currently, there is only version 1 (no version 0) and after all the
patches in this patch set are applied, there will be version 1 and
version 2. I think that with your suggestions above, it will be clearer
to the reader.
 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-07-13 21:42   ` [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-07-19 17:25     ` Taylor Blau
  2023-07-20 20:20       ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:25 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Jul 13, 2023 at 02:42:08PM -0700, Jonathan Tan wrote:
> The code change to Git to support version 2 will be done in subsequent
> commits.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  Documentation/gitformat-commit-graph.txt | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
> index 31cad585e2..3e906e8030 100644
> --- a/Documentation/gitformat-commit-graph.txt
> +++ b/Documentation/gitformat-commit-graph.txt
> @@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
>
>  ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]

This is a little beyond the scope of your series, but since we're
changing the on-disk format here a little bit, I think that it might be
worth it to consider whether there are any other changes that we'd like
to perform at the same time.

One that comes to mind is serializing the `max_changed_paths` value of
the Bloom filter settings, which is currently hard-coded as a constant,
c.f. 97ffa4fab50 (commit-graph.c: store maximum changed paths,
2020-09-17).

We always assume that the value there is 512, or the environment
variable GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS, if it is set. But it
might be nice to write it to disk, since it would allow us to do
something like:

--- 8< ---
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facfb..a42b0b03cfb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -178,11 +178,12 @@ test_expect_success 'persist filter settings' '
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=513 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":513" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":513" trace2-auto.txt
 '

 test_max_changed_paths () {
--- >8 ---

Which is currently not possible (the second grep assertion will fail,
since Git has no way to remember what the value of max_changed_paths is
from the existing commit-graph).

> +	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
> +	when char is signed and the repository has path names that have characters >=
> +	0x80; Git supports reading and writing them, but this ability will be removed
> +	in a future version of Git.

Makes sense.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 2/4] t4216: test changed path filters with high bit paths
  2023-07-13 22:50     ` Junio C Hamano
@ 2023-07-19 17:27       ` Taylor Blau
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Derrick Stolee

On Thu, Jul 13, 2023 at 03:50:53PM -0700, Junio C Hamano wrote:
> > @@ -404,4 +404,51 @@ test_expect_success 'Bloom generation backfills empty commits' '
> >  	)
> >  '
> >
> > +get_bdat_offset () {
> > +	perl -0777 -ne \
> > +		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
> > +		.git/objects/info/commit-graph
> > +}
>
> Hopefully the 8-byte anchoring pattern is unique enough at least for
> the purpose of this test script.

I had the same thought myself. I wonder if this would be more
straightforward to implement in a test-helper.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly
  2023-07-19 17:27       ` Taylor Blau
@ 2023-07-19 17:55         ` Taylor Blau
  2023-07-19 17:55           ` [PATCH 1/4] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
                             ` (5 more replies)
  0 siblings, 6 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:55 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Hi Jonathan,

Here's a few commits (and one fixup!) that could go before the second
patch of your series, with the fixup! getting squashed into the second
patch itself.

The first three are preparatory, but the fourth patch should allow us to
drop the Perl hackery necessary to dump the raw contents of arbitrary
Bloom filters.

Feel free to pick up these patches (or not), just wanted to get these
out there as a possible suggestion.

Taylor Blau (4):
  t/helper/test-read-graph.c: extract `dump_graph_info()`
  bloom.h: make `load_bloom_filter_from_graph()` public
  t/helper/test-read-graph: implement `bloom-filters` mode
  fixup! t4216: test changed path filters with high bit paths

 bloom.c                    |  6 ++--
 bloom.h                    |  5 +++
 t/helper/test-read-graph.c | 67 ++++++++++++++++++++++++++++++--------
 t/t4216-log-bloom.sh       | 14 ++------
 4 files changed, 64 insertions(+), 28 deletions(-)

--
2.41.0.366.g215419bf3c2.dirty

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 1/4] t/helper/test-read-graph.c: extract `dump_graph_info()`
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
@ 2023-07-19 17:55           ` Taylor Blau
  2023-07-19 17:55           ` [PATCH 2/4] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:55 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Prepare for the 'read-graph' test helper to perform other tasks besides
dumping high-level information about the commit-graph by extracting its
main routine into a separate function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 8c7a83f578f..c6649284123 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -5,20 +5,8 @@
 #include "bloom.h"
 #include "setup.h"
 
-int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+static void dump_graph_info(struct commit_graph *graph)
 {
-	struct commit_graph *graph = NULL;
-	struct object_directory *odb;
-
-	setup_git_directory();
-	odb = the_repository->objects->odb;
-
-	prepare_repo_settings(the_repository);
-
-	graph = read_commit_graph_one(the_repository, odb);
-	if (!graph)
-		return 1;
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
@@ -57,8 +45,27 @@ int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
 	if (graph->topo_levels)
 		printf(" topo_levels");
 	printf("\n");
+}
+
+int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+{
+	struct commit_graph *graph = NULL;
+	struct object_directory *odb;
+
+	setup_git_directory();
+	odb = the_repository->objects->odb;
+
+	prepare_repo_settings(the_repository);
+
+	graph = read_commit_graph_one(the_repository, odb);
+	if (!graph)
+		return 1;
+
+	dump_graph_info(graph);
 
 	UNLEAK(graph);
 
 	return 0;
 }
+
+
-- 
2.41.0.366.g215419bf3c2.dirty


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 2/4] bloom.h: make `load_bloom_filter_from_graph()` public
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
  2023-07-19 17:55           ` [PATCH 1/4] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
@ 2023-07-19 17:55           ` Taylor Blau
  2023-07-19 17:55           ` [PATCH 3/4] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:55 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Prepare for a future commit to use the load_bloom_filter_from_graph()
function directly to load specific Bloom filters out of the commit-graph
for manual inspection (to be used during tests).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c | 6 +++---
 bloom.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index aef6b5fea2d..3e78cfe79d4 100644
--- a/bloom.c
+++ b/bloom.c
@@ -29,9 +29,9 @@ static inline unsigned char get_bitmask(uint32_t pos)
 	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
 }
 
-static int load_bloom_filter_from_graph(struct commit_graph *g,
-					struct bloom_filter *filter,
-					uint32_t graph_pos)
+int load_bloom_filter_from_graph(struct commit_graph *g,
+				 struct bloom_filter *filter,
+				 uint32_t graph_pos)
 {
 	uint32_t lex_pos, start_index, end_index;
 
diff --git a/bloom.h b/bloom.h
index adde6dfe212..1e4f612d2c2 100644
--- a/bloom.h
+++ b/bloom.h
@@ -3,6 +3,7 @@
 
 struct commit;
 struct repository;
+struct commit_graph;
 
 struct bloom_filter_settings {
 	/*
@@ -68,6 +69,10 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+int load_bloom_filter_from_graph(struct commit_graph *g,
+				 struct bloom_filter *filter,
+				 uint32_t graph_pos);
+
 /*
  * Calculate the murmur3 32-bit hash value for the given data
  * using the given seed.
-- 
2.41.0.366.g215419bf3c2.dirty


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 3/4] t/helper/test-read-graph: implement `bloom-filters` mode
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
  2023-07-19 17:55           ` [PATCH 1/4] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
  2023-07-19 17:55           ` [PATCH 2/4] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
@ 2023-07-19 17:55           ` Taylor Blau
  2023-07-19 17:55           ` [PATCH 4/4] fixup! t4216: test changed path filters with high bit paths Taylor Blau
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:55 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Implement a mode of the "read-graph" test helper to dump out the
hexadecimal contents of the Bloom filter(s) contained in a commit-graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 42 +++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index c6649284123..da9ac8584da 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -47,10 +47,32 @@ static void dump_graph_info(struct commit_graph *graph)
 	printf("\n");
 }
 
-int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+static void dump_graph_bloom_filters(struct commit_graph *graph)
+{
+	uint32_t i;
+
+	for (i = 0; i < graph->num_commits + graph->num_commits_in_base; i++) {
+		struct bloom_filter filter = { 0 };
+		size_t j;
+
+		if (load_bloom_filter_from_graph(graph, &filter, i) < 0) {
+			fprintf(stderr, "missing Bloom filter for graph "
+				"position %"PRIu32"\n", i);
+			continue;
+		}
+
+		for (j = 0; j < filter.len; j++)
+			printf("%02x", filter.data[j]);
+		if (filter.len)
+			printf("\n");
+	}
+}
+
+int cmd__read_graph(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
 	struct object_directory *odb;
+	int ret = 0;
 
 	setup_git_directory();
 	odb = the_repository->objects->odb;
@@ -58,14 +80,24 @@ int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
 	prepare_repo_settings(the_repository);
 
 	graph = read_commit_graph_one(the_repository, odb);
-	if (!graph)
-		return 1;
+	if (!graph) {
+		ret = 1;
+		goto done;
+	}
 
-	dump_graph_info(graph);
+	if (argc <= 1)
+		dump_graph_info(graph);
+	else if (!strcmp(argv[1], "bloom-filters"))
+		dump_graph_bloom_filters(graph);
+	else {
+		fprintf(stderr, "unknown sub-command: '%s'\n", argv[1]);
+		ret = 1;
+	}
 
+done:
 	UNLEAK(graph);
 
-	return 0;
+	return ret;
 }
 
 
-- 
2.41.0.366.g215419bf3c2.dirty


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 4/4] fixup! t4216: test changed path filters with high bit paths
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
                             ` (2 preceding siblings ...)
  2023-07-19 17:55           ` [PATCH 3/4] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
@ 2023-07-19 17:55           ` Taylor Blau
  2023-07-19 19:24           ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Junio C Hamano
  2023-07-20 20:22           ` Jonathan Tan
  5 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 17:55 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 0cf208fdf55..c49528b947a 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,17 +404,9 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
-get_bdat_offset () {
-	perl -0777 -ne \
-		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
-		.git/objects/info/commit-graph
-}
-
 get_first_changed_path_filter () {
-	BDAT_OFFSET=$(get_bdat_offset) &&
-	perl -0777 -ne \
-		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
-		.git/objects/info/commit-graph
+	test-tool read-graph bloom-filters >filters.dat &&
+	head -n 1 filters.dat
 }
 
 # chosen to be the same under all Unicode normalization forms
@@ -428,7 +420,7 @@ test_expect_success 'set up repo with high bit path, version 1 changed-path' '
 
 test_expect_success 'setup check value of version 1 changed-path' '
 	(cd highbit1 &&
-		printf "52a9" >expect &&
+		echo "52a9" >expect &&
 		get_first_changed_path_filter >actual &&
 		test_cmp expect actual)
 '
-- 
2.41.0.366.g215419bf3c2.dirty

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-07-13 21:42   ` [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-07-19 18:10     ` Taylor Blau
  2023-07-20 20:42       ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 18:10 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Jul 13, 2023 at 02:42:10PM -0700, Jonathan Tan wrote:
> diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
> index 30604e4a4c..07f3799e05 100644
> --- a/Documentation/config/commitgraph.txt
> +++ b/Documentation/config/commitgraph.txt
> @@ -9,6 +9,19 @@ commitGraph.maxNewFilters::
>  	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
>
>  commitGraph.readChangedPaths::
> -	If true, then git will use the changed-path Bloom filters in the
> -	commit-graph file (if it exists, and they are present). Defaults to
> -	true. See linkgit:git-commit-graph[1] for more information.
> +	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
> +	changedPathsVersion=0 if false.

What happens if we have a combination of the two, like:

    [commitGraph]
        readChangedPaths = false
        changedPathsVersion = 1

? From reading the implementation below, I think the answer is that
changedPathsVersion would win out. I think that's fine behavior to
implement (the more modern configuration option taking precedence over
the deprecated one). But I think that we should either (a) note that
precedence in the documentation here, or (b) issue a warning() when both
are set.

For my $.02, I think that doing just (a) would be sufficient.

> diff --git a/commit-graph.c b/commit-graph.c
> index c11b59f28b..9b72319450 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -399,7 +399,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>  			graph->read_generation_data = 1;
>  	}
>
> -	if (s->commit_graph_read_changed_paths) {
> +	if (s->commit_graph_changed_paths_version != 0) {
>  		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
>  			   &graph->chunk_bloom_indexes);
>  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,

Just a small note, but writing this as

    if (!s->commit_graph_changed_paths_version)

would probably be more in line with our coding guidelines.

> diff --git a/repo-settings.c b/repo-settings.c
> index 3dbd3f0e2e..e3b6565ffc 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
>  	int value;
>  	const char *strval;
>  	int manyfiles;
> +	int readChangedPaths;

Small note: this should be snake-cased like "read_changed_paths".

> diff --git a/repository.h b/repository.h
> index e8c67ffe16..1f1c32a6dd 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -32,7 +32,7 @@ struct repo_settings {
>
>  	int core_commit_graph;
>  	int commit_graph_generation_version;
> -	int commit_graph_read_changed_paths;

Nice, I'm glad that we're getting rid of this variable and replacing it
with commit_graph_changed_paths_version instead.

> +	int commit_graph_changed_paths_version;

Looking good.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-07-13 21:42   ` [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-07-19 18:24     ` Taylor Blau
  2023-07-20 21:27       ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-07-19 18:24 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Jul 13, 2023 at 02:42:11PM -0700, Jonathan Tan wrote:
> The murmur3 implementation in bloom.c has a bug when converting series
> of 4 bytes into network-order integers when char is signed (which is
> controllable by a compiler option, and the default signedness of char is
> platform-specific). When a string contains characters with the high bit
> set, this bug causes results that, although internally consistent within
> Git, does not accord with other implementations of murmur3 and even with
> Git binaries that were compiled with different signedness of char. This
> bug affects both how Git writes changed path filters to disk and how Git
> interprets changed path filters on disk.

I think that you make a worthwhile point that the existing
implementation is internally consistent, but doesn't actually implement
a conventional murmur3. I wonder if it's worth being explicit where you
mention its internal consistency to say that the existing implementation
would never cause us to produce wrong results, but wouldn't be readable
by other off-the-shelf implementations of murmur3.

(To be clear, I think that you already make this point, I'm just
suggesting that it may be worth spelling it out even more explicitly
than what is written above).

> Therefore, introduce a new version (2) of changed path filters that
> corrects this problem. The existing version (1) is still supported and
> is still the default, but users should migrate away from it as soon
> as possible.

Makes sense.

> Because this bug only manifests with characters that have the high bit
> set, it may be possible that some (or all) commits in a given repo would
> have the same changed path filter both before and after this fix is
> applied. However, in order to determine whether this is the case, the
> changed paths would first have to be computed, at which point it is not
> much more expensive to just compute a new changed path filter.

Hmm. I think in the general case that is true, but I wonder if there's a
shortcut we could take for trees that are known to not have *any*
characters with their high-order bits set. That is, if we scan both of
the first parent's trees and determine that no such paths exist, the
contents of the Bloom filter would be the same in either version, right?

I think that that would be faster than recomputing all filters from
scratch. In either case, we have to load the whole tree. But if we can
quickly scan (and cache our results by setting some bit--indicating the
absence of paths with characters having their highest bit set--on the tree
objects' `flags` field), then we should be able to copy forward the
existing representation of the filter.

I think the early checks would be more expensive, since in the worst
case you have to walk the entire tree, only to realize that you actually
wanted to compute a first-parent tree diff, meaning you have to
essentially repeat the whole walk over again. But for repositories that
have few or no commits whose Bloom filters need computing, I think it
would be significantly faster, since many of the sub-trees wouldn't need
to be visited again.

> There is a change in write_commit_graph(). graph_read_bloom_data()
> makes it possible for chunk_bloom_data to be non-NULL but
> bloom_filter_settings to be NULL, which causes a segfault later on. I
> produced such a segfault while developing this patch, but couldn't find
> a way to reproduce it neither after this complete patch (or before),
> but in any case it seemed like a good thing to include that might help
> future patch authors.

Hmm. Interesting, I'd love to know more about what you were doing that
produced the segfault. I think it would be worth it to prove to
ourselves that this segfault can't occur in the wild. Or if it can, it
would be worth it to understand the bug and prevent it from happening.

> +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
>  {
>  	const uint32_t c1 = 0xcc9e2d51;
>  	const uint32_t c2 = 0x1b873593;
> @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
>  	int i;
>  	const uint32_t seed0 = 0x293ae76f;
>  	const uint32_t seed1 = 0x7e646e2c;
> -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> +	const uint32_t hash0 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
> +	const uint32_t hash1 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);

I do admire the ternary operator over the function being called, as I
think that Stolee pointed out earlier in this series. But I worry that
these two checks could fall out of sync with each other, causing us to
pick incosistent values for hash0, and hash1, respectively.

FWIW, I would probably write this as:

    const uint32_t hash0, hash1;
    if (settings->hash_version == 2) {
        hash0 = murmur3_seeded_v2(seed0, data, len);
        hash1 = murmur3_seeded_v2(seed1, data, len);
    } else {
        hash0 = murmur3_seeded_v1(seed0, data, len);
        hash1 = murmur3_seeded_v1(seed1, data, len);
    }

I suppose that there isn't anything keeping the calls within each arm of
the if-statement above in sync with each other (i.e., I could call
murmur3_seeded_v1() immediately before dispatching a call to
murmur3_seeded_v2()). So in that sense I think that this is no more or
less safe than what's already written.

But IMHO I think this one reads more cleanly, so I might prefer it over
what you have in this patch. But I don't feel so strongly about it
either way.

> diff --git a/commit-graph.c b/commit-graph.c
> index 9b72319450..c50107eed5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -302,16 +302,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
>  	return 0;
>  }
>
> +struct graph_read_bloom_data_data {
> +	struct commit_graph *g;
> +	int *commit_graph_changed_paths_version;
> +};
> +

Nit: maybe `graph_read_bloom_data_context`, to avoid repeating "data"? I
don't have strong feelings here, FWIW.

The rest of the implementation and tests look good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
                             ` (3 preceding siblings ...)
  2023-07-19 17:55           ` [PATCH 4/4] fixup! t4216: test changed path filters with high bit paths Taylor Blau
@ 2023-07-19 19:24           ` Junio C Hamano
  2023-07-20 20:22           ` Jonathan Tan
  5 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-07-19 19:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jonathan Tan, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> The first three are preparatory, but the fourth patch should allow us to
> drop the Perl hackery necessary to dump the raw contents of arbitrary
> Bloom filters.
>
> Feel free to pick up these patches (or not), just wanted to get these
> out there as a possible suggestion.

Thanks, these seem like a good thing to do.  Would love to see an
updated version with them rolled in.

Thanks both for working so well together.

>
> Taylor Blau (4):
>   t/helper/test-read-graph.c: extract `dump_graph_info()`
>   bloom.h: make `load_bloom_filter_from_graph()` public
>   t/helper/test-read-graph: implement `bloom-filters` mode
>   fixup! t4216: test changed path filters with high bit paths
>
>  bloom.c                    |  6 ++--
>  bloom.h                    |  5 +++
>  t/helper/test-read-graph.c | 67 ++++++++++++++++++++++++++++++--------
>  t/t4216-log-bloom.sh       | 14 ++------
>  4 files changed, 64 insertions(+), 28 deletions(-)
>
> --
> 2.41.0.366.g215419bf3c2.dirty

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-07-19 17:25     ` Taylor Blau
@ 2023-07-20 20:20       ` Jonathan Tan
  2023-07-21  1:38         ` Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 20:20 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> This is a little beyond the scope of your series, but since we're
> changing the on-disk format here a little bit, I think that it might be
> worth it to consider whether there are any other changes that we'd like
> to perform at the same time.
> 
> One that comes to mind is serializing the `max_changed_paths` value of
> the Bloom filter settings, which is currently hard-coded as a constant,
> c.f. 97ffa4fab50 (commit-graph.c: store maximum changed paths,
> 2020-09-17).
> 
> We always assume that the value there is 512, or the environment
> variable GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS, if it is set. But it
> might be nice to write it to disk, since it would allow us to do
> something like:

That's true...I do think it makes sense to either have both
bits_per_entry and max_changed_paths (because if we are reusing Bloom
filters when writing, presumably we would want to make sure that the
existing ones were generated using the same settings), or have neither
(since we don't need them for reading).

Having said that, I am inclined to not change this, so that the offset
calculations are the same for both versions (e.g. in the test tool
too), and as far as I know, we haven't had problems with this. But I can
change it if people want.

> > +	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
> > +	when char is signed and the repository has path names that have characters >=
> > +	0x80; Git supports reading and writing them, but this ability will be removed
> > +	in a future version of Git.
> 
> Makes sense.
> 
> Thanks,
> Taylor

Thanks for taking a look.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly
  2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
                             ` (4 preceding siblings ...)
  2023-07-19 19:24           ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Junio C Hamano
@ 2023-07-20 20:22           ` Jonathan Tan
  5 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 20:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> Hi Jonathan,
> 
> Here's a few commits (and one fixup!) that could go before the second
> patch of your series, with the fixup! getting squashed into the second
> patch itself.
> 
> The first three are preparatory, but the fourth patch should allow us to
> drop the Perl hackery necessary to dump the raw contents of arbitrary
> Bloom filters.
> 
> Feel free to pick up these patches (or not), just wanted to get these
> out there as a possible suggestion.

Ah, thanks! I'm on the fence myself about whether this really needs to
be more rigorous (the Perl part just gives us an offset; we still check
that what's at the offset is correct) but since you have written the
patches already, I'll just include them.
 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-07-19 18:10     ` Taylor Blau
@ 2023-07-20 20:42       ` Jonathan Tan
  2023-07-20 21:02         ` Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 20:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> What happens if we have a combination of the two, like:
> 
>     [commitGraph]
>         readChangedPaths = false
>         changedPathsVersion = 1
> 
> ? From reading the implementation below, I think the answer is that
> changedPathsVersion would win out. I think that's fine behavior to
> implement (the more modern configuration option taking precedence over
> the deprecated one). But I think that we should either (a) note that
> precedence in the documentation here, or (b) issue a warning() when both
> are set.
> 
> For my $.02, I think that doing just (a) would be sufficient.

Thanks. I added a note to the documentation.

> > -	if (s->commit_graph_read_changed_paths) {
> > +	if (s->commit_graph_changed_paths_version != 0) {
> >  		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
> >  			   &graph->chunk_bloom_indexes);
> >  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
> 
> Just a small note, but writing this as
> 
>     if (!s->commit_graph_changed_paths_version)
> 
> would probably be more in line with our coding guidelines.

Ah, yes. Changed it (without the exclamation mark).

> > diff --git a/repo-settings.c b/repo-settings.c
> > index 3dbd3f0e2e..e3b6565ffc 100644
> > --- a/repo-settings.c
> > +++ b/repo-settings.c
> > @@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
> >  	int value;
> >  	const char *strval;
> >  	int manyfiles;
> > +	int readChangedPaths;
> 
> Small note: this should be snake-cased like "read_changed_paths".

Whoops...thanks for the catch.

 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion
  2023-07-20 20:42       ` Jonathan Tan
@ 2023-07-20 21:02         ` Taylor Blau
  0 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-20 21:02 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Jul 20, 2023 at 01:42:32PM -0700, Jonathan Tan wrote:
> > > -	if (s->commit_graph_read_changed_paths) {
> > > +	if (s->commit_graph_changed_paths_version != 0) {
> > >  		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
> > >  			   &graph->chunk_bloom_indexes);
> > >  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
> >
> > Just a small note, but writing this as
> >
> >     if (!s->commit_graph_changed_paths_version)
> >
> > would probably be more in line with our coding guidelines.
>
> Ah, yes. Changed it (without the exclamation mark).

Hah, whoops -- thanks for spotting my mistake there. :-)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-07-19 18:24     ` Taylor Blau
@ 2023-07-20 21:27       ` Jonathan Tan
  2023-07-26 23:32         ` Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:
> On Thu, Jul 13, 2023 at 02:42:11PM -0700, Jonathan Tan wrote:
> > The murmur3 implementation in bloom.c has a bug when converting series
> > of 4 bytes into network-order integers when char is signed (which is
> > controllable by a compiler option, and the default signedness of char is
> > platform-specific). When a string contains characters with the high bit
> > set, this bug causes results that, although internally consistent within
> > Git, does not accord with other implementations of murmur3 and even with
> > Git binaries that were compiled with different signedness of char. This
> > bug affects both how Git writes changed path filters to disk and how Git
> > interprets changed path filters on disk.
> 
> I think that you make a worthwhile point that the existing
> implementation is internally consistent, but doesn't actually implement
> a conventional murmur3. I wonder if it's worth being explicit where you
> mention its internal consistency to say that the existing implementation
> would never cause us to produce wrong results, but wouldn't be readable
> by other off-the-shelf implementations of murmur3.
> 
> (To be clear, I think that you already make this point, I'm just
> suggesting that it may be worth spelling it out even more explicitly
> than what is written above).

OK, I've added some more text describing this.

> > Because this bug only manifests with characters that have the high bit
> > set, it may be possible that some (or all) commits in a given repo would
> > have the same changed path filter both before and after this fix is
> > applied. However, in order to determine whether this is the case, the
> > changed paths would first have to be computed, at which point it is not
> > much more expensive to just compute a new changed path filter.
> 
> Hmm. I think in the general case that is true, but I wonder if there's a
> shortcut we could take for trees that are known to not have *any*
> characters with their high-order bits set. That is, if we scan both of
> the first parent's trees and determine that no such paths exist, the
> contents of the Bloom filter would be the same in either version, right?
> 
> I think that that would be faster than recomputing all filters from
> scratch. In either case, we have to load the whole tree. But if we can
> quickly scan (and cache our results by setting some bit--indicating the
> absence of paths with characters having their highest bit set--on the tree
> objects' `flags` field), then we should be able to copy forward the
> existing representation of the filter.
> 
> I think the early checks would be more expensive, since in the worst
> case you have to walk the entire tree, only to realize that you actually
> wanted to compute a first-parent tree diff, meaning you have to
> essentially repeat the whole walk over again. But for repositories that
> have few or no commits whose Bloom filters need computing, I think it
> would be significantly faster, since many of the sub-trees wouldn't need
> to be visited again.

So for repositories that need little-to-no recomputation of Bloom
filters, your idea likely means that each tree needs to be read once,
as compared to recomputing everything in which, I think, each tree needs
to be read roughly twice (once when computing the Bloom filter for the
commit that introduces it, and once for the commit that substitutes a
different tree in place).

I could change the text of the commit message to discuss this (instead
of the blanket statement that it would be too hard), although I think
that an implementation of this can be done after this patchset. What do
you think?

> > There is a change in write_commit_graph(). graph_read_bloom_data()
> > makes it possible for chunk_bloom_data to be non-NULL but
> > bloom_filter_settings to be NULL, which causes a segfault later on. I
> > produced such a segfault while developing this patch, but couldn't find
> > a way to reproduce it neither after this complete patch (or before),
> > but in any case it seemed like a good thing to include that might help
> > future patch authors.
> 
> Hmm. Interesting, I'd love to know more about what you were doing that
> produced the segfault. I think it would be worth it to prove to
> ourselves that this segfault can't occur in the wild. Or if it can, it
> would be worth it to understand the bug and prevent it from happening.

If I remember correctly, I changed the version in
DEFAULT_BLOOM_FILTER_SETTINGS to 2 and ran it to see what broke. This
was a while back so I don't remember it exactly, though.

> > @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
> >  	int i;
> >  	const uint32_t seed0 = 0x293ae76f;
> >  	const uint32_t seed1 = 0x7e646e2c;
> > -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> > -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> > +	const uint32_t hash0 = (settings->hash_version == 2
> > +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
> > +	const uint32_t hash1 = (settings->hash_version == 2
> > +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
> 
> I do admire the ternary operator over the function being called, as I
> think that Stolee pointed out earlier in this series. But I worry that
> these two checks could fall out of sync with each other, causing us to
> pick incosistent values for hash0, and hash1, respectively.
> 
> FWIW, I would probably write this as:
> 
>     const uint32_t hash0, hash1;
>     if (settings->hash_version == 2) {
>         hash0 = murmur3_seeded_v2(seed0, data, len);
>         hash1 = murmur3_seeded_v2(seed1, data, len);
>     } else {
>         hash0 = murmur3_seeded_v1(seed0, data, len);
>         hash1 = murmur3_seeded_v1(seed1, data, len);
>     }
> 
> I suppose that there isn't anything keeping the calls within each arm of
> the if-statement above in sync with each other (i.e., I could call
> murmur3_seeded_v1() immediately before dispatching a call to
> murmur3_seeded_v2()). So in that sense I think that this is no more or
> less safe than what's already written.
> 
> But IMHO I think this one reads more cleanly, so I might prefer it over
> what you have in this patch. But I don't feel so strongly about it
> either way.

I'm OK either way, so I'll go with your approach (except the "const",
since my compiler doesn't like that).

> > @@ -302,16 +302,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
> >  	return 0;
> >  }
> >
> > +struct graph_read_bloom_data_data {
> > +	struct commit_graph *g;
> > +	int *commit_graph_changed_paths_version;
> > +};
> > +
> 
> Nit: maybe `graph_read_bloom_data_context`, to avoid repeating "data"? I
> don't have strong feelings here, FWIW.

I'll use "context".

> The rest of the implementation and tests look good to me.
> 
> Thanks,
> Taylor

Thanks for the review.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (6 preceding siblings ...)
  2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
@ 2023-07-20 21:46 ` Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
                     ` (8 more replies)
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
  8 siblings, 9 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

Thanks, Junio and Taylor, for your reviews. I have included Taylor's
patches in this patch set.

There seemed to be some merge conflicts when I tried to apply the
patches Taylor provided on the base that I built my patches on (that is,
the base of jt/path-filter-fix, namely, maint-2.40), so I have rebased
all my patches onto latest master.

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

Taylor Blau (3):
  t/helper/test-read-graph.c: extract `dump_graph_info()`
  bloom.h: make `load_bloom_filter_from_graph()` public
  t/helper/test-read-graph: implement `bloom-filters` mode

 Documentation/config/commitgraph.txt     |  26 +++++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  |  75 ++++++++++++++--
 bloom.h                                  |  13 ++-
 commit-graph.c                           |  33 +++++--
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 t/helper/test-bloom.c                    |   9 +-
 t/helper/test-read-graph.c               |  67 ++++++++++++---
 t/t0095-bloom.sh                         |   8 ++
 t/t4216-log-bloom.sh                     | 104 +++++++++++++++++++++++
 12 files changed, 315 insertions(+), 39 deletions(-)

Range-diff against v5:
1:  efe7f40fed = 1:  3ce6090a4d gitformat-commit-graph: describe version 2 of BDAT
-:  ---------- > 2:  1955734d1f t/helper/test-read-graph.c: extract `dump_graph_info()`
-:  ---------- > 3:  4cf7c2f634 bloom.h: make `load_bloom_filter_from_graph()` public
-:  ---------- > 4:  47b55758e6 t/helper/test-read-graph: implement `bloom-filters` mode
2:  f684d07971 ! 5:  5276e6a90e t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
      	)
      '
      
    -+get_bdat_offset () {
    -+	perl -0777 -ne \
    -+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
    -+		.git/objects/info/commit-graph
    -+}
    -+
     +get_first_changed_path_filter () {
    -+	BDAT_OFFSET=$(get_bdat_offset) &&
    -+	perl -0777 -ne \
    -+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
    -+		.git/objects/info/commit-graph
    ++	test-tool read-graph bloom-filters >filters.dat &&
    ++	head -n 1 filters.dat
     +}
     +
     +# chosen to be the same under all Unicode normalization forms
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +
     +test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
    -+		printf "52a9" >expect &&
    ++		echo "52a9" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
3:  2fadd87063 ! 6:  dc3f6d2d4f repo-settings: introduce commitgraph.changedPathsVersion
    @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
     -	If true, then git will use the changed-path Bloom filters in the
     -	commit-graph file (if it exists, and they are present). Defaults to
     -	true. See linkgit:git-commit-graph[1] for more information.
    -+	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
    -+	changedPathsVersion=0 if false.
    ++	Deprecated. Equivalent to commitGraph.changedPathsVersion=-1 if true, and
    ++	commitGraph.changedPathsVersion=0 if false. (If commitGraph.changedPathVersion
    ++	is also set, commitGraph.changedPathsVersion takes precedence.)
     +
     +commitGraph.changedPathsVersion::
     +	Specifies the version of the changed-path Bloom filters that Git will read and
    -+	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	match the version set in this config variable will be ignored.
    ++	write. May be -1, 0 or 1.
     ++
     +Defaults to -1.
     ++
     +If -1, Git will use the version of the changed-path Bloom filters in the
     +repository, defaulting to 1 if there are none.
     ++
    -+If 0, git will write version 1 Bloom filters when instructed to write.
    ++If 0, Git will not read any Bloom filters, and will write version 1 Bloom
    ++filters when instructed to write.
    +++
    ++If 1, Git will only read version 1 Bloom filters, and will write version 1
    ++Bloom filters.
     ++
     +See linkgit:git-commit-graph[1] for more information.
     
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
     -	if (s->commit_graph_read_changed_paths) {
    -+	if (s->commit_graph_changed_paths_version != 0) {
    ++	if (s->commit_graph_changed_paths_version) {
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
      	int value;
      	const char *strval;
      	int manyfiles;
    -+	int readChangedPaths;
    ++	int read_changed_paths;
      
      	if (!r->gitdir)
      		BUG("Cannot add settings for uninitialized repository");
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
      	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
      	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
     -	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
    -+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
    ++	repo_cfg_bool(r, "commitgraph.readchangedpaths", &read_changed_paths, 1);
     +	repo_cfg_int(r, "commitgraph.changedpathsversion",
     +		     &r->settings.commit_graph_changed_paths_version,
    -+		     readChangedPaths ? -1 : 0);
    ++		     read_changed_paths ? -1 : 0);
      	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
      	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
      
    @@ repository.h: struct repo_settings {
     -	int commit_graph_read_changed_paths;
     +	int commit_graph_changed_paths_version;
      	int gc_write_commit_graph;
    - 	int gc_cruft_packs;
      	int fetch_write_commit_graph;
    + 	int command_requires_full_index;
4:  e31711ae85 ! 7:  6e2d797406 commit-graph: new filter ver. that fixes murmur3
    @@ Commit message
         controllable by a compiler option, and the default signedness of char is
         platform-specific). When a string contains characters with the high bit
         set, this bug causes results that, although internally consistent within
    -    Git, does not accord with other implementations of murmur3 and even with
    -    Git binaries that were compiled with different signedness of char. This
    -    bug affects both how Git writes changed path filters to disk and how Git
    -    interprets changed path filters on disk.
    +    Git, does not accord with other implementations of murmur3 (thus,
    +    the changed path filters wouldn't be readable by other off-the-shelf
    +    implementatios of murmur3) and even with Git binaries that were compiled
    +    with different signedness of char. This bug affects both how Git writes
    +    changed path filters to disk and how Git interprets changed path filters
    +    on disk.
     
         Therefore, introduce a new version (2) of changed path filters that
         corrects this problem. The existing version (1) is still supported and
    @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths::
      
      commitGraph.changedPathsVersion::
      	Specifies the version of the changed-path Bloom filters that Git will read and
    --	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not
    - 	match the version set in this config variable will be ignored.
    +-	write. May be -1, 0 or 1.
    ++	write. May be -1, 0, 1, or 2.
      +
      Defaults to -1.
    + +
    +@@ Documentation/config/commitgraph.txt: filters when instructed to write.
    + If 1, Git will only read version 1 Bloom filters, and will write version 1
    + Bloom filters.
    + +
    ++If 2, Git will only read version 2 Bloom filters, and will write version 2
    ++Bloom filters.
    +++
    + See linkgit:git-commit-graph[1] for more information.
     
      ## bloom.c ##
    -@@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    +@@ bloom.c: int load_bloom_filter_from_graph(struct commit_graph *g,
       * Not considered to be cryptographically secure.
       * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
       */
    @@ bloom.c: void fill_bloom_key(const char *data,
      	const uint32_t seed1 = 0x7e646e2c;
     -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
     -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
    -+	const uint32_t hash0 = (settings->hash_version == 2
    -+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
    -+	const uint32_t hash1 = (settings->hash_version == 2
    -+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
    ++	uint32_t hash0, hash1;
    ++	if (settings->hash_version == 2) {
    ++		hash0 = murmur3_seeded_v2(seed0, data, len);
    ++		hash1 = murmur3_seeded_v2(seed1, data, len);
    ++	} else {
    ++		hash0 = murmur3_seeded_v1(seed0, data, len);
    ++		hash1 = murmur3_seeded_v1(seed1, data, len);
    ++	}
      
      	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
      	for (i = 0; i < settings->num_hashes; i++)
     
      ## bloom.h ##
    -@@ bloom.h: struct repository;
    +@@ bloom.h: struct commit_graph;
      struct bloom_filter_settings {
      	/*
      	 * The version of the hashing technique being used.
    @@ bloom.h: struct repository;
      	 */
      	uint32_t hash_version;
      
    -@@ bloom.h: struct bloom_key {
    +@@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g,
       * Not considered to be cryptographically secure.
       * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
       */
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      	return 0;
      }
      
    -+struct graph_read_bloom_data_data {
    ++struct graph_read_bloom_data_context {
     +	struct commit_graph *g;
     +	int *commit_graph_changed_paths_version;
     +};
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      				  size_t chunk_size, void *data)
      {
     -	struct commit_graph *g = data;
    -+	struct graph_read_bloom_data_data *d = data;
    -+	struct commit_graph *g = d->g;
    ++	struct graph_read_bloom_data_context *c = data;
    ++	struct commit_graph *g = c->g;
      	uint32_t hash_version;
      	g->chunk_bloom_data = chunk_start;
      	hash_version = get_be32(chunk_start);
      
     -	if (hash_version != 1)
    -+	if (*d->commit_graph_changed_paths_version == -1) {
    -+		*d->commit_graph_changed_paths_version = hash_version;
    -+	} else if (hash_version != *d->commit_graph_changed_paths_version) {
    - 		return 0;
    +-		return 0;
    ++	if (*c->commit_graph_changed_paths_version == -1) {
    ++		*c->commit_graph_changed_paths_version = hash_version;
    ++	} else if (hash_version != *c->commit_graph_changed_paths_version) {
    ++ 		return 0;
     +	}
      
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
    - 	if (s->commit_graph_changed_paths_version != 0) {
    -+		struct graph_read_bloom_data_data data = {
    + 	if (s->commit_graph_changed_paths_version) {
    ++		struct graph_read_bloom_data_context context = {
     +			.g = graph,
     +			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     +		};
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -			   graph_read_bloom_data, graph);
    -+			   graph_read_bloom_data, &data);
    ++			   graph_read_bloom_data, &context);
      	}
      
      	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: get_bdat_offset () {
    - 		.git/objects/info/commit-graph
    - }
    - 
    -+get_changed_path_filter_version () {
    -+	BDAT_OFFSET=$(get_bdat_offset) &&
    -+	perl -0777 -ne \
    -+		'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \
    -+		.git/objects/info/commit-graph
    -+}
    -+
    - get_first_changed_path_filter () {
    - 	BDAT_OFFSET=$(get_bdat_offset) &&
    - 	perl -0777 -ne \
     @@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' '
      	git -C highbit1 commit-graph write --reachable --changed-paths
      '
    @@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, versi
     -test_expect_success 'setup check value of version 1 changed-path' '
     +test_expect_success 'check value of version 1 changed-path' '
      	(cd highbit1 &&
    - 		printf "52a9" >expect &&
    + 		echo "52a9" >expect &&
      		get_first_changed_path_filter >actual &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +	git -C highbit1 commit-graph write --reachable --changed-paths &&
     +	(cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
    -+		printf "00000001" >expect &&
    -+		get_changed_path_filter_version >actual &&
    ++		echo "options: bloom(1,10,7) read_generation_data" >expect &&
    ++		test-tool read-graph >full &&
    ++		grep options full >actual &&
     +		test_cmp expect actual)
     +'
     +
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +
     +test_expect_success 'check value of version 2 changed-path' '
     +	(cd highbit2 &&
    -+		printf "c01f" >expect &&
    ++		echo "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +	git -C highbit2 commit-graph write --reachable --changed-paths &&
     +	(cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
    -+		printf "00000002" >expect &&
    -+		get_changed_path_filter_version >actual &&
    ++		echo "options: bloom(2,10,7) read_generation_data" >expect &&
    ++		test-tool read-graph >full &&
    ++		grep options full >actual &&
     +		test_cmp expect actual)
     +'
     +
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v6 1/7] gitformat-commit-graph: describe version 2 of BDAT
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

The code change to Git to support version 2 will be done in subsequent
commits.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/gitformat-commit-graph.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
index 31cad585e2..3e906e8030 100644
--- a/Documentation/gitformat-commit-graph.txt
+++ b/Documentation/gitformat-commit-graph.txt
@@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
 
 ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     * It starts with header consisting of three unsigned 32-bit integers:
-      - Version of the hash algorithm being used. We currently only support
-	value 1 which corresponds to the 32-bit version of the murmur3 hash
+      - Version of the hash algorithm being used. We currently support
+	value 2 which corresponds to the 32-bit version of the murmur3 hash
 	implemented exactly as described in
 	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
 	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
 	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
-	in Probabilistic Verification"
+	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
+	when char is signed and the repository has path names that have characters >=
+	0x80; Git supports reading and writing them, but this ability will be removed
+	in a future version of Git.
       - The number of times a path is hashed and hence the number of bit positions
 	      that cumulatively determine whether a file is present in the commit.
       - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()`
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-26 23:26     ` Taylor Blau
  2023-07-20 21:46   ` [PATCH v6 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Junio C Hamano, Jonathan Tan

From: Taylor Blau <me@ttaylorr.com>

Prepare for the 'read-graph' test helper to perform other tasks besides
dumping high-level information about the commit-graph by extracting its
main routine into a separate function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 8c7a83f578..c664928412 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -5,20 +5,8 @@
 #include "bloom.h"
 #include "setup.h"
 
-int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+static void dump_graph_info(struct commit_graph *graph)
 {
-	struct commit_graph *graph = NULL;
-	struct object_directory *odb;
-
-	setup_git_directory();
-	odb = the_repository->objects->odb;
-
-	prepare_repo_settings(the_repository);
-
-	graph = read_commit_graph_one(the_repository, odb);
-	if (!graph)
-		return 1;
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
@@ -57,8 +45,27 @@ int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
 	if (graph->topo_levels)
 		printf(" topo_levels");
 	printf("\n");
+}
+
+int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+{
+	struct commit_graph *graph = NULL;
+	struct object_directory *odb;
+
+	setup_git_directory();
+	odb = the_repository->objects->odb;
+
+	prepare_repo_settings(the_repository);
+
+	graph = read_commit_graph_one(the_repository, odb);
+	if (!graph)
+		return 1;
+
+	dump_graph_info(graph);
 
 	UNLEAK(graph);
 
 	return 0;
 }
+
+
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v6 3/7] bloom.h: make `load_bloom_filter_from_graph()` public
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Junio C Hamano, Jonathan Tan

From: Taylor Blau <me@ttaylorr.com>

Prepare for a future commit to use the load_bloom_filter_from_graph()
function directly to load specific Bloom filters out of the commit-graph
for manual inspection (to be used during tests).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c | 6 +++---
 bloom.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index aef6b5fea2..3e78cfe79d 100644
--- a/bloom.c
+++ b/bloom.c
@@ -29,9 +29,9 @@ static inline unsigned char get_bitmask(uint32_t pos)
 	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
 }
 
-static int load_bloom_filter_from_graph(struct commit_graph *g,
-					struct bloom_filter *filter,
-					uint32_t graph_pos)
+int load_bloom_filter_from_graph(struct commit_graph *g,
+				 struct bloom_filter *filter,
+				 uint32_t graph_pos)
 {
 	uint32_t lex_pos, start_index, end_index;
 
diff --git a/bloom.h b/bloom.h
index adde6dfe21..1e4f612d2c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -3,6 +3,7 @@
 
 struct commit;
 struct repository;
+struct commit_graph;
 
 struct bloom_filter_settings {
 	/*
@@ -68,6 +69,10 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+int load_bloom_filter_from_graph(struct commit_graph *g,
+				 struct bloom_filter *filter,
+				 uint32_t graph_pos);
+
 /*
  * Calculate the murmur3 32-bit hash value for the given data
  * using the given seed.
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v6 4/7] t/helper/test-read-graph: implement `bloom-filters` mode
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
                     ` (2 preceding siblings ...)
  2023-07-20 21:46   ` [PATCH v6 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Junio C Hamano, Jonathan Tan

From: Taylor Blau <me@ttaylorr.com>

Implement a mode of the "read-graph" test helper to dump out the
hexadecimal contents of the Bloom filter(s) contained in a commit-graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 42 +++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index c664928412..da9ac8584d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -47,10 +47,32 @@ static void dump_graph_info(struct commit_graph *graph)
 	printf("\n");
 }
 
-int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+static void dump_graph_bloom_filters(struct commit_graph *graph)
+{
+	uint32_t i;
+
+	for (i = 0; i < graph->num_commits + graph->num_commits_in_base; i++) {
+		struct bloom_filter filter = { 0 };
+		size_t j;
+
+		if (load_bloom_filter_from_graph(graph, &filter, i) < 0) {
+			fprintf(stderr, "missing Bloom filter for graph "
+				"position %"PRIu32"\n", i);
+			continue;
+		}
+
+		for (j = 0; j < filter.len; j++)
+			printf("%02x", filter.data[j]);
+		if (filter.len)
+			printf("\n");
+	}
+}
+
+int cmd__read_graph(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
 	struct object_directory *odb;
+	int ret = 0;
 
 	setup_git_directory();
 	odb = the_repository->objects->odb;
@@ -58,14 +80,24 @@ int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
 	prepare_repo_settings(the_repository);
 
 	graph = read_commit_graph_one(the_repository, odb);
-	if (!graph)
-		return 1;
+	if (!graph) {
+		ret = 1;
+		goto done;
+	}
 
-	dump_graph_info(graph);
+	if (argc <= 1)
+		dump_graph_info(graph);
+	else if (!strcmp(argv[1], "bloom-filters"))
+		dump_graph_bloom_filters(graph);
+	else {
+		fprintf(stderr, "unknown sub-command: '%s'\n", argv[1]);
+		ret = 1;
+	}
 
+done:
 	UNLEAK(graph);
 
-	return 0;
+	return ret;
 }
 
 
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v6 5/7] t4216: test changed path filters with high bit paths
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
                     ` (3 preceding siblings ...)
  2023-07-20 21:46   ` [PATCH v6 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-26 23:28     ` Taylor Blau
  2023-07-20 21:46   ` [PATCH v6 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4216-log-bloom.sh | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..c49528b947 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,43 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_first_changed_path_filter () {
+	test-tool read-graph bloom-filters >filters.dat &&
+	head -n 1 filters.dat
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\302\242")
+
+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'setup check value of version 1 changed-path' '
+	(cd highbit1 &&
+		echo "52a9" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+# expect will not match actual if char is unsigned by default. Write the test
+# in this way, so that a user running this test script can still see if the two
+# files match. (It will appear as an ordinary success if they match, and a skip
+# if not.)
+if test_cmp highbit1/expect highbit1/actual
+then
+	test_set_prereq SIGNED_CHAR_BY_DEFAULT
+fi
+test_expect_success SIGNED_CHAR_BY_DEFAULT 'check value of version 1 changed-path' '
+	# Only the prereq matters for this test.
+	true
+'
+
+test_expect_success 'version 1 changed-path used when version 1 requested' '
+	(cd highbit1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v6 6/7] repo-settings: introduce commitgraph.changedPathsVersion
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
                     ` (4 preceding siblings ...)
  2023-07-20 21:46   ` [PATCH v6 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-20 21:46   ` [PATCH v6 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version to
write (and read), a config variable is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is a recognized value.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt | 23 ++++++++++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..2dc9170622 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,23 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to commitGraph.changedPathsVersion=-1 if true, and
+	commitGraph.changedPathsVersion=0 if false. (If commitGraph.changedPathVersion
+	is also set, commitGraph.changedPathsVersion takes precedence.)
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be -1, 0 or 1.
++
+Defaults to -1.
++
+If -1, Git will use the version of the changed-path Bloom filters in the
+repository, defaulting to 1 if there are none.
++
+If 0, Git will not read any Bloom filters, and will write version 1 Bloom
+filters when instructed to write.
++
+If 1, Git will only read version 1 Bloom filters, and will write version 1
+Bloom filters.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index efc697e437..1f26c07de4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -401,7 +401,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 2992079dd9..325c0b991a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -19,7 +19,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 525f69c0c7..db8fe817f3 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int read_changed_paths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &read_changed_paths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     read_changed_paths ? -1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index 5f18486f64..f71154e12c 100644
--- a/repository.h
+++ b/repository.h
@@ -29,7 +29,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 	int command_requires_full_index;
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v6 7/7] commit-graph: new filter ver. that fixes murmur3
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
                     ` (5 preceding siblings ...)
  2023-07-20 21:46   ` [PATCH v6 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-07-20 21:46   ` Jonathan Tan
  2023-07-25 20:52   ` [PATCH v6 0/7] Changed path filter hash fix and version bump Junio C Hamano
  2023-07-26 20:39   ` Junio C Hamano
  8 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-07-20 21:46 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 (thus,
the changed path filters wouldn't be readable by other off-the-shelf
implementatios of murmur3) and even with Git binaries that were compiled
with different signedness of char. This bug affects both how Git writes
changed path filters to disk and how Git interprets changed path filters
on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number; this version number may be explicitly
stated (typically if the user knows which version they need) or
automatically determined from the version of the existing changed path
filters in the repository.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  5 +-
 bloom.c                              | 69 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 31 +++++++++++--
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 67 ++++++++++++++++++++++++++-
 7 files changed, 183 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 2dc9170622..acc74a2f27 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -15,7 +15,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be -1, 0 or 1.
+	write. May be -1, 0, 1, or 2.
 +
 Defaults to -1.
 +
@@ -28,4 +28,7 @@ filters when instructed to write.
 If 1, Git will only read version 1 Bloom filters, and will write version 1
 Bloom filters.
 +
+If 2, Git will only read version 2 Bloom filters, and will write version 2
+Bloom filters.
++
 See linkgit:git-commit-graph[1] for more information.
diff --git a/bloom.c b/bloom.c
index 3e78cfe79d..ebef5cfd2f 100644
--- a/bloom.c
+++ b/bloom.c
@@ -66,7 +66,64 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -131,8 +188,14 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	uint32_t hash0, hash1;
+	if (settings->hash_version == 2) {
+		hash0 = murmur3_seeded_v2(seed0, data, len);
+		hash1 = murmur3_seeded_v2(seed1, data, len);
+	} else {
+		hash0 = murmur3_seeded_v1(seed0, data, len);
+		hash1 = murmur3_seeded_v1(seed1, data, len);
+	}
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index 1e4f612d2c..138d57a86b 100644
--- a/bloom.h
+++ b/bloom.h
@@ -8,9 +8,11 @@ struct commit_graph;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -80,7 +82,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index 1f26c07de4..1a6685f554 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -304,16 +304,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_context {
+	struct commit_graph *g;
+	int *commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_context *c = data;
+	struct commit_graph *g = c->g;
 	uint32_t hash_version;
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
-		return 0;
+	if (*c->commit_graph_changed_paths_version == -1) {
+		*c->commit_graph_changed_paths_version = hash_version;
+	} else if (hash_version != *c->commit_graph_changed_paths_version) {
+ 		return 0;
+	}
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
 	g->bloom_filter_settings->hash_version = hash_version;
@@ -402,10 +411,14 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}
 
 	if (s->commit_graph_changed_paths_version) {
+		struct graph_read_bloom_data_context context = {
+			.g = graph,
+			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &context);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2371,6 +2384,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < -1
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2400,7 +2421,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index aabe31d724..3cbc0a5b50 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -64,7 +65,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c49528b947..cf4c78ddf9 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -418,7 +418,7 @@ test_expect_success 'set up repo with high bit path, version 1 changed-path' '
 	git -C highbit1 commit-graph write --reachable --changed-paths
 '
 
-test_expect_success 'setup check value of version 1 changed-path' '
+test_expect_success 'check value of version 1 changed-path' '
 	(cd highbit1 &&
 		echo "52a9" >expect &&
 		get_first_changed_path_filter >actual &&
@@ -443,4 +443,69 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
 		test_bloom_filters_used "-- $CENT")
 '
 
+test_expect_success 'version 1 changed-path not used when version 2 requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'version 1 changed-path used when autodetect requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
+	test_commit -C highbit1 c1double "$CENT$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths &&
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		echo "options: bloom(1,10,7) read_generation_data" >expect &&
+		test-tool read-graph >full &&
+		grep options full >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 2 changed-path' '
+	(cd highbit2 &&
+		echo "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 2 changed-path used when version 2 requested' '
+	(cd highbit2 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path not used when version 1 requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path used when autodetect requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
+	test_commit -C highbit2 c2double "$CENT$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths &&
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		echo "options: bloom(2,10,7) read_generation_data" >expect &&
+		test-tool read-graph >full &&
+		grep options full >actual &&
+		test_cmp expect actual)
+'
+
 test_done
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT
  2023-07-20 20:20       ` Jonathan Tan
@ 2023-07-21  1:38         ` Taylor Blau
  0 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-21  1:38 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Jul 20, 2023 at 01:20:06PM -0700, Jonathan Tan wrote:
> Having said that, I am inclined to not change this, so that the offset
> calculations are the same for both versions (e.g. in the test tool
> too), and as far as I know, we haven't had problems with this. But I can
> change it if people want.

Yeah, I definitely get your hesitation. I wonder if the test helper
patches that I sent change your thinking on this at all, since then
we're not relying on those scripts at all to compute the offset of the
BDAT chunk and dump its contents.

To be clear, I do think that this is probably outside the immediate
scope of this series. But since we're changing the on-disk format and
bumping the version count forward, I think that we want to make sure
that we're not missing any other format changes that we'd like to make
along the way.

I suppose that a theoretical v3 of the BDAT chunk's format would use the
same encoding for Bloom filters. But storing an extra 4-bytes of
information in the header of the Bloom data chunk feels like we could
squash that in.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
                     ` (6 preceding siblings ...)
  2023-07-20 21:46   ` [PATCH v6 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-07-25 20:52   ` Junio C Hamano
  2023-07-26 20:39   ` Junio C Hamano
  8 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-07-25 20:52 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Taylor Blau

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks, Junio and Taylor, for your reviews. I have included Taylor's
> patches in this patch set.
>
> There seemed to be some merge conflicts when I tried to apply the
> patches Taylor provided on the base that I built my patches on (that is,
> the base of jt/path-filter-fix, namely, maint-2.40), so I have rebased
> all my patches onto latest master.
>
> Jonathan Tan (4):
>   gitformat-commit-graph: describe version 2 of BDAT
>   t4216: test changed path filters with high bit paths
>   repo-settings: introduce commitgraph.changedPathsVersion
>   commit-graph: new filter ver. that fixes murmur3
>
> Taylor Blau (3):
>   t/helper/test-read-graph.c: extract `dump_graph_info()`
>   bloom.h: make `load_bloom_filter_from_graph()` public
>   t/helper/test-read-graph: implement `bloom-filters` mode

Thanks, I seem to have missed this one.  Let's queue this version
and merge it down to 'next', unless there is no other blocking
comments in a few days.

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
                     ` (7 preceding siblings ...)
  2023-07-25 20:52   ` [PATCH v6 0/7] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-07-26 20:39   ` Junio C Hamano
  2023-07-27  0:17     ` Taylor Blau
  8 siblings, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-07-26 20:39 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Taylor Blau, Derrick Stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> Jonathan Tan (4):
>   gitformat-commit-graph: describe version 2 of BDAT
>   t4216: test changed path filters with high bit paths
>   repo-settings: introduce commitgraph.changedPathsVersion
>   commit-graph: new filter ver. that fixes murmur3
>
> Taylor Blau (3):
>   t/helper/test-read-graph.c: extract `dump_graph_info()`
>   bloom.h: make `load_bloom_filter_from_graph()` public
>   t/helper/test-read-graph: implement `bloom-filters` mode

After a week, nobody seems to have found anything worth saying about
these patches.  Does the design, especially the migration story, now
look sensible to everybody?  I was contemplating to mark the topic
for 'next' after reading them myself once more.

Thanks.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()`
  2023-07-20 21:46   ` [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
@ 2023-07-26 23:26     ` Taylor Blau
  0 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-26 23:26 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano

On Thu, Jul 20, 2023 at 02:46:35PM -0700, Jonathan Tan wrote:
> From: Taylor Blau <me@ttaylorr.com>
>
> Prepare for the 'read-graph' test helper to perform other tasks besides
> dumping high-level information about the commit-graph by extracting its
> main routine into a separate function.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

Missing sign-off from Jonathan?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 5/7] t4216: test changed path filters with high bit paths
  2023-07-20 21:46   ` [PATCH v6 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-07-26 23:28     ` Taylor Blau
  0 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-26 23:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano

On Thu, Jul 20, 2023 at 02:46:38PM -0700, Jonathan Tan wrote:
> +test_expect_success 'setup check value of version 1 changed-path' '
> +	(cd highbit1 &&
> +		echo "52a9" >expect &&
> +		get_first_changed_path_filter >actual &&
> +		test_cmp expect actual)
> +'

This is a little bit of funky indentation, I probably would have
expected something more along the lines of:

    (
      cd highbit1 &&
      echo "52a9" >expect &&
      get_first_changed_path_filter >actual &&
      test_cmp expect actual
    )

but this obviously doesn't merit a reroll on its own.

> +test_expect_success 'version 1 changed-path used when version 1 requested' '
> +	(cd highbit1 &&
> +		test_bloom_filters_used "-- $CENT")
> +'

Same here, but neither of these is incorrect.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3
  2023-07-20 21:27       ` Jonathan Tan
@ 2023-07-26 23:32         ` Taylor Blau
  0 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-07-26 23:32 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Derrick Stolee, Junio C Hamano

On Thu, Jul 20, 2023 at 02:27:53PM -0700, Jonathan Tan wrote:
> > I think the early checks would be more expensive, since in the worst
> > case you have to walk the entire tree, only to realize that you actually
> > wanted to compute a first-parent tree diff, meaning you have to
> > essentially repeat the whole walk over again. But for repositories that
> > have few or no commits whose Bloom filters need computing, I think it
> > would be significantly faster, since many of the sub-trees wouldn't need
> > to be visited again.
>
> So for repositories that need little-to-no recomputation of Bloom
> filters, your idea likely means that each tree needs to be read once,
> as compared to recomputing everything in which, I think, each tree needs
> to be read roughly twice (once when computing the Bloom filter for the
> commit that introduces it, and once for the commit that substitutes a
> different tree in place).
>
> I could change the text of the commit message to discuss this (instead
> of the blanket statement that it would be too hard), although I think
> that an implementation of this can be done after this patchset. What do
> you think?

Right, I think that a sizeable portion of repositories will need to
compute relatively few Bloom filters overall. If you feel strongly that
it shouldn't be included in this series, I could live with that since
this is all behind a configuration variable anyway.

I think at minimum we should call it out in the documentation, at least
until such functionality is implemented, since unsuspecting users/forge
operators may bump the filter version forward and be surprised when they
suddenly have to recompute every existing Bloom filter.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-26 20:39   ` Junio C Hamano
@ 2023-07-27  0:17     ` Taylor Blau
  2023-07-27  0:49       ` Junio C Hamano
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-07-27  0:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Derrick Stolee

On Wed, Jul 26, 2023 at 01:39:14PM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > Jonathan Tan (4):
> >   gitformat-commit-graph: describe version 2 of BDAT
> >   t4216: test changed path filters with high bit paths
> >   repo-settings: introduce commitgraph.changedPathsVersion
> >   commit-graph: new filter ver. that fixes murmur3
> >
> > Taylor Blau (3):
> >   t/helper/test-read-graph.c: extract `dump_graph_info()`
> >   bloom.h: make `load_bloom_filter_from_graph()` public
> >   t/helper/test-read-graph: implement `bloom-filters` mode
>
> After a week, nobody seems to have found anything worth saying about
> these patches.  Does the design, especially the migration story, now
> look sensible to everybody?  I was contemplating to mark the topic
> for 'next' after reading them myself once more.

Sorry for not getting to this sooner. I didn't notice anything during my
review, but I think there may be a bug here.

Suppose that we have an existing commit-graph with v1 Bloom filters. If
we then try to rewrite that commit-graph using v2 Bloom filters, we
*should* attempt to recompute the filter from scratch. But AFAICT, that
isn't what happens.

Here's my test setup:

    test_expect_success 'test' '
      test_commit base &&
      git repack -d &&

      git -c commitGraph.changedPathVersion=1 commit-graph write --changed-paths &&
      debug git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths
    '

if you attach a debugger to the second process, and break inside of
get_or_compute_bloom_filter() when compute_if_not_present is set, you'll
see that Git will pass along the existing *v1* Bloom filter, and then
write its contents to the new commit-graph:

    (gdb) b get_or_compute_bloom_filter if compute_if_not_present
    Breakpoint 1 at 0x14340f: file bloom.c, line 260.
    (gdb) r
    Starting program: /home/ttaylorr/src/git/git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Breakpoint 1, get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0,
        compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:260
    260		if (computed)
    (gdb) until 271
    get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1,
        settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:271
    271				load_bloom_filter_from_graph(r->objects->commit_graph,
    (gdb) p *filter
    $2 = {data = 0x0, len = 0}
    (gdb) n
    275		if (filter->data && filter->len)
    (gdb) p *filter
    $3 = {data = 0x7ffff7fc24a8 "\210\210\322a\267\234\214s}\004J\265\313\201\241\032e\312\034", len = 2}

If I'm parsing this all correctly, Git used the v1 filter corresponding
to that commit, and did not recompute a new one.

I think that this could lead to incorrect results if you use this to
masquerade a v1 Bloom filter as a v2 one. Since they use different
implementations (one correct, one not) of murmur3, that opens us up to
false negatives, at which point all bets are off.

So I think we want to be more careful about when we load the existing
Bloom data or not. I think we probably want to load it in all cases, but
only read it when we have compatible Bloom settings. That should stop us
from exhibiting this kind of bug, but it also gives us a handle on
existing Bloom data if we wanted to copy forward existing results where
we can.

If all of this tracks, I think that there is a gap in our test coverage
that didn't catch this earlier.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-27  0:17     ` Taylor Blau
@ 2023-07-27  0:49       ` Junio C Hamano
  2023-07-27 17:39         ` Jonathan Tan
  2023-07-27 18:44         ` Junio C Hamano
  0 siblings, 2 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-07-27  0:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Sorry for not getting to this sooner. I didn't notice anything during my
> review, but I think there may be a bug here.
>
> Suppose that we have an existing commit-graph with v1 Bloom filters. If
> we then try to rewrite that commit-graph using v2 Bloom filters, we
> *should* attempt to recompute the filter from scratch. But AFAICT, that
> isn't what happens.
> ...
> If I'm parsing this all correctly, Git used the v1 filter corresponding
> to that commit, and did not recompute a new one.
>
> I think that this could lead to incorrect results if you use this to
> masquerade a v1 Bloom filter as a v2 one.

That indeed is very bad.  How hard it would be to construct a test
case that fails if filter computed as v1 is marketed as v2?  A test
may be far easier to construct if it does not have to be end-to-end
(e.g. instrument the codepath you followed through with the debugger
with trace2 annotations and see the control takes the right branches
by reading the log).

> So I think we want to be more careful about when we load the existing
> Bloom data or not. I think we probably want to load it in all cases, but
> only read it when we have compatible Bloom settings. That should stop us
> from exhibiting this kind of bug, but it also gives us a handle on
> existing Bloom data if we wanted to copy forward existing results where
> we can.
>
> If all of this tracks, I think that there is a gap in our test coverage
> that didn't catch this earlier.

Yeah, thanks for raising a concern.

Jonathan?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-27  0:49       ` Junio C Hamano
@ 2023-07-27 17:39         ` Jonathan Tan
  2023-07-27 17:56           ` Taylor Blau
  2023-07-27 18:44         ` Junio C Hamano
  1 sibling, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-27 17:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Taylor Blau, git, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:
> Taylor Blau <me@ttaylorr.com> writes:
> > Suppose that we have an existing commit-graph with v1 Bloom filters. If
> > we then try to rewrite that commit-graph using v2 Bloom filters, we
> > *should* attempt to recompute the filter from scratch. But AFAICT, that
> > isn't what happens.
> > ...
> > If I'm parsing this all correctly, Git used the v1 filter corresponding
> > to that commit, and did not recompute a new one.
> >
> > I think that this could lead to incorrect results if you use this to
> > masquerade a v1 Bloom filter as a v2 one.
> 
> That indeed is very bad.  How hard it would be to construct a test
> case that fails if filter computed as v1 is marketed as v2?  A test
> may be far easier to construct if it does not have to be end-to-end
> (e.g. instrument the codepath you followed through with the debugger
> with trace2 annotations and see the control takes the right branches
> by reading the log).

Ah, thanks, Taylor, for so meticulously investigating this. I'll take a look.

A test should be doable - we already have tests (the ones that use
"get_first_changed_path_filter") that check the bytes of the filter
generated, so we should be able to write a test that writes one version,
writes the other, then checks the bytes.

> > So I think we want to be more careful about when we load the existing
> > Bloom data or not. I think we probably want to load it in all cases, but
> > only read it when we have compatible Bloom settings. That should stop us
> > from exhibiting this kind of bug, but it also gives us a handle on
> > existing Bloom data if we wanted to copy forward existing results where
> > we can.

The intention in the current patch set was to not load it at all when we
have incompatible Bloom settings because it appeared quite troublesome
to notate which Bloom filter in memory is of which version. If we want
to copy forward existing results, we can change that, but I don't know
whether it's worth doing that (and if we think it's worth doing, this
should probably go in another patch set).

> > If all of this tracks, I think that there is a gap in our test coverage
> > that didn't catch this earlier.
> 
> Yeah, thanks for raising a concern.
> 
> Jonathan?

I'll take a look. Yes this does seem like a gap in test coverage -
I thought the existing test that checks that Bloom filters are not
used when a different version is requested would be sufficient, but
apparently that's not the case.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-27 17:39         ` Jonathan Tan
@ 2023-07-27 17:56           ` Taylor Blau
  2023-07-27 20:53             ` Jonathan Tan
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-07-27 17:56 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, Derrick Stolee

On Thu, Jul 27, 2023 at 10:39:05AM -0700, Jonathan Tan wrote:
> A test should be doable - we already have tests (the ones that use
> "get_first_changed_path_filter") that check the bytes of the filter
> generated, so we should be able to write a test that writes one version,
> writes the other, then checks the bytes.

Thanks for looking into it!

> > > So I think we want to be more careful about when we load the existing
> > > Bloom data or not. I think we probably want to load it in all cases, but
> > > only read it when we have compatible Bloom settings. That should stop us
> > > from exhibiting this kind of bug, but it also gives us a handle on
> > > existing Bloom data if we wanted to copy forward existing results where
> > > we can.
>
> The intention in the current patch set was to not load it at all when we
> have incompatible Bloom settings because it appeared quite troublesome
> to notate which Bloom filter in memory is of which version. If we want
> to copy forward existing results, we can change that, but I don't know
> whether it's worth doing that (and if we think it's worth doing, this
> should probably go in another patch set).

Yeah, I think having Bloom filters accessible from a commit-graph
regardless of whether or not it matches our Bloom filter version is
prerequisite to being able to implement something like this.

I feel like this is important enough to do in the same patch set, or the
same release to avoid surprising operators when their commit-graph write
suddenly recomputes all of its Bloom filters.

Since we already store the Bloom version that we're using in the
commit-graph file itself, shouldn't it be something along the lines of
sticking that value onto the bloom_filter when we read its contents?

Although I don't think that you'd even need to annotate each individual
filter, since you know that every pre-existing Bloom filter you are able
to find has the version given by:

    the_repository->objects->commit_graph->bloom_filter_settings->hash_version

right?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-27  0:49       ` Junio C Hamano
  2023-07-27 17:39         ` Jonathan Tan
@ 2023-07-27 18:44         ` Junio C Hamano
  1 sibling, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-07-27 18:44 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> ...  How hard it would be to construct a test
> case that fails if filter computed as v1 is marketed as v2?

There may be a more effective way for future-proofing, besides
ensuring the test coverage.

Although this series added a way to see which version the on-disk
data is using using the version field, I do not think it touched the
"struct bloom_filter" and "struct bloom_key" that represent the
in-core data.  If we had a member in these structures that
get_or_compute_bloom_filter() can fill in from the on-disk structure
or the version it used when it computed the filter anew, would it
become easier to catch the case where we try to add a version 2
computed key to a filter that was read from version 1 on-disk
structure, presumably at add_key_to_filter()?

Thanks.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-27 17:56           ` Taylor Blau
@ 2023-07-27 20:53             ` Jonathan Tan
  2023-08-01 18:08               ` Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Jonathan Tan @ 2023-07-27 20:53 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, Junio C Hamano, git, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:
> > The intention in the current patch set was to not load it at all when we
> > have incompatible Bloom settings because it appeared quite troublesome
> > to notate which Bloom filter in memory is of which version. If we want
> > to copy forward existing results, we can change that, but I don't know
> > whether it's worth doing that (and if we think it's worth doing, this
> > should probably go in another patch set).
> 
> Yeah, I think having Bloom filters accessible from a commit-graph
> regardless of whether or not it matches our Bloom filter version is
> prerequisite to being able to implement something like this.
> 
> I feel like this is important enough to do in the same patch set, or the
> same release to avoid surprising operators when their commit-graph write
> suddenly recomputes all of its Bloom filters.

Suddenly reading many (or most) of the repo's trees would be a similar
surprise, right?

Also this would happen only if the server operator explicitly sets a
changed path filter version. If they leave it as-is, commit graphs will
still be written with the same version as the one on disk.

> Since we already store the Bloom version that we're using in the
> commit-graph file itself, shouldn't it be something along the lines of
> sticking that value onto the bloom_filter when we read its contents?
> 
> Although I don't think that you'd even need to annotate each individual
> filter, since you know that every pre-existing Bloom filter you are able
> to find has the version given by:
> 
>     the_repository->objects->commit_graph->bloom_filter_settings->hash_version
> 
> right?
> 
> Thanks,
> Taylor

Regarding consulting commit_graph->bloom_filter_settings->hash_version,
the issue I ran into was that firstly, I didn't know what to do about
commit_graph->base_graph (which also has its own bloom_filter_settings)
and what to do if it had a contradictory hash_version. And even if
we found a way to unify those, it is not true that every Bloom filter
in memory is of that version, since we may have generated some that
correspond to the version we're writing (not the version on disk).
In particular, the Bloom filters we write come from a commit slab
(bloom_filters in bloom.c) and in that slab, both Bloom filters from
disk and Bloom filters that were generated in-process coexist.

I also thought of your other proposal of augmenting struct bloom_filter
to also include the version. The issue I ran into there is if, for a
given commit, there already exists a Bloom filter read from disk with
the wrong version, what should we do? Overwrite it, or store both
versions in memory? (We can't just immediately output the Bloom filter
to disk and forget about the new version, only storing its size so that
we can generate the BIDX, because in the current code, generation and
writing to disk are separate. We could try to refactor it, but I didn't
want to make such a large change to reduce the possibility of bugs.)
Both storing the version number and storing an additional pointer for a
second version would increase memory consumption too, even when
supporting two versions isn't needed, but maybe this isn't a big deal.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-07-27 20:53             ` Jonathan Tan
@ 2023-08-01 18:08               ` Taylor Blau
  2023-08-01 18:52                 ` Jonathan Tan
  2023-08-03  0:01                 ` Taylor Blau
  0 siblings, 2 replies; 116+ messages in thread
From: Taylor Blau @ 2023-08-01 18:08 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, Derrick Stolee

On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > > The intention in the current patch set was to not load it at all when we
> > > have incompatible Bloom settings because it appeared quite troublesome
> > > to notate which Bloom filter in memory is of which version. If we want
> > > to copy forward existing results, we can change that, but I don't know
> > > whether it's worth doing that (and if we think it's worth doing, this
> > > should probably go in another patch set).
> >
> > Yeah, I think having Bloom filters accessible from a commit-graph
> > regardless of whether or not it matches our Bloom filter version is
> > prerequisite to being able to implement something like this.
> >
> > I feel like this is important enough to do in the same patch set, or the
> > same release to avoid surprising operators when their commit-graph write
> > suddenly recomputes all of its Bloom filters.
>
> Suddenly reading many (or most) of the repo's trees would be a similar
> surprise, right?

That's a good point. I think in general I'd expect Git to avoid
recomputing Bloom filters where that work can be avoided, if the work in
order to detect whether or not we need to recompute a filter is cheap
enough to carry out.

> Also this would happen only if the server operator explicitly sets a
> changed path filter version. If they leave it as-is, commit graphs will
> still be written with the same version as the one on disk.

I think that I could live with that if the default is to leave things
as-is.

I still think that it's worth it to have this functionality to propagate
Bloom filters forward should ship in Git, but we can work on that
outside of this series.

> Regarding consulting commit_graph->bloom_filter_settings->hash_version,
> the issue I ran into was that firstly, I didn't know what to do about
> commit_graph->base_graph (which also has its own bloom_filter_settings)
> and what to do if it had a contradictory hash_version. And even if
> we found a way to unify those, it is not true that every Bloom filter
> in memory is of that version, since we may have generated some that
> correspond to the version we're writing (not the version on disk).
> In particular, the Bloom filters we write come from a commit slab
> (bloom_filters in bloom.c) and in that slab, both Bloom filters from
> disk and Bloom filters that were generated in-process coexist.

Would we ever want to have a commit-graph chain with mixed Bloom filter
versions?

We avoid mixing generation number schemes across multiple layers of a
commit-graph chain. But I don't see any reason that we should or need to
have a similar restriction in place for the Bloom filter version. Both
are readable, as long as the user-provided configuration allows them to
be.

We just have to interpret them differently depending on what layer of
the graph (and therefore, what Bloom filter version they are) they come
from.

Sorry for thinking aloud a little there. I think that this means that we
at minimum have to keep in context the commit-graph layer we found the
Bloom filter in so that we can tie that back to its Bloom filter
version. That might just mean that we have to tag each Bloom filter with
the version it was computed under, or maybe we already have the
commit-graph layer in context, in which case we shouldn't need an
additional field.

My gut is telling me that we probably *do* need such a field, since we
don't often have a reference to the particular layer that we found a
Bloom filter in, just the tip of the commit-graph chain that it came
from.

> I also thought of your other proposal of augmenting struct bloom_filter
> to also include the version. The issue I ran into there is if, for a
> given commit, there already exists a Bloom filter read from disk with
> the wrong version, what should we do? Overwrite it, or store both
> versions in memory? (We can't just immediately output the Bloom filter
> to disk and forget about the new version, only storing its size so that
> we can generate the BIDX, because in the current code, generation and
> writing to disk are separate. We could try to refactor it, but I didn't
> want to make such a large change to reduce the possibility of bugs.)
> Both storing the version number and storing an additional pointer for a
> second version would increase memory consumption too, even when
> supporting two versions isn't needed, but maybe this isn't a big deal.

It's likely that I'm missing something here, but what is stopping us
from discarding the old Bloom filter as soon as we generate the new
one? We shouldn't need to load the old filter again out of the commit
slab, right?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v7 0/7] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (7 preceding siblings ...)
  2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
@ 2023-08-01 18:41 ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
                     ` (7 more replies)
  8 siblings, 8 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

I've fixed the bug that Taylor described. It was an issue where the
presence of Bloom filters can be indicated both by the presence of the
chunk and the presence of a bloom_filter_settings, and I've fixed it by
avoiding setting the chunk_bloom_data if we're not using Bloom filters
due to an incompatible version. In the future, we might want to refactor
the code so that there is only one way to indicate whether the Bloom
filters are present.

I've also added sign-offs and changed the indentation of the tests, as
remarked by Taylor [1] [2].

[1] https://lore.kernel.org/git/ZMGruenDbAo22aWV@nand.local/
[2] https://lore.kernel.org/git/ZMGsJTxBXZ94lhMU@nand.local/

Taylor also suggested copying forward Bloom filters whenever possible
in this patch set [3], but also that we could work on this outside this
series [4]. I did not implement this in this series.

[3] https://lore.kernel.org/git/ZMKvsObx+uaKA8zF@nand.local/
[4] https://lore.kernel.org/git/ZMlKMmAs3wKULAOd@nand.local/

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

Taylor Blau (3):
  t/helper/test-read-graph.c: extract `dump_graph_info()`
  bloom.h: make `load_bloom_filter_from_graph()` public
  t/helper/test-read-graph: implement `bloom-filters` mode

 Documentation/config/commitgraph.txt     |  26 ++++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  |  75 +++++++++++-
 bloom.h                                  |  13 ++-
 commit-graph.c                           |  35 ++++--
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 t/helper/test-bloom.c                    |   9 +-
 t/helper/test-read-graph.c               |  67 ++++++++---
 t/t0095-bloom.sh                         |   8 ++
 t/t4216-log-bloom.sh                     | 139 +++++++++++++++++++++++
 12 files changed, 351 insertions(+), 40 deletions(-)

Range-diff against v6:
1:  3ce6090a4d = 1:  3ce6090a4d gitformat-commit-graph: describe version 2 of BDAT
2:  1955734d1f ! 2:  fc6346c039 t/helper/test-read-graph.c: extract `dump_graph_info()`
    @@ Commit message
         main routine into a separate function.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
    +    Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
     
      ## t/helper/test-read-graph.c ##
     @@
3:  4cf7c2f634 ! 3:  f144dc4b15 bloom.h: make `load_bloom_filter_from_graph()` public
    @@ Commit message
         for manual inspection (to be used during tests).
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
    +    Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
     
      ## bloom.c ##
     @@ bloom.c: static inline unsigned char get_bitmask(uint32_t pos)
4:  47b55758e6 ! 4:  2ade832a23 t/helper/test-read-graph: implement `bloom-filters` mode
    @@ Commit message
         hexadecimal contents of the Bloom filter(s) contained in a commit-graph.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
    +    Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
     
      ## t/helper/test-read-graph.c ##
     @@ t/helper/test-read-graph.c: static void dump_graph_info(struct commit_graph *graph)
5:  5276e6a90e ! 5:  74863c11e5 t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +'
     +
     +test_expect_success 'setup check value of version 1 changed-path' '
    -+	(cd highbit1 &&
    ++	(
    ++		cd highbit1 &&
     +		echo "52a9" >expect &&
     +		get_first_changed_path_filter >actual &&
    -+		test_cmp expect actual)
    ++		test_cmp expect actual
    ++	)
     +'
     +
     +# expect will not match actual if char is unsigned by default. Write the test
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +'
     +
     +test_expect_success 'version 1 changed-path used when version 1 requested' '
    -+	(cd highbit1 &&
    -+		test_bloom_filters_used "-- $CENT")
    ++	(
    ++		cd highbit1 &&
    ++		test_bloom_filters_used "-- $CENT"
    ++	)
     +'
     +
      test_done
6:  dc3f6d2d4f = 6:  60f4faeff9 repo-settings: introduce commitgraph.changedPathsVersion
7:  6e2d797406 ! 7:  68258cfd04 commit-graph: new filter ver. that fixes murmur3
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     +	struct graph_read_bloom_data_context *c = data;
     +	struct commit_graph *g = c->g;
      	uint32_t hash_version;
    - 	g->chunk_bloom_data = chunk_start;
    +-	g->chunk_bloom_data = chunk_start;
      	hash_version = get_be32(chunk_start);
      
     -	if (hash_version != 1)
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     + 		return 0;
     +	}
      
    ++	g->chunk_bloom_data = chunk_start;
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
      	g->bloom_filter_settings->hash_version = hash_version;
    + 	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' '
    - 	git -C highbit1 commit-graph write --reachable --changed-paths
    - '
    - 
    --test_expect_success 'setup check value of version 1 changed-path' '
    -+test_expect_success 'check value of version 1 changed-path' '
    - 	(cd highbit1 &&
    - 		echo "52a9" >expect &&
    - 		get_first_changed_path_filter >actual &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
    - 		test_bloom_filters_used "-- $CENT")
    + 	)
      '
      
     +test_expect_success 'version 1 changed-path not used when version 2 requested' '
    -+	(cd highbit1 &&
    ++	(
    ++		cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion 2 &&
    -+		test_bloom_filters_not_used "-- $CENT")
    ++		test_bloom_filters_not_used "-- $CENT"
    ++	)
     +'
     +
     +test_expect_success 'version 1 changed-path used when autodetect requested' '
    -+	(cd highbit1 &&
    ++	(
    ++		cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
    -+		test_bloom_filters_used "-- $CENT")
    ++		test_bloom_filters_used "-- $CENT"
    ++	)
     +'
     +
     +test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
     +	test_commit -C highbit1 c1double "$CENT$CENT" &&
     +	git -C highbit1 commit-graph write --reachable --changed-paths &&
    -+	(cd highbit1 &&
    ++	(
    ++		cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
     +		echo "options: bloom(1,10,7) read_generation_data" >expect &&
     +		test-tool read-graph >full &&
     +		grep options full >actual &&
    -+		test_cmp expect actual)
    ++		test_cmp expect actual
    ++	)
     +'
     +
     +test_expect_success 'set up repo with high bit path, version 2 changed-path' '
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +'
     +
     +test_expect_success 'check value of version 2 changed-path' '
    -+	(cd highbit2 &&
    ++	(
    ++		cd highbit2 &&
     +		echo "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
    -+		test_cmp expect actual)
    ++		test_cmp expect actual
    ++	)
     +'
     +
     +test_expect_success 'version 2 changed-path used when version 2 requested' '
    -+	(cd highbit2 &&
    -+		test_bloom_filters_used "-- $CENT")
    ++	(
    ++		cd highbit2 &&
    ++		test_bloom_filters_used "-- $CENT"
    ++	)
     +'
     +
     +test_expect_success 'version 2 changed-path not used when version 1 requested' '
    -+	(cd highbit2 &&
    ++	(
    ++		cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion 1 &&
    -+		test_bloom_filters_not_used "-- $CENT")
    ++		test_bloom_filters_not_used "-- $CENT"
    ++	)
     +'
     +
     +test_expect_success 'version 2 changed-path used when autodetect requested' '
    -+	(cd highbit2 &&
    ++	(
    ++		cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
    -+		test_bloom_filters_used "-- $CENT")
    ++		test_bloom_filters_used "-- $CENT"
    ++	)
     +'
     +
     +test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
     +	test_commit -C highbit2 c2double "$CENT$CENT" &&
     +	git -C highbit2 commit-graph write --reachable --changed-paths &&
    -+	(cd highbit2 &&
    ++	(
    ++		cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
     +		echo "options: bloom(2,10,7) read_generation_data" >expect &&
     +		test-tool read-graph >full &&
     +		grep options full >actual &&
    -+		test_cmp expect actual)
    ++		test_cmp expect actual
    ++	)
    ++'
    ++
    ++test_expect_success 'when writing commit graph, do not reuse changed-path of another version' '
    ++	git init doublewrite &&
    ++	test_commit -C doublewrite c "$CENT" &&
    ++	git -C doublewrite config --add commitgraph.changedPathsVersion 1 &&
    ++	git -C doublewrite commit-graph write --reachable --changed-paths &&
    ++	git -C doublewrite config --add commitgraph.changedPathsVersion 2 &&
    ++	git -C doublewrite commit-graph write --reachable --changed-paths &&
    ++	(
    ++		cd doublewrite &&
    ++		echo "c01f" >expect &&
    ++		get_first_changed_path_filter >actual &&
    ++		test_cmp expect actual
    ++	)
     +'
     +
      test_done
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v7 1/7] gitformat-commit-graph: describe version 2 of BDAT
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

The code change to Git to support version 2 will be done in subsequent
commits.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/gitformat-commit-graph.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
index 31cad585e2..3e906e8030 100644
--- a/Documentation/gitformat-commit-graph.txt
+++ b/Documentation/gitformat-commit-graph.txt
@@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
 
 ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     * It starts with header consisting of three unsigned 32-bit integers:
-      - Version of the hash algorithm being used. We currently only support
-	value 1 which corresponds to the 32-bit version of the murmur3 hash
+      - Version of the hash algorithm being used. We currently support
+	value 2 which corresponds to the 32-bit version of the murmur3 hash
 	implemented exactly as described in
 	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
 	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
 	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
-	in Probabilistic Verification"
+	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
+	when char is signed and the repository has path names that have characters >=
+	0x80; Git supports reading and writing them, but this ability will be removed
+	in a future version of Git.
       - The number of times a path is hashed and hence the number of bit positions
 	      that cumulatively determine whether a file is present in the commit.
       - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v7 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()`
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Junio C Hamano, Jonathan Tan

From: Taylor Blau <me@ttaylorr.com>

Prepare for the 'read-graph' test helper to perform other tasks besides
dumping high-level information about the commit-graph by extracting its
main routine into a separate function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/helper/test-read-graph.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 8c7a83f578..c664928412 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -5,20 +5,8 @@
 #include "bloom.h"
 #include "setup.h"
 
-int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+static void dump_graph_info(struct commit_graph *graph)
 {
-	struct commit_graph *graph = NULL;
-	struct object_directory *odb;
-
-	setup_git_directory();
-	odb = the_repository->objects->odb;
-
-	prepare_repo_settings(the_repository);
-
-	graph = read_commit_graph_one(the_repository, odb);
-	if (!graph)
-		return 1;
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
@@ -57,8 +45,27 @@ int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
 	if (graph->topo_levels)
 		printf(" topo_levels");
 	printf("\n");
+}
+
+int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+{
+	struct commit_graph *graph = NULL;
+	struct object_directory *odb;
+
+	setup_git_directory();
+	odb = the_repository->objects->odb;
+
+	prepare_repo_settings(the_repository);
+
+	graph = read_commit_graph_one(the_repository, odb);
+	if (!graph)
+		return 1;
+
+	dump_graph_info(graph);
 
 	UNLEAK(graph);
 
 	return 0;
 }
+
+
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v7 3/7] bloom.h: make `load_bloom_filter_from_graph()` public
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Junio C Hamano, Jonathan Tan

From: Taylor Blau <me@ttaylorr.com>

Prepare for a future commit to use the load_bloom_filter_from_graph()
function directly to load specific Bloom filters out of the commit-graph
for manual inspection (to be used during tests).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 bloom.c | 6 +++---
 bloom.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index aef6b5fea2..3e78cfe79d 100644
--- a/bloom.c
+++ b/bloom.c
@@ -29,9 +29,9 @@ static inline unsigned char get_bitmask(uint32_t pos)
 	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
 }
 
-static int load_bloom_filter_from_graph(struct commit_graph *g,
-					struct bloom_filter *filter,
-					uint32_t graph_pos)
+int load_bloom_filter_from_graph(struct commit_graph *g,
+				 struct bloom_filter *filter,
+				 uint32_t graph_pos)
 {
 	uint32_t lex_pos, start_index, end_index;
 
diff --git a/bloom.h b/bloom.h
index adde6dfe21..1e4f612d2c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -3,6 +3,7 @@
 
 struct commit;
 struct repository;
+struct commit_graph;
 
 struct bloom_filter_settings {
 	/*
@@ -68,6 +69,10 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+int load_bloom_filter_from_graph(struct commit_graph *g,
+				 struct bloom_filter *filter,
+				 uint32_t graph_pos);
+
 /*
  * Calculate the murmur3 32-bit hash value for the given data
  * using the given seed.
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v7 4/7] t/helper/test-read-graph: implement `bloom-filters` mode
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
                     ` (2 preceding siblings ...)
  2023-08-01 18:41   ` [PATCH v7 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Junio C Hamano, Jonathan Tan

From: Taylor Blau <me@ttaylorr.com>

Implement a mode of the "read-graph" test helper to dump out the
hexadecimal contents of the Bloom filter(s) contained in a commit-graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/helper/test-read-graph.c | 42 +++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index c664928412..da9ac8584d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -47,10 +47,32 @@ static void dump_graph_info(struct commit_graph *graph)
 	printf("\n");
 }
 
-int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
+static void dump_graph_bloom_filters(struct commit_graph *graph)
+{
+	uint32_t i;
+
+	for (i = 0; i < graph->num_commits + graph->num_commits_in_base; i++) {
+		struct bloom_filter filter = { 0 };
+		size_t j;
+
+		if (load_bloom_filter_from_graph(graph, &filter, i) < 0) {
+			fprintf(stderr, "missing Bloom filter for graph "
+				"position %"PRIu32"\n", i);
+			continue;
+		}
+
+		for (j = 0; j < filter.len; j++)
+			printf("%02x", filter.data[j]);
+		if (filter.len)
+			printf("\n");
+	}
+}
+
+int cmd__read_graph(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
 	struct object_directory *odb;
+	int ret = 0;
 
 	setup_git_directory();
 	odb = the_repository->objects->odb;
@@ -58,14 +80,24 @@ int cmd__read_graph(int argc UNUSED, const char **argv UNUSED)
 	prepare_repo_settings(the_repository);
 
 	graph = read_commit_graph_one(the_repository, odb);
-	if (!graph)
-		return 1;
+	if (!graph) {
+		ret = 1;
+		goto done;
+	}
 
-	dump_graph_info(graph);
+	if (argc <= 1)
+		dump_graph_info(graph);
+	else if (!strcmp(argv[1], "bloom-filters"))
+		dump_graph_bloom_filters(graph);
+	else {
+		fprintf(stderr, "unknown sub-command: '%s'\n", argv[1]);
+		ret = 1;
+	}
 
+done:
 	UNLEAK(graph);
 
-	return 0;
+	return ret;
 }
 
 
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v7 5/7] t4216: test changed path filters with high bit paths
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
                     ` (3 preceding siblings ...)
  2023-08-01 18:41   ` [PATCH v7 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4216-log-bloom.sh | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..2d4a3fefee 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,47 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_first_changed_path_filter () {
+	test-tool read-graph bloom-filters >filters.dat &&
+	head -n 1 filters.dat
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\302\242")
+
+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'setup check value of version 1 changed-path' '
+	(
+		cd highbit1 &&
+		echo "52a9" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual
+	)
+'
+
+# expect will not match actual if char is unsigned by default. Write the test
+# in this way, so that a user running this test script can still see if the two
+# files match. (It will appear as an ordinary success if they match, and a skip
+# if not.)
+if test_cmp highbit1/expect highbit1/actual
+then
+	test_set_prereq SIGNED_CHAR_BY_DEFAULT
+fi
+test_expect_success SIGNED_CHAR_BY_DEFAULT 'check value of version 1 changed-path' '
+	# Only the prereq matters for this test.
+	true
+'
+
+test_expect_success 'version 1 changed-path used when version 1 requested' '
+	(
+		cd highbit1 &&
+		test_bloom_filters_used "-- $CENT"
+	)
+'
+
 test_done
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v7 6/7] repo-settings: introduce commitgraph.changedPathsVersion
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
                     ` (4 preceding siblings ...)
  2023-08-01 18:41   ` [PATCH v7 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:41   ` [PATCH v7 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2023-08-01 18:44   ` [PATCH v7 0/7] Changed path filter hash fix and version bump Junio C Hamano
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version to
write (and read), a config variable is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is a recognized value.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt | 23 ++++++++++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..2dc9170622 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,23 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to commitGraph.changedPathsVersion=-1 if true, and
+	commitGraph.changedPathsVersion=0 if false. (If commitGraph.changedPathVersion
+	is also set, commitGraph.changedPathsVersion takes precedence.)
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be -1, 0 or 1.
++
+Defaults to -1.
++
+If -1, Git will use the version of the changed-path Bloom filters in the
+repository, defaulting to 1 if there are none.
++
+If 0, Git will not read any Bloom filters, and will write version 1 Bloom
+filters when instructed to write.
++
+If 1, Git will only read version 1 Bloom filters, and will write version 1
+Bloom filters.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index efc697e437..1f26c07de4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -401,7 +401,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 2992079dd9..325c0b991a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -19,7 +19,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 525f69c0c7..db8fe817f3 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int read_changed_paths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &read_changed_paths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     read_changed_paths ? -1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index 5f18486f64..f71154e12c 100644
--- a/repository.h
+++ b/repository.h
@@ -29,7 +29,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 	int command_requires_full_index;
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v7 7/7] commit-graph: new filter ver. that fixes murmur3
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
                     ` (5 preceding siblings ...)
  2023-08-01 18:41   ` [PATCH v7 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-08-01 18:41   ` Jonathan Tan
  2023-08-01 18:44   ` [PATCH v7 0/7] Changed path filter hash fix and version bump Junio C Hamano
  7 siblings, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Junio C Hamano, Taylor Blau

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 (thus,
the changed path filters wouldn't be readable by other off-the-shelf
implementatios of murmur3) and even with Git binaries that were compiled
with different signedness of char. This bug affects both how Git writes
changed path filters to disk and how Git interprets changed path filters
on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number; this version number may be explicitly
stated (typically if the user knows which version they need) or
automatically determined from the version of the existing changed path
filters in the repository.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/commitgraph.txt |  5 +-
 bloom.c                              | 69 +++++++++++++++++++-
 bloom.h                              |  8 ++-
 commit-graph.c                       | 33 ++++++++--
 t/helper/test-bloom.c                |  9 ++-
 t/t0095-bloom.sh                     |  8 +++
 t/t4216-log-bloom.sh                 | 96 ++++++++++++++++++++++++++++
 7 files changed, 214 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 2dc9170622..acc74a2f27 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -15,7 +15,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be -1, 0 or 1.
+	write. May be -1, 0, 1, or 2.
 +
 Defaults to -1.
 +
@@ -28,4 +28,7 @@ filters when instructed to write.
 If 1, Git will only read version 1 Bloom filters, and will write version 1
 Bloom filters.
 +
+If 2, Git will only read version 2 Bloom filters, and will write version 2
+Bloom filters.
++
 See linkgit:git-commit-graph[1] for more information.
diff --git a/bloom.c b/bloom.c
index 3e78cfe79d..ebef5cfd2f 100644
--- a/bloom.c
+++ b/bloom.c
@@ -66,7 +66,64 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -131,8 +188,14 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	uint32_t hash0, hash1;
+	if (settings->hash_version == 2) {
+		hash0 = murmur3_seeded_v2(seed0, data, len);
+		hash1 = murmur3_seeded_v2(seed1, data, len);
+	} else {
+		hash0 = murmur3_seeded_v1(seed0, data, len);
+		hash1 = murmur3_seeded_v1(seed1, data, len);
+	}
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index 1e4f612d2c..138d57a86b 100644
--- a/bloom.h
+++ b/bloom.h
@@ -8,9 +8,11 @@ struct commit_graph;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -80,7 +82,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index 1f26c07de4..ffbc86151e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -304,17 +304,26 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_context {
+	struct commit_graph *g;
+	int *commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_context *c = data;
+	struct commit_graph *g = c->g;
 	uint32_t hash_version;
-	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
-		return 0;
+	if (*c->commit_graph_changed_paths_version == -1) {
+		*c->commit_graph_changed_paths_version = hash_version;
+	} else if (hash_version != *c->commit_graph_changed_paths_version) {
+ 		return 0;
+	}
 
+	g->chunk_bloom_data = chunk_start;
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
 	g->bloom_filter_settings->hash_version = hash_version;
 	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
@@ -402,10 +411,14 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}
 
 	if (s->commit_graph_changed_paths_version) {
+		struct graph_read_bloom_data_context context = {
+			.g = graph,
+			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &context);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2371,6 +2384,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < -1
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2400,7 +2421,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index aabe31d724..3cbc0a5b50 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -64,7 +65,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 2d4a3fefee..775e59d864 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -447,4 +447,100 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
 	)
 '
 
+test_expect_success 'version 1 changed-path not used when version 2 requested' '
+	(
+		cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT"
+	)
+'
+
+test_expect_success 'version 1 changed-path used when autodetect requested' '
+	(
+		cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		test_bloom_filters_used "-- $CENT"
+	)
+'
+
+test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
+	test_commit -C highbit1 c1double "$CENT$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths &&
+	(
+		cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		echo "options: bloom(1,10,7) read_generation_data" >expect &&
+		test-tool read-graph >full &&
+		grep options full >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 2 changed-path' '
+	(
+		cd highbit2 &&
+		echo "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'version 2 changed-path used when version 2 requested' '
+	(
+		cd highbit2 &&
+		test_bloom_filters_used "-- $CENT"
+	)
+'
+
+test_expect_success 'version 2 changed-path not used when version 1 requested' '
+	(
+		cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT"
+	)
+'
+
+test_expect_success 'version 2 changed-path used when autodetect requested' '
+	(
+		cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		test_bloom_filters_used "-- $CENT"
+	)
+'
+
+test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
+	test_commit -C highbit2 c2double "$CENT$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths &&
+	(
+		cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion -1 &&
+		echo "options: bloom(2,10,7) read_generation_data" >expect &&
+		test-tool read-graph >full &&
+		grep options full >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'when writing commit graph, do not reuse changed-path of another version' '
+	git init doublewrite &&
+	test_commit -C doublewrite c "$CENT" &&
+	git -C doublewrite config --add commitgraph.changedPathsVersion 1 &&
+	git -C doublewrite commit-graph write --reachable --changed-paths &&
+	git -C doublewrite config --add commitgraph.changedPathsVersion 2 &&
+	git -C doublewrite commit-graph write --reachable --changed-paths &&
+	(
+		cd doublewrite &&
+		echo "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v7 0/7] Changed path filter hash fix and version bump
  2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
                     ` (6 preceding siblings ...)
  2023-08-01 18:41   ` [PATCH v7 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
@ 2023-08-01 18:44   ` Junio C Hamano
  2023-08-01 20:58     ` Taylor Blau
  7 siblings, 1 reply; 116+ messages in thread
From: Junio C Hamano @ 2023-08-01 18:44 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Taylor Blau

Jonathan Tan <jonathantanmy@google.com> writes:

> Taylor also suggested copying forward Bloom filters whenever possible
> in this patch set [3], but also that we could work on this outside this
> series [4]. I did not implement this in this series.

I think it is a good place to stop, as it would merely be a quality
of implementation difference and would not change the transition
story very much.

Thanks for working well together.  Will queue.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-08-01 18:08               ` Taylor Blau
@ 2023-08-01 18:52                 ` Jonathan Tan
  2023-08-03  0:01                 ` Taylor Blau
  1 sibling, 0 replies; 116+ messages in thread
From: Jonathan Tan @ 2023-08-01 18:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, Junio C Hamano, git, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:
> On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> > Suddenly reading many (or most) of the repo's trees would be a similar
> > surprise, right?
> 
> That's a good point. I think in general I'd expect Git to avoid
> recomputing Bloom filters where that work can be avoided, if the work in
> order to detect whether or not we need to recompute a filter is cheap
> enough to carry out.

Makes sense. I just don't think that there is a cheap way to detect if
a filter does not need to be recomputed (the closest way I think we have
is something that will require reading a lot of trees in a repo).

> > Also this would happen only if the server operator explicitly sets a
> > changed path filter version. If they leave it as-is, commit graphs will
> > still be written with the same version as the one on disk.
> 
> I think that I could live with that if the default is to leave things
> as-is.

Ah, thanks.
 
> I still think that it's worth it to have this functionality to propagate
> Bloom filters forward should ship in Git, but we can work on that
> outside of this series.

Makes sense.

> > Regarding consulting commit_graph->bloom_filter_settings->hash_version,
> > the issue I ran into was that firstly, I didn't know what to do about
> > commit_graph->base_graph (which also has its own bloom_filter_settings)
> > and what to do if it had a contradictory hash_version. And even if
> > we found a way to unify those, it is not true that every Bloom filter
> > in memory is of that version, since we may have generated some that
> > correspond to the version we're writing (not the version on disk).
> > In particular, the Bloom filters we write come from a commit slab
> > (bloom_filters in bloom.c) and in that slab, both Bloom filters from
> > disk and Bloom filters that were generated in-process coexist.
> 
> Would we ever want to have a commit-graph chain with mixed Bloom filter
> versions?

Probably not, but I wanted to be robust in case a third-party tool wrote a chain with
mixed versions.

> We avoid mixing generation number schemes across multiple layers of a
> commit-graph chain. But I don't see any reason that we should or need to
> have a similar restriction in place for the Bloom filter version. Both
> are readable, as long as the user-provided configuration allows them to
> be.

Yes, that's true - there is no inherent reason why we can't mix them,
unlike with generation numbers.

> We just have to interpret them differently depending on what layer of
> the graph (and therefore, what Bloom filter version they are) they come
> from.
> 
> Sorry for thinking aloud a little there. I think that this means that we
> at minimum have to keep in context the commit-graph layer we found the
> Bloom filter in so that we can tie that back to its Bloom filter
> version. That might just mean that we have to tag each Bloom filter with
> the version it was computed under, or maybe we already have the
> commit-graph layer in context, in which case we shouldn't need an
> additional field.
> 
> My gut is telling me that we probably *do* need such a field, since we
> don't often have a reference to the particular layer that we found a
> Bloom filter in, just the tip of the commit-graph chain that it came
> from.

We'll need the additional field because we don't know which commit graph
layer it comes from. In fact, we don't even know which *repo* the commit
comes from, since the commit slab is global. (Moving the slab to being
under a repo or under a commit graph layer would fix this.)

But I think there still remains the question of whether we really need
to support multiple versions in one Git invocation.

> > I also thought of your other proposal of augmenting struct bloom_filter
> > to also include the version. The issue I ran into there is if, for a
> > given commit, there already exists a Bloom filter read from disk with
> > the wrong version, what should we do? Overwrite it, or store both
> > versions in memory? (We can't just immediately output the Bloom filter
> > to disk and forget about the new version, only storing its size so that
> > we can generate the BIDX, because in the current code, generation and
> > writing to disk are separate. We could try to refactor it, but I didn't
> > want to make such a large change to reduce the possibility of bugs.)
> > Both storing the version number and storing an additional pointer for a
> > second version would increase memory consumption too, even when
> > supporting two versions isn't needed, but maybe this isn't a big deal.
> 
> It's likely that I'm missing something here, but what is stopping us
> from discarding the old Bloom filter as soon as we generate the new
> one? We shouldn't need to load the old filter again out of the commit
> slab, right?
> 
> Thanks,
> Taylor

I did not look at the code closely enough except to see that there was
a gap between generating the new-version Bloom filters and writing them
to disk, and I was concerned that now, or in the future, there would
be some code in that gap that reads the Bloom filter for a commit and
expects the old-version Bloom filter there.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v7 0/7] Changed path filter hash fix and version bump
  2023-08-01 18:44   ` [PATCH v7 0/7] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-08-01 20:58     ` Taylor Blau
  2023-08-01 21:07       ` Junio C Hamano
  0 siblings, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-08-01 20:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Tue, Aug 01, 2023 at 11:44:06AM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > Taylor also suggested copying forward Bloom filters whenever possible
> > in this patch set [3], but also that we could work on this outside this
> > series [4]. I did not implement this in this series.
>
> I think it is a good place to stop, as it would merely be a quality
> of implementation difference and would not change the transition
> story very much.
>
> Thanks for working well together.  Will queue.

Thanks. I read through this version and feel good about the results.
I agree that queuing this one down makes sense.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v7 0/7] Changed path filter hash fix and version bump
  2023-08-01 20:58     ` Taylor Blau
@ 2023-08-01 21:07       ` Junio C Hamano
  0 siblings, 0 replies; 116+ messages in thread
From: Junio C Hamano @ 2023-08-01 21:07 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git

Taylor Blau <me@ttaylorr.com> writes:

> On Tue, Aug 01, 2023 at 11:44:06AM -0700, Junio C Hamano wrote:
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>> > Taylor also suggested copying forward Bloom filters whenever possible
>> > in this patch set [3], but also that we could work on this outside this
>> > series [4]. I did not implement this in this series.
>>
>> I think it is a good place to stop, as it would merely be a quality
>> of implementation difference and would not change the transition
>> story very much.
>>
>> Thanks for working well together.  Will queue.
>
> Thanks. I read through this version and feel good about the results.
> I agree that queuing this one down makes sense.

Thanks.

ps. no more comms from me for the rest of the day as I am feeling
ill.  I've pushed out today's integration result already, merging
three or so topics down in 'next' and also absorbing a few new
topics in 'seen'.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-08-01 18:08               ` Taylor Blau
  2023-08-01 18:52                 ` Jonathan Tan
@ 2023-08-03  0:01                 ` Taylor Blau
  2023-08-03 13:18                   ` Derrick Stolee
  1 sibling, 1 reply; 116+ messages in thread
From: Taylor Blau @ 2023-08-03  0:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, Derrick Stolee

On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote:
> On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> > > > The intention in the current patch set was to not load it at all when we
> > > > have incompatible Bloom settings because it appeared quite troublesome
> > > > to notate which Bloom filter in memory is of which version. If we want
> > > > to copy forward existing results, we can change that, but I don't know
> > > > whether it's worth doing that (and if we think it's worth doing, this
> > > > should probably go in another patch set).
> > >
> > > Yeah, I think having Bloom filters accessible from a commit-graph
> > > regardless of whether or not it matches our Bloom filter version is
> > > prerequisite to being able to implement something like this.
> > >
> > > I feel like this is important enough to do in the same patch set, or the
> > > same release to avoid surprising operators when their commit-graph write
> > > suddenly recomputes all of its Bloom filters.
> >
> > Suddenly reading many (or most) of the repo's trees would be a similar
> > surprise, right?
>
> That's a good point. I think in general I'd expect Git to avoid
> recomputing Bloom filters where that work can be avoided, if the work in
> order to detect whether or not we need to recompute a filter is cheap
> enough to carry out.

I spent some time implementing this (patches are available in the branch
'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom
filter versions is kind of tricky, but do-able without having to
implement too much on top of what's already there.

But I don't think that it's enough to say that we can reuse a commit's
Bloom filter iff that commit's tree has no paths with characters >=
0x80. Suppose we have such a tree, whose Bloom filter we believe to be
reusable. If its first parent *does* have such a path, then that path
would appear as a deletion relative to its first parent. So that path
*would* be in the filter, meaning that it isn't reusable.

So I think the revised condition is something like: a commit's Bloom
filter is reusable when there are no paths with characters >= 0x80 in
a tree-diff against its first parent. I think that ensuring that there
are no such paths in both a commit's root tree, as well as its first
parent's root tree is equivalent, since that removes the possibility of
such a path showing up in its tree-diff.

As long as we aren't generating a commit-graph with --stdin-packs, then
we process commits in generation order, so we will see parents before
their children. I think we could reuse existing filters in that case,
but the condition is slightly more complex than I originally thought.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-08-03  0:01                 ` Taylor Blau
@ 2023-08-03 13:18                   ` Derrick Stolee
  2023-08-03 18:45                     ` Taylor Blau
  0 siblings, 1 reply; 116+ messages in thread
From: Derrick Stolee @ 2023-08-03 13:18 UTC (permalink / raw)
  To: Taylor Blau, Jonathan Tan; +Cc: Junio C Hamano, git

On 8/2/2023 8:01 PM, Taylor Blau wrote:
> On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote:

>> That's a good point. I think in general I'd expect Git to avoid
>> recomputing Bloom filters where that work can be avoided, if the work in
>> order to detect whether or not we need to recompute a filter is cheap
>> enough to carry out.
> 
> I spent some time implementing this (patches are available in the branch
> 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom
> filter versions is kind of tricky, but do-able without having to
> implement too much on top of what's already there.
> 
> But I don't think that it's enough to say that we can reuse a commit's
> Bloom filter iff that commit's tree has no paths with characters >=
> 0x80. Suppose we have such a tree, whose Bloom filter we believe to be
> reusable. If its first parent *does* have such a path, then that path
> would appear as a deletion relative to its first parent. So that path
> *would* be in the filter, meaning that it isn't reusable.
> 
> So I think the revised condition is something like: a commit's Bloom
> filter is reusable when there are no paths with characters >= 0x80 in
> a tree-diff against its first parent. I think that ensuring that there
> are no such paths in both a commit's root tree, as well as its first
> parent's root tree is equivalent, since that removes the possibility of
> such a path showing up in its tree-diff.

This condition is exactly "we computed the diff to know which paths were
input to the filter" which is as difficult as recomputing the Bloom filter
from scratch. I don't think there is much room to gain a performance
improvement here.
 
Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v6 0/7] Changed path filter hash fix and version bump
  2023-08-03 13:18                   ` Derrick Stolee
@ 2023-08-03 18:45                     ` Taylor Blau
  0 siblings, 0 replies; 116+ messages in thread
From: Taylor Blau @ 2023-08-03 18:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, Junio C Hamano, git

On Thu, Aug 03, 2023 at 09:18:11AM -0400, Derrick Stolee wrote:
> > So I think the revised condition is something like: a commit's Bloom
> > filter is reusable when there are no paths with characters >= 0x80 in
> > a tree-diff against its first parent. I think that ensuring that there
> > are no such paths in both a commit's root tree, as well as its first
> > parent's root tree is equivalent, since that removes the possibility of
> > such a path showing up in its tree-diff.
>
> This condition is exactly "we computed the diff to know which paths were
> input to the filter" which is as difficult as recomputing the Bloom filter
> from scratch. I don't think there is much room to gain a performance
> improvement here.

I think that's true in the worst case, and certainly for repositories
with many tree entries which have characters >= 0x80.

But note that it's a heuristic in one direction only. If we know that a
commit's root tree (and that of it's first parent, if it has one) is
free of any such paths, then it's impossible for the first parent
tree-diff to contain such an entry, and therefore we can reuse any
existing filter.

Of course, a commit's root tree (and its parent) may both have a path
whose characters >= 0x80 while still not seeing a corresponding entry
show up in the tree-diff if that path is unchanged between a commit and
its first parent.

I think if we were looking at every tree every time only to realize that
we have to go back and compute its changed-path Bloom filter, this would
be a non-starter. But since we "cache" the results of our walk via the
tree object's flags bits, we can skip looking at many trees.

In my testing, this showed a significant improvement on linux.git and
git.git. My setup for testing is something like:

    $ git clone git@github.com:torvalds/linux.git
    $ cd linux
    $ git commit-graph write --reachable --changed-paths
    $ graph=".git/objects/info/commit-graph"
    $ mv $graph{,.bak}

so that .git/objects/info/commit-graph.bak is a commit-graph with v1
changed-path Bloom filters for all commits in generation order.

With that in place, I can do:

    $ git config commitGraph.changedPathsVersion 2
    $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \
        'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths'

, producing the following results on linux.git:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git, writing a new commit-graph after upgrading from v1 to v2
filters went from taking 4.163 seconds to 3.348 seconds, for a more
modest 1.24x speed-up.

Of course, all of this depends on how much of the tree meets the above
conditions, so we'd expect worse results on repositories with paths that
contain characters >= 0x80. I think we'd want some kind of mechanism
(probably via config, not a GIT_TEST environment variable) to control
whether or not to upgrade existing filters.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2023-08-03 18:47 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
2023-05-23 13:00   ` Derrick Stolee
2023-05-23 23:00     ` Jonathan Tan
2023-05-23 23:51     ` Junio C Hamano
2023-05-24 21:26       ` Jonathan Tan
2023-05-26 13:19         ` Derrick Stolee
2023-05-30 17:26           ` Jonathan Tan
2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-06-03  1:01   ` [PATCH v2 0/3] Changed path filter hash fix and version bump Junio C Hamano
2023-06-03  2:24     ` Junio C Hamano
2023-06-07 16:30       ` Jonathan Tan
2023-06-07 21:37         ` Jonathan Tan
2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
2023-06-08 19:21   ` [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-06-08 19:52     ` Ramsay Jones
2023-06-12 21:26       ` Junio C Hamano
2023-06-08 19:21   ` [PATCH v3 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
2023-06-08 19:21   ` [PATCH v3 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-06-08 19:21   ` [PATCH v3 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-06-08 19:50   ` [PATCH v3 0/4] Changed path filter hash fix and version bump Ramsay Jones
2023-06-09  0:08     ` Jonathan Tan
2023-06-12 21:31     ` Junio C Hamano
2023-06-13 17:16       ` Jonathan Tan
2023-06-13 17:29         ` [PATCH] CodingGuidelines: use octal escapes, not hex Jonathan Tan
2023-06-13 18:16           ` Eric Sunshine
2023-06-13 18:43             ` Jonathan Tan
2023-06-13 19:15               ` Eric Sunshine
2023-06-13 19:29                 ` Junio C Hamano
2023-06-13 19:16         ` [PATCH v3 0/4] Changed path filter hash fix and version bump Junio C Hamano
2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
2023-06-13 17:39   ` [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-06-13 21:58     ` Junio C Hamano
2023-06-20 13:22       ` Derrick Stolee
2023-06-21 12:08       ` Taylor Blau
2023-06-22 22:26         ` Jonathan Tan
2023-06-23 13:05           ` Derrick Stolee
2023-06-13 17:39   ` [PATCH v4 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
2023-06-13 17:39   ` [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-06-20 13:28     ` Derrick Stolee
2023-06-21 12:14     ` Taylor Blau
2023-06-13 17:39   ` [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-06-20 13:39     ` Derrick Stolee
2023-06-20 18:37       ` Junio C Hamano
2023-06-13 19:21   ` [PATCH v4 0/4] Changed path filter hash fix and version bump Junio C Hamano
2023-06-20 13:43     ` Derrick Stolee
2023-06-20 21:56       ` Jonathan Tan
2023-06-21 12:19       ` Taylor Blau
2023-06-21 17:53         ` Derrick Stolee
2023-06-22 22:27           ` Jonathan Tan
2023-06-23 13:18             ` Derrick Stolee
2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
2023-07-13 21:42   ` [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-07-19 17:25     ` Taylor Blau
2023-07-20 20:20       ` Jonathan Tan
2023-07-21  1:38         ` Taylor Blau
2023-07-13 21:42   ` [PATCH v5 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
2023-07-13 22:50     ` Junio C Hamano
2023-07-19 17:27       ` Taylor Blau
2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
2023-07-19 17:55           ` [PATCH 1/4] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2023-07-19 17:55           ` [PATCH 2/4] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2023-07-19 17:55           ` [PATCH 3/4] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2023-07-19 17:55           ` [PATCH 4/4] fixup! t4216: test changed path filters with high bit paths Taylor Blau
2023-07-19 19:24           ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Junio C Hamano
2023-07-20 20:22           ` Jonathan Tan
2023-07-13 21:42   ` [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-07-19 18:10     ` Taylor Blau
2023-07-20 20:42       ` Jonathan Tan
2023-07-20 21:02         ` Taylor Blau
2023-07-13 21:42   ` [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-07-19 18:24     ` Taylor Blau
2023-07-20 21:27       ` Jonathan Tan
2023-07-26 23:32         ` Taylor Blau
2023-07-13 22:16   ` [PATCH v5 0/4] Changed path filter hash fix and version bump Junio C Hamano
2023-07-13 22:59     ` Junio C Hamano
2023-07-14 18:48     ` Jonathan Tan
2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
2023-07-26 23:26     ` Taylor Blau
2023-07-20 21:46   ` [PATCH v6 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
2023-07-26 23:28     ` Taylor Blau
2023-07-20 21:46   ` [PATCH v6 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-07-25 20:52   ` [PATCH v6 0/7] Changed path filter hash fix and version bump Junio C Hamano
2023-07-26 20:39   ` Junio C Hamano
2023-07-27  0:17     ` Taylor Blau
2023-07-27  0:49       ` Junio C Hamano
2023-07-27 17:39         ` Jonathan Tan
2023-07-27 17:56           ` Taylor Blau
2023-07-27 20:53             ` Jonathan Tan
2023-08-01 18:08               ` Taylor Blau
2023-08-01 18:52                 ` Jonathan Tan
2023-08-03  0:01                 ` Taylor Blau
2023-08-03 13:18                   ` Derrick Stolee
2023-08-03 18:45                     ` Taylor Blau
2023-07-27 18:44         ` Junio C Hamano
2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-08-01 18:44   ` [PATCH v7 0/7] Changed path filter hash fix and version bump Junio C Hamano
2023-08-01 20:58     ` Taylor Blau
2023-08-01 21:07       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).