Git Mailing List Archive mirror
 help / color / mirror / code / Atom feed
* [PATCH 0/2] Changed path filter hash fix and version bump
@ 2023-05-22 21:48 Jonathan Tan
  2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, me

Following the conversation in [1], here are patches to fix the murmur3
hash function used in creating (and interpreting) changed path filters,
and also to bump the version number to 2.

This is I think the simplest way to do this (invalidating all existing
changed path filters). The resource-consuming part of creating a changed
path filter is in computing the changed paths (thus, reading trees and
calculating changes), and to check if a changed path filter could be
reused, one would need to compute the changed paths anyway in order to
determine if any of them have high-bit strings, so I did not pursue this
further. Server operators might be able to reuse changed path filters
if, for example, they have a more efficient way to determine that no
paths in a repo have the high bit set, but I think that this is out of
scope for the Git project.

In patch 2, I couldn't figure out how to make Bash pass high-bit strings
as a CLI argument for some reason, so I hardcoded the string I wanted
in the test helper instead. If anyone knows how to pass such strings,
please let me know.

[1] https://lore.kernel.org/git/20230511224101.972442-1-jonathantanmy@google.com/

Jonathan Tan (2):
  t4216: test wrong bloom filter version rejection
  commit-graph: fix murmur3, bump filter ver. to 2

 bloom.c               | 14 +++++++-------
 bloom.h               |  9 ++++++---
 commit-graph.c        |  4 ++--
 t/helper/test-bloom.c |  7 +++++++
 t/t0095-bloom.sh      |  8 ++++++++
 t/t4216-log-bloom.sh  | 36 +++++++++++++++++++++++++++++++++---
 6 files changed, 63 insertions(+), 15 deletions(-)

-- 
2.40.1.698.g37aff9b760-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] t4216: test wrong bloom filter version rejection
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
@ 2023-05-22 21:48 ` Jonathan Tan
  2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, me

Add a test that checks that Git does not make use of changed path
filters that have an unrecognized version.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t4216-log-bloom.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..f14cc1c1f1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -85,6 +85,36 @@ test_bloom_filters_not_used () {
 	test_cmp log_wo_bloom log_w_bloom
 }
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+test_expect_success 'incompatible bloom filter versions are not used' '
+	cp .git/objects/info/commit-graph old-commit-graph &&
+	test_when_finished "mv old-commit-graph .git/objects/info/commit-graph" &&
+
+	BDAT_OFFSET=$(get_bdat_offset) &&
+
+	# Write an arbitrary number to the least significant byte of the
+	# version field in the BDAT chunk
+	cat old-commit-graph >new-commit-graph &&
+	printf "\aa" |
+		dd of=new-commit-graph bs=1 count=1 \
+			seek=$((BDAT_OFFSET + 3)) conv=notrunc &&
+	mv new-commit-graph .git/objects/info/commit-graph &&
+	test_bloom_filters_not_used "-- A" &&
+
+	# But the correct version number works
+	cat old-commit-graph >new-commit-graph &&
+	printf "\01" |
+		dd of=new-commit-graph bs=1 count=1 \
+			seek=$((BDAT_OFFSET + 3)) conv=notrunc &&
+	mv new-commit-graph .git/objects/info/commit-graph &&
+	test_bloom_filters_used "-- A"
+'
+
 for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5 file5_renamed file_to_be_deleted
 do
 	for option in "" \
-- 
2.40.1.698.g37aff9b760-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
  2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
@ 2023-05-22 21:48 ` Jonathan Tan
  2023-05-23 13:00   ` Derrick Stolee
  2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
  3 siblings, 1 reply; 14+ messages in thread
From: Jonathan Tan @ 2023-05-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, me

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, fix this bug. And because changed path filters on disk might
no longer be compatible, teach Git to write "2" as the version when
writing changed path filters (instead of "1" currently), and only accept
"2" as the version when reading them (instead of "1" currently).

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter. So this
patch does not include any mechanism to "salvage" changed path filters
from repositories.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in the test was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 bloom.c               | 14 +++++++-------
 bloom.h               |  9 ++++++---
 commit-graph.c        |  4 ++--
 t/helper/test-bloom.c |  7 +++++++
 t/t0095-bloom.sh      |  8 ++++++++
 t/t4216-log-bloom.sh  |  8 ++++----
 6 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index aef6b5fea2..fec243b2f1 100644
--- a/bloom.c
+++ b/bloom.c
@@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 
 	uint32_t k;
 	for (i = 0; i < len4; i++) {
-		uint32_t byte1 = (uint32_t)data[4*i];
-		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
-		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
-		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
 		k = byte1 | byte2 | byte3 | byte4;
 		k *= c1;
 		k = rotate_left(k, r1);
@@ -99,13 +99,13 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 
 	switch (len & (sizeof(uint32_t) - 1)) {
 	case 3:
-		k1 ^= ((uint32_t)tail[2]) << 16;
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
 		/*-fallthrough*/
 	case 2:
-		k1 ^= ((uint32_t)tail[1]) << 8;
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
 		/*-fallthrough*/
 	case 1:
-		k1 ^= ((uint32_t)tail[0]) << 0;
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
 		k1 *= c1;
 		k1 = rotate_left(k1, r1);
 		k1 *= c2;
diff --git a/bloom.h b/bloom.h
index adde6dfe21..8526fa948c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * We currently only support version = 2 which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -38,8 +40,9 @@ struct bloom_filter_settings {
 	uint32_t max_changed_paths;
 };
 
+#define BLOOM_HASH_VERSION 2
 #define DEFAULT_BLOOM_MAX_CHANGES 512
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
+#define DEFAULT_BLOOM_FILTER_SETTINGS { BLOOM_HASH_VERSION, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index 843bdb458d..2eb9b781f4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
+	if (hash_version != BLOOM_HASH_VERSION)
 		return 0;
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
@@ -2402,7 +2402,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index aabe31d724..5624e4d2db 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -68,6 +69,12 @@ int cmd__bloom(int argc, const char **argv)
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
 	if (!strcmp(argv[1], "generate_filter")) {
 		struct bloom_filter filter;
 		int i = 2;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f14cc1c1f1..7a193aa143 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@ graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7) read_generation_data
+	options: bloom(2,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
@@ -108,7 +108,7 @@ test_expect_success 'incompatible bloom filter versions are not used' '
 
 	# But the correct version number works
 	cat old-commit-graph >new-commit-graph &&
-	printf "\01" |
+	printf "\02" |
 		dd of=new-commit-graph bs=1 count=1 \
 			seek=$((BDAT_OFFSET + 3)) conv=notrunc &&
 	mv new-commit-graph .git/objects/info/commit-graph &&
@@ -209,10 +209,10 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
+	grep "{\"hash_version\":2,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
+	grep "{\"hash_version\":2,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
 '
 
 test_max_changed_paths () {
-- 
2.40.1.698.g37aff9b760-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
  2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
  2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
@ 2023-05-23  4:42 ` Junio C Hamano
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
  3 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2023-05-23  4:42 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, me

Jonathan Tan <jonathantanmy@google.com> writes:

> Following the conversation in [1], here are patches to fix the murmur3
> hash function used in creating (and interpreting) changed path filters,
> and also to bump the version number to 2.

Wonderful.  Thanks for a quick update.  Will take a look when I come
back to the keyboard (I'm on half-vacation right now).

>
> This is I think the simplest way to do this (invalidating all existing
> changed path filters). The resource-consuming part of creating a changed
> path filter is in computing the changed paths (thus, reading trees and
> calculating changes), and to check if a changed path filter could be
> reused, one would need to compute the changed paths anyway in order to
> determine if any of them have high-bit strings, so I did not pursue this
> further. Server operators might be able to reuse changed path filters
> if, for example, they have a more efficient way to determine that no
> paths in a repo have the high bit set, but I think that this is out of
> scope for the Git project.
>
> In patch 2, I couldn't figure out how to make Bash pass high-bit strings
> as a CLI argument for some reason, so I hardcoded the string I wanted
> in the test helper instead. If anyone knows how to pass such strings,
> please let me know.
>
> [1] https://lore.kernel.org/git/20230511224101.972442-1-jonathantanmy@google.com/
>
> Jonathan Tan (2):
>   t4216: test wrong bloom filter version rejection
>   commit-graph: fix murmur3, bump filter ver. to 2
>
>  bloom.c               | 14 +++++++-------
>  bloom.h               |  9 ++++++---
>  commit-graph.c        |  4 ++--
>  t/helper/test-bloom.c |  7 +++++++
>  t/t0095-bloom.sh      |  8 ++++++++
>  t/t4216-log-bloom.sh  | 36 +++++++++++++++++++++++++++++++++---
>  6 files changed, 63 insertions(+), 15 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
@ 2023-05-23 13:00   ` Derrick Stolee
  2023-05-23 23:00     ` Jonathan Tan
  2023-05-23 23:51     ` Junio C Hamano
  0 siblings, 2 replies; 14+ messages in thread
From: Derrick Stolee @ 2023-05-23 13:00 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: me

On 5/22/2023 5:48 PM, Jonathan Tan wrote:
> The murmur3 implementation in bloom.c has a bug when converting series
> of 4 bytes into network-order integers when char is signed (which is
> controllable by a compiler option, and the default signedness of char is
> platform-specific). When a string contains characters with the high bit
> set, this bug causes results that, although internally consistent within
> Git, does not accord with other implementations of murmur3 and even with
> Git binaries that were compiled with different signedness of char. This
> bug affects both how Git writes changed path filters to disk and how Git
> interprets changed path filters on disk.
> 
> Therefore, fix this bug. And because changed path filters on disk might
> no longer be compatible, teach Git to write "2" as the version when
> writing changed path filters (instead of "1" currently), and only accept
> "2" as the version when reading them (instead of "1" currently).

I appreciate that you discovered and are presenting a way out of this
problem, however the current approach does not preserve compatibility
enough.

> @@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
>  
>  	uint32_t k;
>  	for (i = 0; i < len4; i++) {
> -		uint32_t byte1 = (uint32_t)data[4*i];
> -		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
> -		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
> -		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
> +		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
> +		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
> +		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
> +		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
>  		k = byte1 | byte2 | byte3 | byte4;
>  		k *= c1;
>  		k = rotate_left(k, r1);

By changing this algorithm directly (instead of making an "unsigned" version,
or renaming this one to the "maybe signed" version), you are making it
impossible for us to ship a version that can read version 1 Bloom filters,
so all read-only history operations will immediately slow down (because they
will ignore v1 chunks, better than incorrectly parsing v1 chunks).

> @@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
>  	g->chunk_bloom_data = chunk_start;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (hash_version != 1)
> +	if (hash_version != BLOOM_HASH_VERSION)
>  		return 0;
  
Here's where we would ignore v1 filters, instead of continuing to read them
(with all the risks involved).

In order for this to be something we can ship safely to environments that depend
on changed-path Bloom filters, we need to be able to parse v1 filters. It would
be even better if we didn't write v2 filters by default, but instead hid it
behind a config option that is off by default for at least one major release.

I notice that you didn't update the commit-graph format docs, which seems like
a valuable place to describe the new version number, as well as any plans to
completely deprecate v1. For instance, describing the v1 implementation as
having inconsistent application of murmur3 is a valuable thing to have, but
then describe the plans for deprecating it as an unsafe format.

Here is a potential plan to consider:

 1. v2.42.0 includes writing v2 format, off by default.
 2. v2.43.0 writes v2 format by default.
 3. v2.44.0 no longer parses v1 format (ignored without error).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-23 13:00   ` Derrick Stolee
@ 2023-05-23 23:00     ` Jonathan Tan
  2023-05-23 23:51     ` Junio C Hamano
  1 sibling, 0 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-23 23:00 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, git, me

Derrick Stolee <derrickstolee@github.com> writes:
> I notice that you didn't update the commit-graph format docs,

Ah, thanks for the reminder.

> Here is a potential plan to consider:
> 
>  1. v2.42.0 includes writing v2 format, off by default.
>  2. v2.43.0 writes v2 format by default.
>  3. v2.44.0 no longer parses v1 format (ignored without error).

First of all, thanks for your comments on the migration process - that
is indeed the most complicated part of this.

The code change to support 2 versions seems not too hard (duplicate
murmur3_seeded() and modify one so that we have one version 1 and
one version 2, and teach fill_bloom_key() to call the appropriate one
based on the struct bloom_filter_settings) but this requires both the
code author and reviewer(s) to check that we don't have cases in which
we read or write one version when we're supposed to do it with the
other. And the benefit of doing this seems to just be giving server
administrators an opportunity to perform the migration at a more relaxed
pace, which I think there are other ways to accomplish if we really
wanted to do this, so I wanted to avoid having 2 versions in the Git
codebase.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-23 13:00   ` Derrick Stolee
  2023-05-23 23:00     ` Jonathan Tan
@ 2023-05-23 23:51     ` Junio C Hamano
  2023-05-24 21:26       ` Jonathan Tan
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2023-05-23 23:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, git, me

Derrick Stolee <derrickstolee@github.com> writes:

> I appreciate that you discovered and are presenting a way out of this
> problem, however the current approach does not preserve compatibility
> enough.
> ...
> By changing this algorithm directly (instead of making an "unsigned" version,
> or renaming this one to the "maybe signed" version), you are making it
> impossible for us to ship a version that can read version 1 Bloom filters,
> so all read-only history operations will immediately slow down (because they
> will ignore v1 chunks, better than incorrectly parsing v1 chunks).
>
> Here's where we would ignore v1 filters, instead of continuing to read them
> (with all the risks involved).

I do not know the "all the risks involved" comment.  Is the risk
something we can mitigate by still reading v1 data but be careful
about when not to apply the filters?

I may be misremembering the original discussion, but wasn't the
conclusion that v1 data is salvageable in the sense that it can
still reliably say that, given a pathname without bytes with
high-bit set, it cannot possibly belong to the set of changed paths,
even though, because the filter is contaminated with "signed" data,
its false-positive rate may be higher than using "unsigned" version?
And based on that observation, we can still read v1 data but only
use the Bloom filters when the queried paths have no byte with
high-bit set.

Also if we are operating in such an environment then would it be
possible to first compute as if we were going to generate v2 data,
but write it as v1 after reading all the path and realizing there
are no problematic paths?  IOW, even if the version of Git is
capable of writing and reading v2, it does not have to use v2,
right?  To put it the other way around, we will have to and can keep
supporting data that is labeled as v1, no?

> In order for this to be something we can ship safely to environments that depend
> on changed-path Bloom filters, we need to be able to parse v1 filters. It would
> be even better if we didn't write v2 filters by default, but instead hid it
> behind a config option that is off by default for at least one major release.

Is the concern that we will double the chunk size because both v1
and v2 will be written?  Or is it that we will stop writing v1 if we
start writing v2 and switching too early will mean the repositories
will become slower for older implementations that haven't died out?

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-23 23:51     ` Junio C Hamano
@ 2023-05-24 21:26       ` Jonathan Tan
  2023-05-26 13:19         ` Derrick Stolee
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Tan @ 2023-05-24 21:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Derrick Stolee, git, me

Junio C Hamano <gitster@pobox.com> writes:
> I may be misremembering the original discussion, but wasn't the
> conclusion that v1 data is salvageable in the sense that it can
> still reliably say that, given a pathname without bytes with
> high-bit set, it cannot possibly belong to the set of changed paths,
> even though, because the filter is contaminated with "signed" data,
> its false-positive rate may be higher than using "unsigned" version?
> And based on that observation, we can still read v1 data but only
> use the Bloom filters when the queried paths have no byte with
> high-bit set.

There are at least 3 ways of salvaging the data that we've discussed:

- Enumerating all of a repo's paths and if none of them have a high bit,
  retain the existing filters.
- Walking all of a repo's trees (so that we know which tree corresponds
  to which commit) and if for a commit, all its trees have no high bit,
  retain the filter for that tree (otherwise recompute it).
- Keep using a version 1 filter but only when the sought-for path has no
  high bit (as you describe here).

(The first 2 is my interpretation of what Taylor described [1].)

I'm not sure if we want to keep version 1 filters around at all -
this would work with Git as long as it is not compiled with different
signedness of char, but would not work with other implementations of
Git (unless they replicate the hashing bug). There is also the issue of
how we're going to indicate that in a commit graph file, some filters
are version 1 and some filters are version 2 (unless the plan is to
completely rewrite the filters in this case, but then we'll run into
the issue that computing these filters en-masse is expensive, as Taylor
describes also in [1]).

> Also if we are operating in such an environment then would it be
> possible to first compute as if we were going to generate v2 data,
> but write it as v1 after reading all the path and realizing there
> are no problematic paths?  

I think in this case, we would want to write it as v2 anyway, because
there's no way to distinguish a v1 that has high bits and is written
incorrectly versus a v1 that happens to have no high bits and therefore
is identical under v2.

> IOW, even if the version of Git is
> capable of writing and reading v2, it does not have to use v2,
> right?  To put it the other way around, we will have to and can keep
> supporting data that is labeled as v1, no?

I think this is the main point - whether we want to continue supporting
data labeled as v1. I personally think that we should migrate to v2
as quickly as possible. But if the consensus is that we should support
both, at least for a few releases of Git, I'll go with that.

[1] https://lore.kernel.org/git/ZF116EDcmAy7XEbC@nand.local/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-24 21:26       ` Jonathan Tan
@ 2023-05-26 13:19         ` Derrick Stolee
  2023-05-30 17:26           ` Jonathan Tan
  0 siblings, 1 reply; 14+ messages in thread
From: Derrick Stolee @ 2023-05-26 13:19 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, me

On 5/24/2023 5:26 PM, Jonathan Tan wrote:
> Junio C Hamano <gitster@pobox.com> writes:

>> IOW, even if the version of Git is
>> capable of writing and reading v2, it does not have to use v2,
>> right?  To put it the other way around, we will have to and can keep
>> supporting data that is labeled as v1, no?
> 
> I think this is the main point - whether we want to continue supporting
> data labeled as v1. I personally think that we should migrate to v2
> as quickly as possible. But if the consensus is that we should support
> both, at least for a few releases of Git, I'll go with that.

I agree on migrating quickly as possible, within basic safety guidelines.

Shipping a Git change that suddenly is unable to use on-disk data that
it has previously relied upon is not a safe change. And that is the
absolute minimum amount of safety required. The other side is to not
make a Git change that suddenly changes the on-disk format without a
switch to disable it.

Think about it this way: if there was a bug in the code, could we
safely roll it back? If we are immediately writing v2 filters after
the deployment, then rolling it back will cause the previous version
to not recognize those filters, leading to a delayed recovery.

I'd be willing to modify my suggested steps:

>>> 1. v2.42.0 includes writing v2 format, off by default.
>>> 2. v2.43.0 writes v2 format by default.
>>> 3. v2.44.0 no longer parses v1 format (ignored without error).

to something simpler:

 1. v2.42.0 writes v2 format by default, but can be disabled by config.
 2. v2.43.0 no longer parses or writes v1 format.

With this, we could proactively set the config value that disables the
v2 format in our production environment, then slowly re-enable that
config after the binaries have deployed. This allows us to limit the
blast radius if something goes wrong, which is really important.

Further, I'm describing an environment where we control all of the Git
versions that are interacting with the repositories. Other environments
don't have that luxury, such as typical client users.

Even the three-version plan is an accelerated deprecation plan, based
on previous examples in Git.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2
  2023-05-26 13:19         ` Derrick Stolee
@ 2023-05-30 17:26           ` Jonathan Tan
  0 siblings, 0 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-30 17:26 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jonathan Tan, Junio C Hamano, git, me

Derrick Stolee <derrickstolee@github.com> writes:
> I'd be willing to modify my suggested steps:
> 
> >>> 1. v2.42.0 includes writing v2 format, off by default.
> >>> 2. v2.43.0 writes v2 format by default.
> >>> 3. v2.44.0 no longer parses v1 format (ignored without error).
> 
> to something simpler:
> 
>  1. v2.42.0 writes v2 format by default, but can be disabled by config.
>  2. v2.43.0 no longer parses or writes v1 format.
> 
> With this, we could proactively set the config value that disables the
> v2 format in our production environment, then slowly re-enable that
> config after the binaries have deployed. This allows us to limit the
> blast radius if something goes wrong, which is really important.
> 
> Further, I'm describing an environment where we control all of the Git
> versions that are interacting with the repositories. Other environments
> don't have that luxury, such as typical client users.
> 
> Even the three-version plan is an accelerated deprecation plan, based
> on previous examples in Git.
> 
> Thanks,
> -Stolee

OK, let me take a look and see what this (having at least a version
of Git that supports both versions of hash functions) would look like.
If we're going to have this, we might as well roll it out as safely
as possible, so I'll aim for your original step 1 of 3 (write v2, off
by default).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 0/3] Changed path filter hash fix and version bump
  2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
                   ` (2 preceding siblings ...)
  2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
@ 2023-05-31 23:12 ` Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
                     ` (2 more replies)
  3 siblings, 3 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Here's a new version. With this, Git can function with both version
1 (incorrect murmur3) and version 2 (correct murmur3) changed path
filters, but not at the same time: the user can set a config variable to
choose which one, and Git will ignore existing changed path filters of
the wrong version (and always write the version that the config variable
says).

In patch 1, the test assumes that char is signed. I'm not sure if it's
worth asserting on the contents of the filter, since it depends on
whether char is signed, but I've included it anyway (since it's easy
to remove).

Jonathan Tan (3):
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt | 16 +++++--
 bloom.c                              | 65 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 29 ++++++++++---
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 ++-
 repository.h                         |  2 +-
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 65 ++++++++++++++++++++++++++++
 10 files changed, 192 insertions(+), 18 deletions(-)

Range-diff against v1:
1:  3a5d53d3c0 < -:  ---------- t4216: test wrong bloom filter version rejection
2:  5a91f9682b < -:  ---------- commit-graph: fix murmur3, bump filter ver. to 2
-:  ---------- > 1:  175dc912fe t4216: test changed path filters with high bit paths
-:  ---------- > 2:  4a7553f3fb repo-settings: introduce commitgraph.changedPathsVersion
-:  ---------- > 3:  f5c3f6080a commit-graph: new filter ver. that fixes murmur3
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] t4216: test changed path filters with high bit paths
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
@ 2023-05-31 23:12   ` Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2 siblings, 0 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t4216-log-bloom.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..2ec5b5b5e7 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -404,4 +404,38 @@ test_expect_success 'Bloom generation backfills empty commits' '
 	)
 '
 
+get_bdat_offset () {
+	perl -0777 -ne \
+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
+		.git/objects/info/commit-graph
+}
+
+get_first_changed_path_filter () {
+	BDAT_OFFSET=$(get_bdat_offset) &&
+	perl -0777 -ne \
+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
+		.git/objects/info/commit-graph
+}
+
+# chosen to be the same under all Unicode normalization forms
+CENT=$(printf "\xc2\xa2")
+
+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
+	git init highbit1 &&
+	test_commit -C highbit1 c1 "$CENT" &&
+	git -C highbit1 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 1 changed-path' '
+	(cd highbit1 &&
+		printf "52a9" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 1 changed-path used when version 1 requested' '
+	(cd highbit1 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
@ 2023-05-31 23:12   ` Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
  2 siblings, 0 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version is
to be accepted when read (and which version to write), a config variable
is needed.

Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.

This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is 0, 1, or 2.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config/commitgraph.txt | 16 +++++++++++++---
 commit-graph.c                       |  2 +-
 oss-fuzz/fuzz-commit-graph.c         |  2 +-
 repo-settings.c                      |  6 +++++-
 repository.h                         |  2 +-
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index 30604e4a4c..eaa10bf232 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -9,6 +9,16 @@ commitGraph.maxNewFilters::
 	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
 
 commitGraph.readChangedPaths::
-	If true, then git will use the changed-path Bloom filters in the
-	commit-graph file (if it exists, and they are present). Defaults to
-	true. See linkgit:git-commit-graph[1] for more information.
+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
+	changedPathsVersion=0 if false.
+
+commitGraph.changedPathsVersion::
+	Specifies the version of the changed-path Bloom filters that Git will read and
+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	match the version set in this config variable will be ignored.
++
+Defaults to 1.
++
+If 0, git will write version 1 Bloom filters when instructed to write.
++
+See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index c11b59f28b..bd448047f1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -399,7 +399,7 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_read_changed_paths) {
+	if (s->commit_graph_changed_paths_version == 1) {
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
diff --git a/oss-fuzz/fuzz-commit-graph.c b/oss-fuzz/fuzz-commit-graph.c
index 914026f5d8..b56731f51a 100644
--- a/oss-fuzz/fuzz-commit-graph.c
+++ b/oss-fuzz/fuzz-commit-graph.c
@@ -18,7 +18,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	 * possible.
 	 */
 	the_repository->settings.commit_graph_generation_version = 2;
-	the_repository->settings.commit_graph_read_changed_paths = 1;
+	the_repository->settings.commit_graph_changed_paths_version = 1;
 	g = parse_commit_graph(&the_repository->settings, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
diff --git a/repo-settings.c b/repo-settings.c
index 3dbd3f0e2e..6cbe02681b 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -24,6 +24,7 @@ void prepare_repo_settings(struct repository *r)
 	int value;
 	const char *strval;
 	int manyfiles;
+	int readChangedPaths;
 
 	if (!r->gitdir)
 		BUG("Cannot add settings for uninitialized repository");
@@ -54,7 +55,10 @@ void prepare_repo_settings(struct repository *r)
 	/* Commit graph config or default, does not cascade (simple) */
 	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
 	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
-	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
+	repo_cfg_int(r, "commitgraph.changedpathsversion",
+		     &r->settings.commit_graph_changed_paths_version,
+		     readChangedPaths ? 1 : 0);
 	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
 	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
 
diff --git a/repository.h b/repository.h
index e8c67ffe16..1f1c32a6dd 100644
--- a/repository.h
+++ b/repository.h
@@ -32,7 +32,7 @@ struct repo_settings {
 
 	int core_commit_graph;
 	int commit_graph_generation_version;
-	int commit_graph_read_changed_paths;
+	int commit_graph_changed_paths_version;
 	int gc_write_commit_graph;
 	int gc_cruft_packs;
 	int fetch_write_commit_graph;
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3
  2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
  2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
@ 2023-05-31 23:12   ` Jonathan Tan
  2 siblings, 0 replies; 14+ messages in thread
From: Jonathan Tan @ 2023-05-31 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Junio C Hamano

The murmur3 implementation in bloom.c has a bug when converting series
of 4 bytes into network-order integers when char is signed (which is
controllable by a compiler option, and the default signedness of char is
platform-specific). When a string contains characters with the high bit
set, this bug causes results that, although internally consistent within
Git, does not accord with other implementations of murmur3 and even with
Git binaries that were compiled with different signedness of char. This
bug affects both how Git writes changed path filters to disk and how Git
interprets changed path filters on disk.

Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.

Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.

So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number.

There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it neither after this complete patch (or before),
but in any case it seemed like a good thing to include that might help
future patch authors.

The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:

  package main

  import "fmt"
  import "github.com/spaolacci/murmur3"

  func main() {
          fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!")))
          fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}))
  }

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config/commitgraph.txt |  2 +-
 bloom.c                              | 65 ++++++++++++++++++++++++++--
 bloom.h                              |  8 ++--
 commit-graph.c                       | 29 ++++++++++---
 t/helper/test-bloom.c                |  9 +++-
 t/t0095-bloom.sh                     |  8 ++++
 t/t4216-log-bloom.sh                 | 31 +++++++++++++
 7 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index eaa10bf232..c64ee4f459 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -14,7 +14,7 @@ commitGraph.readChangedPaths::
 
 commitGraph.changedPathsVersion::
 	Specifies the version of the changed-path Bloom filters that Git will read and
-	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
 	match the version set in this config variable will be ignored.
 +
 Defaults to 1.
diff --git a/bloom.c b/bloom.c
index d0730525da..915d8e5a31 100644
--- a/bloom.c
+++ b/bloom.c
@@ -65,7 +65,64 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	uint32_t k;
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
+		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
+		k *= c1;
+		k = rotate_left(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_left(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1)) {
+	case 3:
+		k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_left(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	const uint32_t hash0 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
+	const uint32_t hash1 = (settings->hash_version == 2
+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/bloom.h b/bloom.h
index adde6dfe21..0c33ae282c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -7,9 +7,11 @@ struct repository;
 struct bloom_filter_settings {
 	/*
 	 * The version of the hashing technique being used.
-	 * We currently only support version = 1 which is
+	 * The newest version is 2, which is
 	 * the seeded murmur3 hashing technique implemented
-	 * in bloom.c.
+	 * in bloom.c. Bloom filters of version 1 were created
+	 * with prior versions of Git, which had a bug in the
+	 * implementation of the hash function.
 	 */
 	uint32_t hash_version;
 
@@ -75,7 +77,7 @@ struct bloom_key {
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len);
+uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
 
 void fill_bloom_key(const char *data,
 		    size_t len,
diff --git a/commit-graph.c b/commit-graph.c
index bd448047f1..36e6d09e74 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -302,15 +302,21 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 	return 0;
 }
 
+struct graph_read_bloom_data_data {
+	struct commit_graph *g;
+	int commit_graph_changed_paths_version;
+};
+
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
-	struct commit_graph *g = data;
+	struct graph_read_bloom_data_data *d = data;
+	struct commit_graph *g = d->g;
 	uint32_t hash_version;
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-	if (hash_version != 1)
+	if (hash_version != d->commit_graph_changed_paths_version)
 		return 0;
 
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
@@ -399,11 +405,16 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 			graph->read_generation_data = 1;
 	}
 
-	if (s->commit_graph_changed_paths_version == 1) {
+	if (s->commit_graph_changed_paths_version == 1
+	    || s->commit_graph_changed_paths_version == 2) {
+		struct graph_read_bloom_data_data data = {
+			.g = graph,
+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
+		};
 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
 			   &graph->chunk_bloom_indexes);
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
-			   graph_read_bloom_data, graph);
+			   graph_read_bloom_data, &data);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -2302,6 +2313,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
 	ctx->num_generation_data_overflows = 0;
 
+	if (r->settings.commit_graph_changed_paths_version < 0
+	    || r->settings.commit_graph_changed_paths_version > 2) {
+		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
+			r->settings.commit_graph_changed_paths_version);
+		return 0;
+	}
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
+		? 2 : 1;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2331,7 +2350,7 @@ int write_commit_graph(struct object_directory *odb,
 		g = ctx->r->objects->commit_graph;
 
 		/* We have changed-paths already. Keep them in the next graph */
-		if (g && g->chunk_bloom_data) {
+		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
 			ctx->bloom_settings = g->bloom_filter_settings;
 		}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6c900ca668..34b8dd9164 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -48,6 +48,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 
 static const char *bloom_usage = "\n"
 "  test-tool bloom get_murmur3 <string>\n"
+"  test-tool bloom get_murmur3_seven_highbit\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool bloom get_filter_for_commit <commit-hex>\n";
 
@@ -62,7 +63,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded(0, argv[2], strlen(argv[2]));
+		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
+	}
+
+	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
+		uint32_t hashed;
+		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..c8d84ab606 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute unseeded murmur3 hash for test string 3' '
+	cat >expect <<-\EOF &&
+	Murmur3 Hash with seed=0:0xa183ccfd
+	EOF
+	test-tool bloom get_murmur3_seven_highbit >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 2ec5b5b5e7..764c6dee0f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -438,4 +438,35 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
 		test_bloom_filters_used "-- $CENT")
 '
 
+test_expect_success 'version 1 changed-path not used when version 2 requested' '
+	(cd highbit1 &&
+		git config --add commitgraph.changedPathsVersion 2 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
+	git init highbit2 &&
+	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
+	test_commit -C highbit2 c2 "$CENT" &&
+	git -C highbit2 commit-graph write --reachable --changed-paths
+'
+
+test_expect_success 'check value of version 2 changed-path' '
+	(cd highbit2 &&
+		printf "c01f" >expect &&
+		get_first_changed_path_filter >actual &&
+		test_cmp expect actual)
+'
+
+test_expect_success 'version 2 changed-path used when version 2 requested' '
+	(cd highbit2 &&
+		test_bloom_filters_used "-- $CENT")
+'
+
+test_expect_success 'version 2 changed-path not used when version 1 requested' '
+	(cd highbit2 &&
+		git config --add commitgraph.changedPathsVersion 1 &&
+		test_bloom_filters_not_used "-- $CENT")
+'
+
 test_done
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-05-31 23:13 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
2023-05-23 13:00   ` Derrick Stolee
2023-05-23 23:00     ` Jonathan Tan
2023-05-23 23:51     ` Junio C Hamano
2023-05-24 21:26       ` Jonathan Tan
2023-05-26 13:19         ` Derrick Stolee
2023-05-30 17:26           ` Jonathan Tan
2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan

Code repositories for project(s) associated with this public inbox

	https://80x24.org/pub/scm/git/git.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).