Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, Derrick Stolee <derrickstolee@github.com>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3
Date: Wed, 19 Jul 2023 14:24:44 -0400	[thread overview]
Message-ID: <ZLgqbB2JG8+HPoHN@nand.local> (raw)
In-Reply-To: <47ba89c565d3383a8a14272fe52ac274be82d0be.1689283789.git.jonathantanmy@google.com>

On Thu, Jul 13, 2023 at 02:42:11PM -0700, Jonathan Tan wrote:
> The murmur3 implementation in bloom.c has a bug when converting series
> of 4 bytes into network-order integers when char is signed (which is
> controllable by a compiler option, and the default signedness of char is
> platform-specific). When a string contains characters with the high bit
> set, this bug causes results that, although internally consistent within
> Git, does not accord with other implementations of murmur3 and even with
> Git binaries that were compiled with different signedness of char. This
> bug affects both how Git writes changed path filters to disk and how Git
> interprets changed path filters on disk.

I think that you make a worthwhile point that the existing
implementation is internally consistent, but doesn't actually implement
a conventional murmur3. I wonder if it's worth being explicit where you
mention its internal consistency to say that the existing implementation
would never cause us to produce wrong results, but wouldn't be readable
by other off-the-shelf implementations of murmur3.

(To be clear, I think that you already make this point, I'm just
suggesting that it may be worth spelling it out even more explicitly
than what is written above).

> Therefore, introduce a new version (2) of changed path filters that
> corrects this problem. The existing version (1) is still supported and
> is still the default, but users should migrate away from it as soon
> as possible.

Makes sense.

> Because this bug only manifests with characters that have the high bit
> set, it may be possible that some (or all) commits in a given repo would
> have the same changed path filter both before and after this fix is
> applied. However, in order to determine whether this is the case, the
> changed paths would first have to be computed, at which point it is not
> much more expensive to just compute a new changed path filter.

Hmm. I think in the general case that is true, but I wonder if there's a
shortcut we could take for trees that are known to not have *any*
characters with their high-order bits set. That is, if we scan both of
the first parent's trees and determine that no such paths exist, the
contents of the Bloom filter would be the same in either version, right?

I think that that would be faster than recomputing all filters from
scratch. In either case, we have to load the whole tree. But if we can
quickly scan (and cache our results by setting some bit--indicating the
absence of paths with characters having their highest bit set--on the tree
objects' `flags` field), then we should be able to copy forward the
existing representation of the filter.

I think the early checks would be more expensive, since in the worst
case you have to walk the entire tree, only to realize that you actually
wanted to compute a first-parent tree diff, meaning you have to
essentially repeat the whole walk over again. But for repositories that
have few or no commits whose Bloom filters need computing, I think it
would be significantly faster, since many of the sub-trees wouldn't need
to be visited again.

> There is a change in write_commit_graph(). graph_read_bloom_data()
> makes it possible for chunk_bloom_data to be non-NULL but
> bloom_filter_settings to be NULL, which causes a segfault later on. I
> produced such a segfault while developing this patch, but couldn't find
> a way to reproduce it neither after this complete patch (or before),
> but in any case it seemed like a good thing to include that might help
> future patch authors.

Hmm. Interesting, I'd love to know more about what you were doing that
produced the segfault. I think it would be worth it to prove to
ourselves that this segfault can't occur in the wild. Or if it can, it
would be worth it to understand the bug and prevent it from happening.

> +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
>  {
>  	const uint32_t c1 = 0xcc9e2d51;
>  	const uint32_t c2 = 0x1b873593;
> @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
>  	int i;
>  	const uint32_t seed0 = 0x293ae76f;
>  	const uint32_t seed1 = 0x7e646e2c;
> -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> +	const uint32_t hash0 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
> +	const uint32_t hash1 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);

I do admire the ternary operator over the function being called, as I
think that Stolee pointed out earlier in this series. But I worry that
these two checks could fall out of sync with each other, causing us to
pick incosistent values for hash0, and hash1, respectively.

FWIW, I would probably write this as:

    const uint32_t hash0, hash1;
    if (settings->hash_version == 2) {
        hash0 = murmur3_seeded_v2(seed0, data, len);
        hash1 = murmur3_seeded_v2(seed1, data, len);
    } else {
        hash0 = murmur3_seeded_v1(seed0, data, len);
        hash1 = murmur3_seeded_v1(seed1, data, len);
    }

I suppose that there isn't anything keeping the calls within each arm of
the if-statement above in sync with each other (i.e., I could call
murmur3_seeded_v1() immediately before dispatching a call to
murmur3_seeded_v2()). So in that sense I think that this is no more or
less safe than what's already written.

But IMHO I think this one reads more cleanly, so I might prefer it over
what you have in this patch. But I don't feel so strongly about it
either way.

> diff --git a/commit-graph.c b/commit-graph.c
> index 9b72319450..c50107eed5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -302,16 +302,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
>  	return 0;
>  }
>
> +struct graph_read_bloom_data_data {
> +	struct commit_graph *g;
> +	int *commit_graph_changed_paths_version;
> +};
> +

Nit: maybe `graph_read_bloom_data_context`, to avoid repeating "data"? I
don't have strong feelings here, FWIW.

The rest of the implementation and tests look good to me.

Thanks,
Taylor

  reply	other threads:[~2023-07-19 18:24 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-22 21:48 [PATCH 0/2] Changed path filter hash fix and version bump Jonathan Tan
2023-05-22 21:48 ` [PATCH 1/2] t4216: test wrong bloom filter version rejection Jonathan Tan
2023-05-22 21:48 ` [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2 Jonathan Tan
2023-05-23 13:00   ` Derrick Stolee
2023-05-23 23:00     ` Jonathan Tan
2023-05-23 23:51     ` Junio C Hamano
2023-05-24 21:26       ` Jonathan Tan
2023-05-26 13:19         ` Derrick Stolee
2023-05-30 17:26           ` Jonathan Tan
2023-05-23  4:42 ` [PATCH 0/2] Changed path filter hash fix and version bump Junio C Hamano
2023-05-31 23:12 ` [PATCH v2 0/3] " Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 1/3] t4216: test changed path filters with high bit paths Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 2/3] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-05-31 23:12   ` [PATCH v2 3/3] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-06-03  1:01   ` [PATCH v2 0/3] Changed path filter hash fix and version bump Junio C Hamano
2023-06-03  2:24     ` Junio C Hamano
2023-06-07 16:30       ` Jonathan Tan
2023-06-07 21:37         ` Jonathan Tan
2023-06-08 19:21 ` [PATCH v3 0/4] " Jonathan Tan
2023-06-08 19:21   ` [PATCH v3 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-06-08 19:52     ` Ramsay Jones
2023-06-12 21:26       ` Junio C Hamano
2023-06-08 19:21   ` [PATCH v3 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
2023-06-08 19:21   ` [PATCH v3 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-06-08 19:21   ` [PATCH v3 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-06-08 19:50   ` [PATCH v3 0/4] Changed path filter hash fix and version bump Ramsay Jones
2023-06-09  0:08     ` Jonathan Tan
2023-06-12 21:31     ` Junio C Hamano
2023-06-13 17:16       ` Jonathan Tan
2023-06-13 17:29         ` [PATCH] CodingGuidelines: use octal escapes, not hex Jonathan Tan
2023-06-13 18:16           ` Eric Sunshine
2023-06-13 18:43             ` Jonathan Tan
2023-06-13 19:15               ` Eric Sunshine
2023-06-13 19:29                 ` Junio C Hamano
2023-06-13 19:16         ` [PATCH v3 0/4] Changed path filter hash fix and version bump Junio C Hamano
2023-06-13 17:39 ` [PATCH v4 " Jonathan Tan
2023-06-13 17:39   ` [PATCH v4 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-06-13 21:58     ` Junio C Hamano
2023-06-20 13:22       ` Derrick Stolee
2023-06-21 12:08       ` Taylor Blau
2023-06-22 22:26         ` Jonathan Tan
2023-06-23 13:05           ` Derrick Stolee
2023-06-13 17:39   ` [PATCH v4 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
2023-06-13 17:39   ` [PATCH v4 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-06-20 13:28     ` Derrick Stolee
2023-06-21 12:14     ` Taylor Blau
2023-06-13 17:39   ` [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-06-20 13:39     ` Derrick Stolee
2023-06-20 18:37       ` Junio C Hamano
2023-06-13 19:21   ` [PATCH v4 0/4] Changed path filter hash fix and version bump Junio C Hamano
2023-06-20 13:43     ` Derrick Stolee
2023-06-20 21:56       ` Jonathan Tan
2023-06-21 12:19       ` Taylor Blau
2023-06-21 17:53         ` Derrick Stolee
2023-06-22 22:27           ` Jonathan Tan
2023-06-23 13:18             ` Derrick Stolee
2023-07-13 21:42 ` [PATCH v5 " Jonathan Tan
2023-07-13 21:42   ` [PATCH v5 1/4] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-07-19 17:25     ` Taylor Blau
2023-07-20 20:20       ` Jonathan Tan
2023-07-21  1:38         ` Taylor Blau
2023-07-13 21:42   ` [PATCH v5 2/4] t4216: test changed path filters with high bit paths Jonathan Tan
2023-07-13 22:50     ` Junio C Hamano
2023-07-19 17:27       ` Taylor Blau
2023-07-19 17:55         ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Taylor Blau
2023-07-19 17:55           ` [PATCH 1/4] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2023-07-19 17:55           ` [PATCH 2/4] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2023-07-19 17:55           ` [PATCH 3/4] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2023-07-19 17:55           ` [PATCH 4/4] fixup! t4216: test changed path filters with high bit paths Taylor Blau
2023-07-19 19:24           ` [PATCH 0/4] commit-graph: avoid looking at Bloom filter data directly Junio C Hamano
2023-07-20 20:22           ` Jonathan Tan
2023-07-13 21:42   ` [PATCH v5 3/4] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-07-19 18:10     ` Taylor Blau
2023-07-20 20:42       ` Jonathan Tan
2023-07-20 21:02         ` Taylor Blau
2023-07-13 21:42   ` [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-07-19 18:24     ` Taylor Blau [this message]
2023-07-20 21:27       ` Jonathan Tan
2023-07-26 23:32         ` Taylor Blau
2023-07-13 22:16   ` [PATCH v5 0/4] Changed path filter hash fix and version bump Junio C Hamano
2023-07-13 22:59     ` Junio C Hamano
2023-07-14 18:48     ` Jonathan Tan
2023-07-20 21:46 ` [PATCH v6 0/7] " Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
2023-07-26 23:26     ` Taylor Blau
2023-07-20 21:46   ` [PATCH v6 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
2023-07-26 23:28     ` Taylor Blau
2023-07-20 21:46   ` [PATCH v6 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-07-20 21:46   ` [PATCH v6 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-07-25 20:52   ` [PATCH v6 0/7] Changed path filter hash fix and version bump Junio C Hamano
2023-07-26 20:39   ` Junio C Hamano
2023-07-27  0:17     ` Taylor Blau
2023-07-27  0:49       ` Junio C Hamano
2023-07-27 17:39         ` Jonathan Tan
2023-07-27 17:56           ` Taylor Blau
2023-07-27 20:53             ` Jonathan Tan
2023-08-01 18:08               ` Taylor Blau
2023-08-01 18:52                 ` Jonathan Tan
2023-08-03  0:01                 ` Taylor Blau
2023-08-03 13:18                   ` Derrick Stolee
2023-08-03 18:45                     ` Taylor Blau
2023-07-27 18:44         ` Junio C Hamano
2023-08-01 18:41 ` [PATCH v7 " Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 1/7] gitformat-commit-graph: describe version 2 of BDAT Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 3/7] bloom.h: make `load_bloom_filter_from_graph()` public Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 4/7] t/helper/test-read-graph: implement `bloom-filters` mode Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 5/7] t4216: test changed path filters with high bit paths Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 6/7] repo-settings: introduce commitgraph.changedPathsVersion Jonathan Tan
2023-08-01 18:41   ` [PATCH v7 7/7] commit-graph: new filter ver. that fixes murmur3 Jonathan Tan
2023-08-01 18:44   ` [PATCH v7 0/7] Changed path filter hash fix and version bump Junio C Hamano
2023-08-01 20:58     ` Taylor Blau
2023-08-01 21:07       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZLgqbB2JG8+HPoHN@nand.local \
    --to=me@ttaylorr.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).