Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	git@vger.kernel.org, derrickstolee@github.com
Subject: Re: Changed path filter hash differs from murmur3 if char is signed
Date: Thu, 11 May 2023 19:10:32 -0400	[thread overview]
Message-ID: <ZF116EDcmAy7XEbC@nand.local> (raw)
In-Reply-To: <xmqqjzxetrbv.fsf@gitster.g>

On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > So...how do we proceed? I can see at least 2 ways:
> >
> >  - Decide that we're going to stick with the details of the existing
> >    implementation and declare that "data" is always interpreted as signed.
> >    In that case, I would put "signed" wherever necessary, rename the
> >    function to something that is not "murmur3", and change the names of
> >    byte1 etc. to indicate that they are not constrained to be less than or
> >    equal to 0xff.
> >
> >  - Bump the version number to 2 and correct the implementation to
> >    match murmur3 (so, "data" is unsigned). Then we would have to think of
> >    a transition plan. One possible one might be to always reject version
> >    1 bloom filters, which I'm personally OK with, but it may seem too
> >    heavy a cost to some since in the perhaps typical case where a repo has
> >    filenames restricted to 0x7f and below, the existing bloom filters are
> >    still correct.
>
> If path filter hashing were merely advisory, in the sense that if a
> matching data is found, great, the processing goes faster, but if
> not, we would get correct results albeit not so quickly, a third
> option would be to just update the implementation without updating
> the version number.  But we may not be so lucky---you must have seen
> a wrong result returned quickly, which is not what we want to see.

Right; from my understanding of Jonathan's message, I believe that we
would get the wrong results if the implementation of not-quite-murmur3
were corrected without updating the on-disk Bloom filters.

We already have the bloom_filter_settings struct, which could also
encode whether or not the data was computed with sign extension or not.
If we are consistent in computing the hashes when we write Bloom filters
and when we re-compute hashes to query them, I'd think we would be able
to reuse the existing filters.

That would be nice to do if we could. Unfortunately, I don't think there
is an easy way to determine the signed-ness of an existing set of Bloom
filters, nor a guarantee that they all have the same signed-ness (IOW,
the Bloom filters could have been built up across multiple copies of
Git, each with different compiler settings).

So I'm not sure that that is a productive direction for us to take.

> But if I recall correctly we made the file format in such a way that
> bumping the version number is cheap in that transition can appear
> seamless.  An updated implementation can just be told to _ignore_
> old and possibly incorrect Bloom filters until it gets told to
> recompute, at which time it can write a correct one with a new
> version number.  So I would prefer your "Bump the version number and
> ignore the old and possibly wrong data".

Version numbers are easy to increment, as is the case when adding new
chunks to the chunked file format used by commit-graph,
multi-pack-index, etc.

However, computing Bloom filters en-masse from scratch is fairly
expensive. At GitHub when we were rolling out Bloom filters to all
repositories a few years ago, I wrote 809e0327f5
(builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18)
to avoid spending too much time computing new filters from scratch.

If we do have to re-compute all of the filters from scratch, it may be
worth considering enumerating all of the repository's paths first and
seeing if any of them are >0xff. If none are, we can propagate the
existing filters forward without having to compute new ones.

Better yet, we should be able to reuse existing Bloom filter data for
paths that have all characters <=0xff, and only recompute them where
necessary. That makes much more sense than the previous paragraph.

Thanks,
Taylor

  reply	other threads:[~2023-05-11 23:10 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan
2023-05-11 22:51 ` Junio C Hamano
2023-05-11 23:10   ` Taylor Blau [this message]
2023-05-12 17:33     ` Jonathan Tan
2023-05-12 19:42       ` Junio C Hamano
2023-05-12 20:54         ` Jonathan Tan
2023-05-12 21:27           ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZF116EDcmAy7XEbC@nand.local \
    --to=me@ttaylorr.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).