Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, derrickstolee@github.com
Subject: Re: Changed path filter hash differs from murmur3 if char is signed
Date: Fri, 12 May 2023 10:33:29 -0700	[thread overview]
Message-ID: <20230512173330.1072880-1-jonathantanmy@google.com> (raw)
In-Reply-To: <ZF116EDcmAy7XEbC@nand.local>

Taylor Blau <me@ttaylorr.com> writes:
> On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote:
> > If path filter hashing were merely advisory, in the sense that if a
> > matching data is found, great, the processing goes faster, but if
> > not, we would get correct results albeit not so quickly, a third
> > option would be to just update the implementation without updating
> > the version number.  But we may not be so lucky---you must have seen
> > a wrong result returned quickly, which is not what we want to see.
> 
> Right; from my understanding of Jonathan's message, I believe that we
> would get the wrong results if the implementation of not-quite-murmur3
> were corrected without updating the on-disk Bloom filters.

Yes - if the bloom filter contained junk data (in our example, created
using a different hash function on filenames that have characters that
exceed 0x7f), the bloom filter would report "no, this commit does not
contain a change in such-and-such path" and then we would skip the
commit, even if the commit did have a change in that path.

> We already have the bloom_filter_settings struct, which could also
> encode whether or not the data was computed with sign extension or not.
> If we are consistent in computing the hashes when we write Bloom filters
> and when we re-compute hashes to query them, I'd think we would be able
> to reuse the existing filters.
> 
> That would be nice to do if we could. Unfortunately, I don't think there
> is an easy way to determine the signed-ness of an existing set of Bloom
> filters, nor a guarantee that they all have the same signed-ness (IOW,
> the Bloom filters could have been built up across multiple copies of
> Git, each with different compiler settings).
> 
> So I'm not sure that that is a productive direction for us to take.

Agreed.

> > But if I recall correctly we made the file format in such a way that
> > bumping the version number is cheap in that transition can appear
> > seamless.  An updated implementation can just be told to _ignore_
> > old and possibly incorrect Bloom filters until it gets told to
> > recompute, at which time it can write a correct one with a new
> > version number.  So I would prefer your "Bump the version number and
> > ignore the old and possibly wrong data".
> 
> Version numbers are easy to increment, as is the case when adding new
> chunks to the chunked file format used by commit-graph,
> multi-pack-index, etc.
> 
> However, computing Bloom filters en-masse from scratch is fairly
> expensive. At GitHub when we were rolling out Bloom filters to all
> repositories a few years ago, I wrote 809e0327f5
> (builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18)
> to avoid spending too much time computing new filters from scratch.

Thanks for this pointer.

> If we do have to re-compute all of the filters from scratch, it may be
> worth considering enumerating all of the repository's paths first and
> seeing if any of them are >0xff. If none are, we can propagate the
> existing filters forward without having to compute new ones.

One way server operators can do this is by:
 - patching their Git to ignore the bloom filter version number
 - in a batch job, enumerate all trees and see their filenames
 - for the repos that have <=0x7f filenames, bump version number
 - update Git to the latest version (that uses the new hash)

Enumerating all trees saves on revision walking, which hopefully will
speed things up.

I don't have statistics on this, but if the majority of repos have
only <=0x7f filenames (which seems reasonable to me), this might save
sufficient work that we can proceed with bumping the version number and
ignoring old data.

> Better yet, we should be able to reuse existing Bloom filter data for
> paths that have all characters <=0xff, and only recompute them where
> necessary. That makes much more sense than the previous paragraph.

I would think that at the point that we know the paths that have been
changed for a commit, the cost of computation of the bloom filter (all
in the CPU) is negligible compared to the I/O it took for us to retrieve
all the paths (so the solution of just ignoring old data seems better
to me). But perhaps at large scales, we do want to save the computation
time anyway.

  reply	other threads:[~2023-05-12 17:33 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan
2023-05-11 22:51 ` Junio C Hamano
2023-05-11 23:10   ` Taylor Blau
2023-05-12 17:33     ` Jonathan Tan [this message]
2023-05-12 19:42       ` Junio C Hamano
2023-05-12 20:54         ` Jonathan Tan
2023-05-12 21:27           ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230512173330.1072880-1-jonathantanmy@google.com \
    --to=jonathantanmy@google.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).