Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>, derrickstolee@github.com
Subject: Changed path filter hash differs from murmur3 if char is signed
Date: Thu, 11 May 2023 15:40:59 -0700	[thread overview]
Message-ID: <20230511224101.972442-1-jonathantanmy@google.com> (raw)

This issue arose while I was implementing changed path filters (the
bloom filters in the commit graph file) for JGit [1].

It turns out that if a repository contains paths that have at least one
character greater than 0x7f, Git writes these filters and interprets
them (when read from the commit graph file) differently depending on
whether "char" is signed or unsigned, and when "char" is signed (this
can be controlled by a compiler flag), the implementation of the murmur3
hash differs from the one in Wikipedia and elsewhere. (I looked into
Git's Makefile and didn't see anything that controlled whether "char"
is signed, so I presume that the default of the compiler used applies.)
This can be seen from the murmur3_seeded() function in bloom.c:

> uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
> {
[snip]
> 		uint32_t byte1 = (uint32_t)data[4*i];

Note that data[4*i] is of type "char". When I first read this, I thought
that no sign extension would occur regardless of the signedness of char
(that is, (uint32_t)(signed char)-1 == (uint32_t)(unsigned char)-1)
but it turns out that sign extension does occur if the thing being cast
is signed (that is, (uint32_t)(signed char)-1 != (uint32_t)(unsigned
char)-1) [2]. The implementation of the murmur3 hash in Wikipedia (and
presumably the intention of the author here, since the variable is named
"byte1") expects this value to not exceed 0xff, but this is not the case
if sign extension occurs.

I looked if this was discussed at the time this patch was presented [3]
but couldn't find anything related.

So...how do we proceed? I can see at least 2 ways:

 - Decide that we're going to stick with the details of the existing
   implementation and declare that "data" is always interpreted as signed.
   In that case, I would put "signed" wherever necessary, rename the
   function to something that is not "murmur3", and change the names of
   byte1 etc. to indicate that they are not constrained to be less than or
   equal to 0xff.

 - Bump the version number to 2 and correct the implementation to
   match murmur3 (so, "data" is unsigned). Then we would have to think of
   a transition plan. One possible one might be to always reject version
   1 bloom filters, which I'm personally OK with, but it may seem too
   heavy a cost to some since in the perhaps typical case where a repo has
   filenames restricted to 0x7f and below, the existing bloom filters are
   still correct.

Of the two, I prefer the latter, but I'm open to suggestions.

[1] https://git.eclipse.org/r/c/jgit/jgit/+/201851/2
[2] Verified in practice and also prescribed by the C standard; see
  e.g. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf section
  6.3.1.3 point 2.
[3] https://lore.kernel.org/git/a5aa3415c05ee9bc67a9471445a20c71a9834673.1586192395.git.gitgitgadget@gmail.com/



             reply	other threads:[~2023-05-11 22:41 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-11 22:40 Jonathan Tan [this message]
2023-05-11 22:51 ` Changed path filter hash differs from murmur3 if char is signed Junio C Hamano
2023-05-11 23:10   ` Taylor Blau
2023-05-12 17:33     ` Jonathan Tan
2023-05-12 19:42       ` Junio C Hamano
2023-05-12 20:54         ` Jonathan Tan
2023-05-12 21:27           ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230511224101.972442-1-jonathantanmy@google.com \
    --to=jonathantanmy@google.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).