Re: Is GIT_DEFAULT_HASH flawed?

Git Mailing List Archive mirror
 help / color / mirror / Atom feed

From: Felipe Contreras <felipe.contreras@gmail.com>
To: demerphq <demerphq@gmail.com>,
	Felipe Contreras <felipe.contreras@gmail.com>
Cc: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	git@vger.kernel.org, Adam Majer <adamm@zombino.com>
Subject: Re: Is GIT_DEFAULT_HASH flawed?
Date: Wed, 03 May 2023 12:20:45 -0600	[thread overview]
Message-ID: <6452a5fdf2bd9_6822945f@chronos.notmuch> (raw)
In-Reply-To: <CANgJU+UasufF7-B8ukEMm_Lv8gu4wUpaVKa9AOBacDHJvi7fxQ@mail.gmail.com>

demerphq wrote:
> On Wed, 3 May 2023 at 02:17, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > Changing the subject as this message seems like a different topic.
> > Jeff King wrote:
> > > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > > > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > > >
> > > > >  `GIT_DEFAULT_HASH`::
> > > > >   If this variable is set, the default hash algorithm for new
> > > > >   repositories will be set to this value. This value is currently
> > > > > + ignored when cloning if the remote value can be definitively
> > > > > + determined; the setting of the remote repository is used
> > > > > + instead. The value is honored if the remote repository's
> > > > > + algorithm cannot be determined, such as some cases when
> > > > > + the remote repository is empty. The default is "sha1".
> > > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > > > + in linkgit:git-init[1].
> > > >
> > > > We'd need to evantually cover all the transports (and non-transport
> > > > like the "--local" optimization) so that the object-format and other
> > > > choices are communicated from the origin to a new clone anyway, so
> > > > this extra complexity "until X is fixed, it behaves this way, but
> > > > otherwise the variable is read in the meantime" may be a disservice
> > > > to the end users, even though it may make it easier in the shorter
> > > > term for maintainers of programs that rely on the buggy "git clone"
> > > > that partially honored this environment variable.
> > > >
> > > > In short, I am still not convinced that the above is a good design
> > > > choice in the longer term.
> > >
> > > I also think it is working against the backwards-compatible design of
> > > the hash function transition.
> >
> > To be honest this whole approach seems to be completely flawed to me and
> > against the whole design of git in the first place.
> >
> > In a recent email Linus Torvalds explained why object ids were
> > calculated based {type, size, data} [1], and he explained very clearly
> > that two objects with exactly the same data are not supposed to have the
> > same id if the type is different.
> 
> He said:
> 
> --- quote-begin ---
> The "no aliasing" means that no two distinct pointers can point to the
> same data. So a tagged pointer of type "commit" can not point to the
> same object as a tagged pointer of type "blob". They are distinct
> pointers, even if (maybe) the commit object encoding ends up then
> being identical to a blob object.
> --- quote-end ---
> 
> As far as I could tell he didn't really explain *why* he wanted this,
> and IMO it is non-obvious why he would care if a blob and a commit had
> the same text, and thus the same ID. He just said he didnt want it to
> happen, not why.

But we don't need to understand why to know it's part of the core design.

If something is part of the core design as a rule it's better to not
mess with it.

> I can imagine some aesthetic reasons why you might want to ensure that
> no blob has the same ID as a commit, and I can imagine it might make
> debugging easier at certain points, but it seems unnecessary given the
> data is write once.

I don't know, but to me separating objects makes sense not just conceptually,
but in practice there's a whole class of potential errors that could be
avoided.

For example, I can think of an implementation of `git prune` that would check
commits first, then trees, then blobs, and the blobs that not reachable from
any trees are removed. But if a commit can have the same id as a blob, you have
to think of a different implementation.

If that's not possible, then you just forget about those potential issues.

> > If even the tiniest change such as adding a period to a commit messange
> > changes the object id (and thus semantically makes it a different
> > object), then it makes sense that changing the type of an object also
> > changes the object id (and thus it's also a different object).
> >
> > And because the id of the parent is included in the content of every
> > commit, the top-level id ensures the integrity of the whole graph.
> >
> > But then comes this notion that the hash algorithm is a property of the
> > repository, and not part of the object storage, which means changing the
> > whole hash algorithm of a repository is considered less of a change than
> > adding a period to the commit message, worse: not a change at all.
> 
> I really dont understand why you think having two hash functions
> producing different results for the same data is comparable to a
> single hash producing different results for different data.

That depends on what you consider the "data" to be.

If you consider the content of a blob to be the data, then you wouldn't
have different results if a commit has the same data: it would be the
same id.

If instead you consider the data to be `type+content`, then you would
have different results.

> In one case you have two different continuum of identifiers, with one
> ID per continuum, and in the other you have two different identifiers
> in the same continuum, and  if you a continuum you would have 4
> different identifiers right? Eg, the two cases are really quite
> different at a fundamental level.

That entirely depends on what data you hash.

If you hash `algo+type+size+data` there's only one id per object.
Period.

> > I am reminded of the warning Sam Smith gave to the Git project [2] which
> > seemed to be unheard, but the notion of cryptographic algorithm agility
> > makes complete sense to me.
> >
> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> Isn't this orthagonal to your other points?

Not if you consider changing the hash algorithm of a repository to be an
important part of its history (more important than adding a period to a
commit).

> > Changing the hash algorithm of one commit should change the object id of
> > that commit, and thus make it semantically a different commit.
> >
> > In other words: an object of type "blob" should never be confused with
> > an object of type "blob:sha-256", even if the content is exactly the
> > same.
> 
> This doesn't make sense to me.  As long as we can distinguish the
> hashes produced by the different hash functions in use we can create a
> mapping of the data that is hashed such that we have a 1:1 mapping of
> identifiers of each type at which point it really doesn't matter which
> hash function is used.

Yes, we *can*, that doesn't mean we *should*.

If you do `git commit --ammend --signoff` to add your `Signed-off-by` to
a commit, there's a 1:1 mapping from the original commit, to the new
one, but conceptually in git they are different objects.

I recall Linus Torvalds mentioned he used Monotone as a guideline of
what *not* to do. In Monotone you could add the equivalent of
`Signed-off-by` without changing the hash of the commit, in fact, you
could add any metadata if I recall correctly. But this opens a whole can
of worms because now how do you know you have all the metadata relevant
to the commit?

Making all the metadata of a commit part of the commit solves the
integrity problem Monotone had at the cost of making git commits
essentially immutable: any change means it's a different commit.

If making *any* change in the object, makes it conceptually a different
object, including the type of the object, how on Earth is changing the
hash algorithm not considered a change?

This object:

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Is considered different from this object:

  ❯ git hash-object -t tree /dev/null
  4b825dc642cb6eb9a060e54bf8d69288fbee4904

That's why they have a different hash.

Why would these objects be considered the same?

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

  ❯ git hash-object -t blob /dev/null
  473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813

It makes *zero* sense that adding a period changes the object, adding a
s-o-b changes the object, changing the type changes the object, but
changing the hash algorithm does not.

> > The fact that apparently it's so easy to clone a repository with
> > the wrong hash algorithm should give developers pause, as it means the
> > whole point of using cryptographic hash algorithms to ensure the
> > integrity of the commit history is completely gone.
> 
> This is a leap too far. The fact that it is "so easy to clone a repo
> with the wrong hash algorithm" is completely orthogonal to the
> fundamental principles of hash identifiers from strong hash functions.

Only if you think changing the hash algorithm is a less important part
of an object than adding a period.

Do you honestly think these two should be considered the same object?

a)

  tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

b)

  tree 6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

> You seem to be deriving grand conclusions from what sounds to me like
> a simple bug/design-oversight.

I think you are dismissing the brilliant idea that made git's object
storage model so successful.

> > I have not been following the SHA-1 -> OID discussions, but I
> > distinctively recall Linus Torvalds mentioning that the choice of using
> > SHA-1 wasn't even for security purposes, it was to ensure integrity.
> > When I do a `git fetch` as long as the new commits have the same SHA-1
> > as parent as the SHA-1s I have in my repository I can be relatively
> > certain the repository has not been tampered with. Which means that if I
> > do a `git fetch` that suddenly brings SHA-256 commits, some of them must
> > have SHA-1 parents that match the ones I currently have. Otherwise how
> > do I know it's the same history?
> 
> So consider what /could/ happen here. You fetch a commit which uses
> SHA-256 into a repo where all of your local commits use SHA-1. The
> commit you fetched says its parent is some SHA-256 ID you don't know
> about as all your ID's are SHA-1. So git then could go and construct
> an index, hashing each item using SHA-256 instead of SHA-1, and using
> the result to build a bi-directional mapping from SHA-1 to SHA-256 and
> back.  All it has to do then is look into the mapping to find if the
> SHA-256 parent id is present in your repo. If it is then you know it's
> the same history.

Yeah, that *could* happen. Doesn't mean it *should*.

> The key point here is that if you ignore SHAttered artifacts (which
> seems reasonable as you can detect the attack during hashing)  you can
> build a 1:1 map of SHA-1 and SHA-256 ids.  Once you have that mapping
> it doesn't matter which ID is used.

It may not matter to you. It matters to me.

> > Maybe that's one of the reasons people don't seem particularly eager to
> > move away from SHA-1:
> 
> Maybe, but it doesn't make sense to me.  You seem to be putting undue
> weight on an unnecessary aspect of the git design: there doesn't seem
> to be a reason for Linuses "no aliasing" policy, and it seems like one
> could build a git-a-like without it without suffering any significant
> penalties.

The fact you don't see a reason doesn't mean there isn't one. This is an
argument from ignorance fallacy.

The appendix was considered a vestigial organ because nobody could see a
reason for it. Did that mean it served no puprpose? No.

> Regardless, provided that the hash functions allow a 1:1 mapping of
> ID's (which is assumed by using "collision free hash functions"), it
> seems like it really doesn't matter which hash is used at any given
> time.

People who designed CVS, Subversion, Monotone, and Mercurial didn't see
a reason for many of Git's design choices either.

I'd argue they were wrong.

I think changing the hash algorithm of a commit matters.

Cheers.

-- 
Felipe Contreras

next prev parent reply	other threads:[~2023-05-03 18:20 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-05 10:28 git clone of empty repositories doesn't preserve hash Adam Majer
2023-04-05 19:04 ` Junio C Hamano
2023-04-05 19:47   ` Adam Majer
2023-04-05 20:01     ` Jeff King
2023-04-05 20:40       ` Junio C Hamano
2023-04-05 21:15         ` Junio C Hamano
2023-04-05 21:26           ` Jeff King
2023-04-05 22:48           ` brian m. carlson
2023-04-06 13:11           ` Adam Majer
2023-04-25 21:35           ` brian m. carlson
2023-04-25 22:24             ` Junio C Hamano
2023-04-25 23:12             ` Junio C Hamano
2023-04-26  0:20               ` brian m. carlson
2023-04-26 11:25                 ` Jeff King
2023-04-26 15:08                   ` Junio C Hamano
2023-04-26 15:13                     ` [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone" Junio C Hamano
2023-04-26 21:06                       ` brian m. carlson
2023-04-27  4:46                     ` git clone of empty repositories doesn't preserve hash Jeff King
2023-04-26 10:51               ` Jeff King
2023-04-26 15:42                 ` Junio C Hamano
2023-04-26 20:40                 ` brian m. carlson
2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
2023-04-26 21:14                       ` Junio C Hamano
2023-04-26 21:28                         ` brian m. carlson
2023-04-27  5:00                           ` Jeff King
2023-04-27  5:30                       ` Jeff King
2023-04-27 20:40                         ` Junio C Hamano
2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
2023-04-26 21:18                       ` Junio C Hamano
2023-04-26 21:33                       ` Junio C Hamano
2023-04-27  5:43                         ` Jeff King
2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
2023-05-03  9:03                             ` Adam Majer
2023-05-03 15:44                               ` Felipe Contreras
2023-05-03 17:21                                 ` Adam Majer
2023-05-08  0:34                                   ` Felipe Contreras
2023-05-03  9:09                             ` demerphq
2023-05-03 18:20                               ` Felipe Contreras [this message]
2023-05-03 22:54                             ` brian m. carlson
2023-05-08  2:00                               ` Felipe Contreras
2023-05-08 21:38                                 ` brian m. carlson
2023-05-09 10:32                                   ` Oswald Buddenhagen
2023-05-09 16:47                                     ` Junio C Hamano
2023-04-26 21:12                     ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-04-27  4:56                   ` git clone of empty repositories doesn't preserve hash Jeff King
2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
2023-05-01 17:00                     ` [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
2023-05-01 22:40                       ` Jeff King
2023-05-01 22:51                         ` Junio C Hamano
2023-05-01 17:37                     ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
2023-05-17 19:24                     ` [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
2023-05-17 21:48                     ` [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-05-17 22:28                       ` brian m. carlson
2023-05-18 18:28                     ` Jeff King
2023-05-19 15:32                       ` brian m. carlson
2023-04-05 21:23         ` git clone of empty repositories doesn't preserve hash Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6452a5fdf2bd9_6822945f@chronos.notmuch \
    --to=felipe.contreras@gmail.com \
    --cc=adamm@zombino.com \
    --cc=demerphq@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).