Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* git clone of empty repositories doesn't preserve hash
@ 2023-04-05 10:28 Adam Majer
  2023-04-05 19:04 ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Adam Majer @ 2023-04-05 10:28 UTC (permalink / raw)
  To: git

Hi all,

I've noticed while adding support for sha256 repositories for Gitea that,

git init --bare --object-format=sha256 a
git clone a b

Then the repository b is initialized as default hash, so sha1. It seems 
that receive-pack will list the null OID in the header if there are no 
refs available, but the upload-pack doesn't list anything and hence the 
header with capabilities and the hash function is missing

git receive-pack a
git upload-pack a

What is the right approach here? Could upload-pack send a NULL OID 
followed by header info that is then used by clone?

There is a workaround to specify the hash via GIT_DEFAULT_HASH. Not 
ideal though.

Thanks,
Adam


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 10:28 git clone of empty repositories doesn't preserve hash Adam Majer
@ 2023-04-05 19:04 ` Junio C Hamano
  2023-04-05 19:47   ` Adam Majer
  0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2023-04-05 19:04 UTC (permalink / raw)
  To: Adam Majer; +Cc: git

Adam Majer <adamm@zombino.com> writes:

> I've noticed while adding support for sha256 repositories for Gitea that,
>
> git init --bare --object-format=sha256 a
> git clone a b
>
> Then the repository b is initialized as default hash, so sha1. It
> seems that receive-pack will list the null OID in the header if there
> are no refs available, but the upload-pack doesn't list anything and
> hence the header with capabilities and the hash function is missing
>
> git receive-pack a
> git upload-pack a
>
> What is the right approach here? Could upload-pack send a NULL OID
> followed by header info that is then used by clone?

Interesting.

Does such a clone copy the name of the primary branch from the
remote repository to the newly created repository?  I recall we had
seen such a feature request but offhand I do not recall how we
solved it.  The need to transmit a capability even when there is no
concrete reference to be fetched to satisfy such a feature request
should be the same as yours, so there may already be a good place to
add that information.  Or we may have ended up not solving the "what
is the name of the primary branch in this empty repository?", in
which case, the solution for this "what is the hash function to be
used in this empty repository?" should be designed to be extensive
enough to allow us later support that feature easily.

Thanks for raising the issue.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 19:04 ` Junio C Hamano
@ 2023-04-05 19:47   ` Adam Majer
  2023-04-05 20:01     ` Jeff King
  0 siblings, 1 reply; 58+ messages in thread
From: Adam Majer @ 2023-04-05 19:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 4/5/23 21:04, Junio C Hamano wrote:
> Does such a clone copy the name of the primary branch from the
> remote repository to the newly created repository?

Yes it does.

# git init -b maestro --object-format=sha256 a
# git clone a b
# cat b/.git/HEAD
ref: refs/heads/maestro

- Adam


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 19:47   ` Adam Majer
@ 2023-04-05 20:01     ` Jeff King
  2023-04-05 20:40       ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Jeff King @ 2023-04-05 20:01 UTC (permalink / raw)
  To: Adam Majer; +Cc: Junio C Hamano, git

On Wed, Apr 05, 2023 at 09:47:55PM +0200, Adam Majer wrote:

> On 4/5/23 21:04, Junio C Hamano wrote:
> > Does such a clone copy the name of the primary branch from the
> > remote repository to the newly created repository?
> 
> Yes it does.
> 
> # git init -b maestro --object-format=sha256 a
> # git clone a b
> # cat b/.git/HEAD
> ref: refs/heads/maestro

Yeah, we send a special capability line in that case. If you do:

  GIT_TRACE_PACKET=1 git clone a b

you can see that upload-pack indicates that ls-refs understands the
"unborn" capability:

  packet:  upload-pack> version 2
  packet:  upload-pack> agent=git/2.40.0.824.g7b678b1f643
  packet:  upload-pack> ls-refs=unborn
  packet:  upload-pack> fetch=shallow wait-for-done
  packet:  upload-pack> server-option
  packet:  upload-pack> object-format=sha256
  packet:  upload-pack> object-info
  packet:  upload-pack> 0000

And then clone asks for it say "yes, I also understand unborn":

  packet:        clone> command=ls-refs
  packet:        clone> agent=git/2.40.0.824.g7b678b1f643
  packet:        clone> object-format=sha256
  packet:        clone> 0001
  packet:        clone> peel
  packet:        clone> symrefs
  packet:        clone> unborn
  packet:        clone> ref-prefix HEAD
  packet:        clone> ref-prefix refs/heads/
  packet:        clone> ref-prefix refs/tags/
  packet:        clone> 0000

And then upload-pack can send us the extra information:

  packet:  upload-pack> unborn HEAD symref-target:refs/heads/maestro
  packet:  upload-pack> 0000

I think we'd need to do something similar here for object-format.

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 20:01     ` Jeff King
@ 2023-04-05 20:40       ` Junio C Hamano
  2023-04-05 21:15         ` Junio C Hamano
  2023-04-05 21:23         ` git clone of empty repositories doesn't preserve hash Jeff King
  0 siblings, 2 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-05 20:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Adam Majer, git

Jeff King <peff@peff.net> writes:

> Yeah, we send a special capability line in that case. If you do:
>
>   GIT_TRACE_PACKET=1 git clone a b
>
> you can see that upload-pack indicates that ls-refs understands the
> "unborn" capability:
>
>   packet:  upload-pack> version 2
>   packet:  upload-pack> agent=git/2.40.0.824.g7b678b1f643
>   packet:  upload-pack> ls-refs=unborn
>   packet:  upload-pack> fetch=shallow wait-for-done
>   packet:  upload-pack> server-option
>   packet:  upload-pack> object-format=sha256
>   packet:  upload-pack> object-info
>   packet:  upload-pack> 0000
>
> And then clone asks for it say "yes, I also understand unborn":
>
>   packet:        clone> command=ls-refs
>   packet:        clone> agent=git/2.40.0.824.g7b678b1f643
>   packet:        clone> object-format=sha256
>   packet:        clone> 0001
>   packet:        clone> peel
>   packet:        clone> symrefs
>   packet:        clone> unborn
>   packet:        clone> ref-prefix HEAD
>   packet:        clone> ref-prefix refs/heads/
>   packet:        clone> ref-prefix refs/tags/
>   packet:        clone> 0000
>
> And then upload-pack can send us the extra information:
>
>   packet:  upload-pack> unborn HEAD symref-target:refs/heads/maestro
>   packet:  upload-pack> 0000
>
> I think we'd need to do something similar here for object-format.

I guess we only need to touch "git clone" then.  Without being
asked, it advertsizes object-format=sha256 already, and when the
maestro repository is prepared without --object-format=sha256,
upload-pack advertises object-format=sha1 instead.  So it probably
is just the matter of capturing it and using it to populate the
extensions.objectformat with an appropriate value.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 20:40       ` Junio C Hamano
@ 2023-04-05 21:15         ` Junio C Hamano
  2023-04-05 21:26           ` Jeff King
                             ` (3 more replies)
  2023-04-05 21:23         ` git clone of empty repositories doesn't preserve hash Jeff King
  1 sibling, 4 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-05 21:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Adam Majer, git

Junio C Hamano <gitster@pobox.com> writes:

> I guess we only need to touch "git clone" then.  Without being
> asked, it advertsizes object-format=sha256 already, and when the
> maestro repository is prepared without --object-format=sha256,
> upload-pack advertises object-format=sha1 instead.  So it probably
> is just the matter of capturing it and using it to populate the
> extensions.objectformat with an appropriate value.

It turns out that there was a readily mimickable example in 3d8314f8
(clone: propagate empty remote HEAD even with other branches,
2022-07-07).  The commit lifted code out of a block, in which we
know we are copying from a non-empty repository, to execute also
when talking with an empty repository.  The recording of the
hash-algorithm in the extensions section is done in the same way, so
we can do the same "fix".

----- >8 -----
Subject: [PATCH] clone: propagate object-format when cloning from void

A user could prepare an empty repository and set it to use SHA256 as
the object format.  The new repository created by "git clone" from
such a repository however would not record that it is expecting
objects in the same SHA256 format.  This works as expected if the
source repository is not empty.

Just like we started copying the name of the primary branch from the
remote repository even if it is unborn in 3d8314f8 (clone: propagate
empty remote HEAD even with other branches, 2022-07-07), lift the
code that records the object format out of the block executed only
when cloning from an instantiated repository, so that it works also
when cloning from an empty repository.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/clone.c        | 11 ++++++-----
 t/t5702-protocol-v2.sh | 11 +++++++++++
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git c/builtin/clone.c w/builtin/clone.c
index 462c286274..8f16d18a43 100644
--- c/builtin/clone.c
+++ w/builtin/clone.c
@@ -910,6 +910,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
 	int filter_submodules = 0;
+	int hash_algo;
 
 	struct transport_ls_refs_options transport_ls_refs_options =
 		TRANSPORT_LS_REFS_OPTIONS_INIT;
@@ -1298,15 +1299,15 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-	if (mapped_refs) {
-		int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
-
 		/*
 		 * Now that we know what algorithm the remote side is using,
 		 * let's set ours to the same thing.
 		 */
-		initialize_repository_version(hash_algo, 1);
-		repo_set_hash_algo(the_repository, hash_algo);
+	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
+	initialize_repository_version(hash_algo, 1);
+	repo_set_hash_algo(the_repository, hash_algo);
+
+	if (mapped_refs) {
 		/*
 		 * transport_get_remote_refs() may return refs with null sha-1
 		 * in mapped_refs (see struct transport->get_refs_list
diff --git c/t/t5702-protocol-v2.sh w/t/t5702-protocol-v2.sh
index 71aabe30b7..6af5c2062f 100755
--- c/t/t5702-protocol-v2.sh
+++ w/t/t5702-protocol-v2.sh
@@ -269,6 +269,17 @@ test_expect_success 'clone propagates unborn HEAD from non-empty repo' '
 	grep "warning: remote HEAD refers to nonexistent ref" stderr
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	git clone src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 test_expect_success 'bare clone propagates unborn HEAD from non-empty repo' '
 	test_when_finished "rm -rf file_unborn_parent file_unborn_child.git" &&
 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 20:40       ` Junio C Hamano
  2023-04-05 21:15         ` Junio C Hamano
@ 2023-04-05 21:23         ` Jeff King
  1 sibling, 0 replies; 58+ messages in thread
From: Jeff King @ 2023-04-05 21:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Adam Majer, git

On Wed, Apr 05, 2023 at 01:40:22PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Yeah, we send a special capability line in that case. If you do:
> >
> >   GIT_TRACE_PACKET=1 git clone a b
> >
> > you can see that upload-pack indicates that ls-refs understands the
> > "unborn" capability:
> >
> >   packet:  upload-pack> version 2
> >   packet:  upload-pack> agent=git/2.40.0.824.g7b678b1f643
> >   packet:  upload-pack> ls-refs=unborn
> >   packet:  upload-pack> fetch=shallow wait-for-done
> >   packet:  upload-pack> server-option
> >   packet:  upload-pack> object-format=sha256
> >   packet:  upload-pack> object-info
> >   packet:  upload-pack> 0000
> >
> > And then clone asks for it say "yes, I also understand unborn":
> >
> >   packet:        clone> command=ls-refs
> >   packet:        clone> agent=git/2.40.0.824.g7b678b1f643
> >   packet:        clone> object-format=sha256
> >   packet:        clone> 0001
> >   packet:        clone> peel
> >   packet:        clone> symrefs
> >   packet:        clone> unborn
> >   packet:        clone> ref-prefix HEAD
> >   packet:        clone> ref-prefix refs/heads/
> >   packet:        clone> ref-prefix refs/tags/
> >   packet:        clone> 0000
> >
> > And then upload-pack can send us the extra information:
> >
> >   packet:  upload-pack> unborn HEAD symref-target:refs/heads/maestro
> >   packet:  upload-pack> 0000
> >
> > I think we'd need to do something similar here for object-format.
> 
> I guess we only need to touch "git clone" then.  Without being
> asked, it advertsizes object-format=sha256 already, and when the
> maestro repository is prepared without --object-format=sha256,
> upload-pack advertises object-format=sha1 instead.  So it probably
> is just the matter of capturing it and using it to populate the
> extensions.objectformat with an appropriate value.

Ah, yeah, you're right. I was thinking that capability advertisement was
"by the way, I understand sha256". But I may just be showing my
ignorance of the current state of the hash-transition protocol
extensions. :)

I'm actually surprised this does not Just Work already. The client gets
the intended algorithm from that capability line, which is how we know
how to parse any ls-refs output at all (we are not just guessing from
the length). But we only bother to set it in the local config if we have
refs to fetch, which seems like a bug.

So the solution is maybe something like this:

diff --git a/builtin/clone.c b/builtin/clone.c
index 462c286274c..5eca95cb892 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1296,19 +1296,22 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 			clear_bundle_list(transport->bundles);
 			FREE_AND_NULL(transport->bundles);
 		}
 	}
 
-	if (mapped_refs) {
+	{
 		int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
 
 		/*
 		 * Now that we know what algorithm the remote side is using,
 		 * let's set ours to the same thing.
 		 */
 		initialize_repository_version(hash_algo, 1);
 		repo_set_hash_algo(the_repository, hash_algo);
+	}
+
+	if (mapped_refs) {
 		/*
 		 * transport_get_remote_refs() may return refs with null sha-1
 		 * in mapped_refs (see struct transport->get_refs_list
 		 * comment). In that case we need fetch it early because
 		 * remote_head code below relies on it.

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 21:15         ` Junio C Hamano
@ 2023-04-05 21:26           ` Jeff King
  2023-04-05 22:48           ` brian m. carlson
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Jeff King @ 2023-04-05 21:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Adam Majer, git

On Wed, Apr 05, 2023 at 02:15:33PM -0700, Junio C Hamano wrote:

> ----- >8 -----
> Subject: [PATCH] clone: propagate object-format when cloning from void
> 
> A user could prepare an empty repository and set it to use SHA256 as
> the object format.  The new repository created by "git clone" from
> such a repository however would not record that it is expecting
> objects in the same SHA256 format.  This works as expected if the
> source repository is not empty.
> 
> Just like we started copying the name of the primary branch from the
> remote repository even if it is unborn in 3d8314f8 (clone: propagate
> empty remote HEAD even with other branches, 2022-07-07), lift the
> code that records the object format out of the block executed only
> when cloning from an instantiated repository, so that it works also
> when cloning from an empty repository.

Heh, our mails just crossed, but I stumbled upon the same patch. So yes,
this looks good to me.

I suspect that setting the flag in the_repository might not matter,
since we do not have any refs or objects to manipulate, and what we care
about here is that initialize_repository_version() is run. But I agree
that the two conceptually belong together, and hoisting both out of the
block is the right thing to do.

> diff --git c/t/t5702-protocol-v2.sh w/t/t5702-protocol-v2.sh
> index 71aabe30b7..6af5c2062f 100755
> --- c/t/t5702-protocol-v2.sh
> +++ w/t/t5702-protocol-v2.sh
> @@ -269,6 +269,17 @@ test_expect_success 'clone propagates unborn HEAD from non-empty repo' '
>  	grep "warning: remote HEAD refers to nonexistent ref" stderr
>  '
>  
> +test_expect_success 'clone propagates object-format from empty repo' '
> +	test_when_finished "rm -fr src256 dst256" &&
> +
> +	echo sha256 >expect &&
> +	git init --object-format=sha256 src256 &&
> +	git clone src256 dst256 &&
> +	git -C dst256 rev-parse --show-object-format >actual &&
> +
> +	test_cmp expect actual
> +'

And this test looks straightforward. Very nice.

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 21:15         ` Junio C Hamano
  2023-04-05 21:26           ` Jeff King
@ 2023-04-05 22:48           ` brian m. carlson
  2023-04-06 13:11           ` Adam Majer
  2023-04-25 21:35           ` brian m. carlson
  3 siblings, 0 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-05 22:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Adam Majer, git

[-- Attachment #1: Type: text/plain, Size: 1106 bytes --]

On 2023-04-05 at 21:15:33, Junio C Hamano wrote:
> A user could prepare an empty repository and set it to use SHA256 as
> the object format.  The new repository created by "git clone" from
> such a repository however would not record that it is expecting
> objects in the same SHA256 format.  This works as expected if the
> source repository is not empty.
> 
> Just like we started copying the name of the primary branch from the
> remote repository even if it is unborn in 3d8314f8 (clone: propagate
> empty remote HEAD even with other branches, 2022-07-07), lift the
> code that records the object format out of the block executed only
> when cloning from an instantiated repository, so that it works also
> when cloning from an empty repository.

Yeah, this looks like the right thing to do.  I know this did work
originally, at least for protocol v2, but I may have neglected to add a
test when I wrote it and it regressed.  Thanks for the patch, which I
think is obviously correct, and adding a test for this case.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 21:15         ` Junio C Hamano
  2023-04-05 21:26           ` Jeff King
  2023-04-05 22:48           ` brian m. carlson
@ 2023-04-06 13:11           ` Adam Majer
  2023-04-25 21:35           ` brian m. carlson
  3 siblings, 0 replies; 58+ messages in thread
From: Adam Majer @ 2023-04-06 13:11 UTC (permalink / raw)
  To: Junio C Hamano, Jeff King; +Cc: git

On 4/5/23 23:15, Junio C Hamano wrote:
> Subject: [PATCH] clone: propagate object-format when cloning from void
> 

Thank you, this fixes this issue.

- Adam

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-05 21:15         ` Junio C Hamano
                             ` (2 preceding siblings ...)
  2023-04-06 13:11           ` Adam Majer
@ 2023-04-25 21:35           ` brian m. carlson
  2023-04-25 22:24             ` Junio C Hamano
  2023-04-25 23:12             ` Junio C Hamano
  3 siblings, 2 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-25 21:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Adam Majer, git

[-- Attachment #1: Type: text/plain, Size: 2116 bytes --]

On 2023-04-05 at 21:15:33, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> ----- >8 -----
> Subject: [PATCH] clone: propagate object-format when cloning from void
> 
> A user could prepare an empty repository and set it to use SHA256 as
> the object format.  The new repository created by "git clone" from
> such a repository however would not record that it is expecting
> objects in the same SHA256 format.  This works as expected if the
> source repository is not empty.
> 
> Just like we started copying the name of the primary branch from the
> remote repository even if it is unborn in 3d8314f8 (clone: propagate
> empty remote HEAD even with other branches, 2022-07-07), lift the
> code that records the object format out of the block executed only
> when cloning from an instantiated repository, so that it works also
> when cloning from an empty repository.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

So it looks like this has made it into master and works for v2 but
breaks things for v0 and v1.  I noticed because the Git LFS testsuite is
broken and the following test demonstrates it:

----- %< -----
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 6c8d4c6cf1..3be5057579 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -244,6 +244,17 @@ test_expect_success 'push with ssh:// using protocol v1' '
 	grep "push< version 1" log
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	git clone src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 # Test protocol v1 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
----- %< -----

If nobody looks at this, I'll take a look tomorrow and hopefully send a
patch.  I just wanted to point this out to the list right away in the
interest of getting it noticed.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-25 21:35           ` brian m. carlson
@ 2023-04-25 22:24             ` Junio C Hamano
  2023-04-25 23:12             ` Junio C Hamano
  1 sibling, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-25 22:24 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Jeff King, Adam Majer, git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> So it looks like this has made it into master and works for v2 but
> breaks things for v0 and v1.  I noticed because the Git LFS testsuite is
> broken and the following test demonstrates it:
>
> ----- %< -----
> diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
> index 6c8d4c6cf1..3be5057579 100755
> --- a/t/t5700-protocol-v1.sh
> +++ b/t/t5700-protocol-v1.sh
> @@ -244,6 +244,17 @@ test_expect_success 'push with ssh:// using protocol v1' '
>  	grep "push< version 1" log
>  '
>  
> +test_expect_success 'clone propagates object-format from empty repo' '
> +	test_when_finished "rm -fr src256 dst256" &&
> +
> +	echo sha256 >expect &&
> +	git init --object-format=sha256 src256 &&
> +	git clone src256 dst256 &&
> +	git -C dst256 rev-parse --show-object-format >actual &&
> +
> +	test_cmp expect actual
> +'
> +
>  # Test protocol v1 with 'http://' transport
>  #
>  . "$TEST_DIRECTORY"/lib-httpd.sh
> ----- %< -----

Interesting.  I applied

 - the above addition to the t/t5700

 - reverse of 96f4113a (Merge branch 'jc/clone-object-format-from-void',
   2023-04-11), but only the part outside t/

on top of today's 'master', and tried to run t5700 and t5702 (the
latter has new tests added by 96f4113a to protect the change in the
topic from future breakage for v2).  It seems both t5700 and t5702
fails.

The latter failing is very much expected; the code change reverted
by the above experiment is what made it work in the first place.

But the former not working, after reverting the change, is totally
unexpected---doesn't it mean that the change in question does not
have much to do in the breakage?

In fact, the new test to 5700 applied on top of v2.40.1 fails, too.

Or are you complaining that the fix merged to 'master' only covers
the current protocol and not historical v0/v1 protocol?  I couldn't
tell, and didn't get that impression from the phrasing "breaks",
implying whatever that was "broken" used to be working.

Puzzled.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-25 21:35           ` brian m. carlson
  2023-04-25 22:24             ` Junio C Hamano
@ 2023-04-25 23:12             ` Junio C Hamano
  2023-04-26  0:20               ` brian m. carlson
  2023-04-26 10:51               ` Jeff King
  1 sibling, 2 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-25 23:12 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Jeff King, Adam Majer, git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> If nobody looks at this, I'll take a look tomorrow and hopefully send a
> patch.  I just wanted to point this out to the list right away in the
> interest of getting it noticed.

Thanks.  

The topic in question has been in 'master' for 2 weeks, since it was
merged at 96f4113a (Merge branch 'jc/clone-object-format-from-void',
2023-04-11), and as I said, what your test demonstrates is not a
regression caused by the topic but "was broken, did not get
addressed, is still broken".  So it does not sound like it needs
"right away" kind of attention.

> +test_expect_success 'clone propagates object-format from empty repo' '
> +	test_when_finished "rm -fr src256 dst256" &&
> +
> +	echo sha256 >expect &&
> +	git init --object-format=sha256 src256 &&
> +	git clone src256 dst256 &&

This needs to be at least "git clone --no-local" for it to be v0/v1
protocol test.  It is testing the local optimization codepath.

We could peek the original repository and copy the hash function
name to the target as part of the local cloning.  That obviously
is outside the scope of the earlier fix that worked only at the
protocol level.

And I think even with "--no-local" to make it about v0/v1 protocol,
the outcome is still pretty much expected.  If we make the above
command line to

	GIT_TRACE_PACKET=1 \
	git clone --no-local src256 dst256 &&

to clone over the on-the-wire protocol, then we see

    Cloning into 'dst256'...
    packet:  upload-pack> 0000
    packet:        clone< 0000
    warning: You appear to have cloned an empty repository.
    packet:        clone> 0000
    packet:  upload-pack< 0000
    --- expect      2023-04-25 22:55:39.771850195 +0000
    ...

in the output of "sh t5700-*.sh -i -v".  Without any ref, v0/v1 can
not carry any capabilities, because there is no ref information to
tuck the capabilities on.

I unfortunately doubt that any solution would exist that does not
break compatibility with the deployed clients that expect the
current v0/v1.

Thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-25 23:12             ` Junio C Hamano
@ 2023-04-26  0:20               ` brian m. carlson
  2023-04-26 11:25                 ` Jeff King
  2023-04-26 10:51               ` Jeff King
  1 sibling, 1 reply; 58+ messages in thread
From: brian m. carlson @ 2023-04-26  0:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Adam Majer, git

[-- Attachment #1: Type: text/plain, Size: 1687 bytes --]

On 2023-04-25 at 23:12:15, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > If nobody looks at this, I'll take a look tomorrow and hopefully send a
> > patch.  I just wanted to point this out to the list right away in the
> > interest of getting it noticed.
> 
> Thanks.  
> 
> The topic in question has been in 'master' for 2 weeks, since it was
> merged at 96f4113a (Merge branch 'jc/clone-object-format-from-void',
> 2023-04-11), and as I said, what your test demonstrates is not a
> regression caused by the topic but "was broken, did not get
> addressed, is still broken".  So it does not sound like it needs
> "right away" kind of attention.

In my case, the clone is over HTTP, so this may not be the ideal way to
reproduce it and it may need a better testcase, but it does bisect to
the patch above and it is new in master (and doesn't reproduce in
2.40.0).  Note that in our case in the Git LFS testsuite, we're using
GIT_DEFAULT_HASH=sha256.

I believe what is happening is that for some reason, the object-format
data in v0 and v1 is not being read properly, and so we're now setting
it to sha1 whereas before we were reading the value from the default
setting of the repository (sha256).

It very well may be that it's always been broken and this has just made
it obvious that it's broken, but I'll look tomorrow and probably send a
patch.  I don't think we should revert this change, but I do think we
need to fix it before 2.41, since I think it means right now that all
clones over protocol v0 and v1 end up with a SHA-1 repository.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-25 23:12             ` Junio C Hamano
  2023-04-26  0:20               ` brian m. carlson
@ 2023-04-26 10:51               ` Jeff King
  2023-04-26 15:42                 ` Junio C Hamano
  2023-04-26 20:40                 ` brian m. carlson
  1 sibling, 2 replies; 58+ messages in thread
From: Jeff King @ 2023-04-26 10:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Adam Majer, git

On Tue, Apr 25, 2023 at 04:12:15PM -0700, Junio C Hamano wrote:

> And I think even with "--no-local" to make it about v0/v1 protocol,
> the outcome is still pretty much expected.  If we make the above
> command line to
> 
> 	GIT_TRACE_PACKET=1 \
> 	git clone --no-local src256 dst256 &&
> 
> to clone over the on-the-wire protocol, then we see
> 
>     Cloning into 'dst256'...
>     packet:  upload-pack> 0000
>     packet:        clone< 0000
>     warning: You appear to have cloned an empty repository.
>     packet:        clone> 0000
>     packet:  upload-pack< 0000
>     --- expect      2023-04-25 22:55:39.771850195 +0000
>     ...
> 
> in the output of "sh t5700-*.sh -i -v".  Without any ref, v0/v1 can
> not carry any capabilities, because there is no ref information to
> tuck the capabilities on.
> 
> I unfortunately doubt that any solution would exist that does not
> break compatibility with the deployed clients that expect the
> current v0/v1.

We could send a capabilities^{} line, which Git has supported on the
client side since eb398797cd (connect: advertized capability is not a
ref, 2016-09-09). So sending it should not break even old clients
(though we would have to check what alternate implementations like
libgit2 or dulwich do; we know JGit supports it).

However, the object-format support here was broken until the very recent
13e67aa39b (v0 protocol: fix sha1/sha256 confusion for capabilities^{},
2023-04-14), so it would only be useful going forward (before then we'd
die(), but maybe that is preferable to having the wrong object format?).

I'm not sure it's worth the effort, though. If you want to use sha256
everywhere and tell the other side about it, you need a modern client
anyway, and that means the ability to speak v2. So this would only
matter if for some reason the v2 probe was being ignored (e.g., proxies
eating it, ssh refusing environment variable, etc), which itself are
things that ideally would be fixed (and can maybe one day even go away
if we optimistically default to v2).

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-26  0:20               ` brian m. carlson
@ 2023-04-26 11:25                 ` Jeff King
  2023-04-26 15:08                   ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Jeff King @ 2023-04-26 11:25 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, Adam Majer, git

On Wed, Apr 26, 2023 at 12:20:00AM +0000, brian m. carlson wrote:

> In my case, the clone is over HTTP, so this may not be the ideal way to
> reproduce it and it may need a better testcase, but it does bisect to
> the patch above and it is new in master (and doesn't reproduce in
> 2.40.0).  Note that in our case in the Git LFS testsuite, we're using
> GIT_DEFAULT_HASH=sha256.
> 
> I believe what is happening is that for some reason, the object-format
> data in v0 and v1 is not being read properly, and so we're now setting
> it to sha1 whereas before we were reading the value from the default
> setting of the repository (sha256).

I'm having trouble finding any breakage at all. E.g., this test passes:

diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index 0908534f25..95b10288e7 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -704,4 +704,24 @@ test_expect_success 'no empty path components' '
 	! grep "//" log
 '
 
+test_expect_success 'v0 clone over http recognizes object-format' '
+	git init --bare --object-format=sha256 \
+		"$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+
+	# do not test an empty repo. In v0, we have no way for an
+	# empty server to report its object format, so we would
+	# always default to sha1. We could in theory test that
+	# a client who wants to default to sha256 will realize
+	# the other side is sha1, but we have no way to set that local
+	# default. Unlike git-init, git-clone does not support
+	# --object-format, nor GIT_DEFAULT_HASH.
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" --work-tree=. \
+		commit --allow-empty -m foo &&
+
+	git -c protocol.version=0 clone $HTTPD_URL/smart/sha256.git sha256 &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	echo sha256 >expect &&
+	test_cmp expect actual
+'
+
 test_done

I'd expect it to break in the empty-repo case, for the reasons given in
the comment (v0 cannot communicate object-format in an empty repo). But
that is nothing new.

It sounds from your description that your test is running in a mode
where the client defaults to sha256 (though I'm not sure how, since we
explicitly document that GIT_DEFAULT_HASH should not affect clone), and
then you clone an empty sha256 repository via v0, expecting the result
to be sha256.

But I think that is a wrong expectation, at least from the
client's perspective. An empty repository cannot communicate its
object-format over v0, so the client should assume its v0, and should
then itself become v0. And that last "should itself become" is what
Junio's patch fixed.

The first part, "empty repository cannot communicate its object-format
over v0" is the part is "it's always been broken". We could fix it, but
I'm not sure if it is worth the trouble (see my other message).

> It very well may be that it's always been broken and this has just made
> it obvious that it's broken, but I'll look tomorrow and probably send a
> patch.  I don't think we should revert this change, but I do think we
> need to fix it before 2.41, since I think it means right now that all
> clones over protocol v0 and v1 end up with a SHA-1 repository.

Hopefully my guess at what your test is doing is correct, and I didn't
just leave us off on a tangent. ;)

But if it is, then I think that everything in Git is OK. Non-empty repos
over v0 work correctly both before and after Junio's patch. Empty ones
before his patch were erroneously using sha256 if they preferred it
locally, even when the other side really was sha1. They _also_ were
using sha256 erroneously when the other side was sha256 but wasn't able
to report it (because of v0 limitations). Which is counter-intuitive,
perhaps, but was still the wrong thing for a client to do.

-Peff

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-26 11:25                 ` Jeff King
@ 2023-04-26 15:08                   ` Junio C Hamano
  2023-04-26 15:13                     ` [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone" Junio C Hamano
  2023-04-27  4:46                     ` git clone of empty repositories doesn't preserve hash Jeff King
  0 siblings, 2 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 15:08 UTC (permalink / raw)
  To: brian m. carlson, Jeff King; +Cc: Adam Majer, git

Jeff King <peff@peff.net> writes:

> It sounds from your description that your test is running in a mode
> where the client defaults to sha256 (though I'm not sure how, since we
> explicitly document that GIT_DEFAULT_HASH should not affect clone), and
> then you clone an empty sha256 repository via v0, expecting the result
> to be sha256.

Thanks for coming up with an excellent guess that helped come up
with a reproduction.

With Brian's patch a bit tweaked (attached below), the test does
fail with the current 'master' and passes before the merge in
question.  And the trace clearly shows that without being told
anything about the object format via the capability, the client
chooses to honor GIT_DEFAULT_HASH to initialize the new repository
with sha256.

There is this entry in the release notes of 2.29.0:

 * "git clone" that clones from SHA-1 repository, while
   GIT_DEFAULT_HASH set to use SHA-256 already, resulted in an
   unusable repository that half-claims to be SHA-256 repository
   with SHA-1 objects and refs.  This has been corrected.

This refers to 47ac9703 (builtin/clone: avoid failure with
GIT_DEFAULT_HASH, 2020-09-20).  It seems that the "fix" described
there was incomplete and did not address "clone a void" case, and
the patch under discussion finished the incomplete fix without
knowing by accident.  The test that was added by 47ac9703 to t5601
does use non-empty repositories (one with SHA-1, the other with
SHA-256) and tries to check the interaction with "clone" with the
environment variable.  With the patch under discussion, this test
did not break and "clone" is still ignoring the GIT_DEFAULT_HASH
variable.

Also the description in the document of GIT_DEFAULT_HASH makes
readers ambivalent:

`GIT_DEFAULT_HASH`::
	If this variable is set, the default hash algorithm for new
	repositories will be set to this value. This value is currently
	ignored when cloning; the setting of the remote repository
	is used instead. The default is "sha1". THIS VARIABLE IS
	EXPERIMENTAL! See `--object-format` in linkgit:git-init[1].

To me, "is currently ignored" hints that the author, or somebody
close to the author, of this entry feels that it is a bug that
"clone" does not honor it.  The log message of 47ac9703 also phrases
"failed to honor the GIT_DEFAULT_HASH" as if it were a bad thing [*].

On the other hand, it is clearly documented as experimental so
anything that has been relying on the behaviour of any command with
a particular value set to this variable are waiting to be broken.

I am torn about this.  Even though we may have been very clear that
GIT_DEFAULT_HASH should not kick in in this case, "clone" was buggy
(in the sense that it did not behave as documented in an empty
repository) and whatever thing that changed behaviour that Brian
noticed was ignoring the documentation and taking advantage of that
"bug", and Hyrum's law ensues.

In the longer term, after/when we allow incremental/over-the-wire
migration of object-format, i.e. clone from SHA-1 repository to
create SHA-256 repository (or vice versa) and fetching and pushing
between them would bidirectionally convert the object format on the
fly, it is likely we would teach a new option "--object-format" to
"git clone" to say "you would use whatever object format the origin
uses by default, but this time, I am telling you to use this format
on our side, doing on-the-fly object format conversion as needed".

So it would be OK to make sure that GIT_DEFAULT_HASH is ignored by
"clone" as documented now, and even after on-the-fly object-format
migration is implemented, I would think.


[Footnote]

* ... but I think it was misguided.  What it says there is this:

    And we also don't want to initialize the repository as SHA-1
    initially, since that means if we're cloning an empty
    repository, we'll have failed to honor the GIT_DEFAULT_HASH
    variable and will end up with a SHA-1 repository, not a SHA-256
    repository.

If this were "When we're cloning an empty repository, we'd have
failed to honor the object format the other side has chosen and will
end up with a SHA-1 repository, not a SHA-256 repository.", then it
is very much in line with the reality before the patch under
discussion and also in line with the official stance that "clone"
should not honor GIT_DEFAULT_HASH.

Where the original description breaks down is when the other side is
SHA-1 and this side has GIT_DEFAULT_HASH set to SHA-256.  If we
honored the variable, we'd create a SHA-256 repository that will
talk to SHA-1 repository before the rest of the system is ready.

 t/t5700-protocol-v1.sh | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git i/t/t5700-protocol-v1.sh w/t/t5700-protocol-v1.sh
index 6c8d4c6cf1..2f39dd2e05 100755
--- i/t/t5700-protocol-v1.sh
+++ w/t/t5700-protocol-v1.sh
@@ -244,6 +244,19 @@ test_expect_success 'push with ssh:// using protocol v1' '
 	grep "push< version 1" log
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	GIT_DEFAULT_HASH=sha256 \
+	GIT_TRACE_PACKET=1 \
+	git clone --no-local src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 # Test protocol v1 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone"
  2023-04-26 15:08                   ` Junio C Hamano
@ 2023-04-26 15:13                     ` Junio C Hamano
  2023-04-26 21:06                       ` brian m. carlson
  2023-04-27  4:46                     ` git clone of empty repositories doesn't preserve hash Jeff King
  1 sibling, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 15:13 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Jeff King, Adam Majer

The phrasing "is currently ignored" was prone to be misinterpreted
as if we were wishing if it were honored.  Rephrase it to make it
clear that the experimental variable will be ignored.

In the longer term, after/when we allow incremental/over-the-wire
migration of the object-format, i.e. cloning from an SHA-1
repository to create an SHA-256 repository (or vice versa) and
fetching and pushing between them would bidirectionally convert the
object format on the fly, it is likely that we would teach a new
option "--object-format" to "git clone" to say "you would use
whatever object format the origin uses by default, but this time, I
am telling you to use this format on our side, doing on-the-fly
object format conversion as needed".  So it is perfectly OK to
ignore the settings of this experimental variable, even after such
an extension happens that makes it necessary for us to have a way to
create a new repository that uses different object format from the
origin repository.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * An obvious follow-up to the previous discussion.

 Documentation/git.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git c/Documentation/git.txt w/Documentation/git.txt
index 74973d3cc4..54b043899f 100644
--- c/Documentation/git.txt
+++ w/Documentation/git.txt
@@ -546,9 +546,9 @@ double-quotes and respecting backslash escapes. E.g., the value
 
 `GIT_DEFAULT_HASH`::
 	If this variable is set, the default hash algorithm for new
-	repositories will be set to this value. This value is currently
-	ignored when cloning; the setting of the remote repository
-	is used instead. The default is "sha1". THIS VARIABLE IS
+	repositories will be set to this value. This value is
+	ignored when cloning and the setting of the remote repository
+	is always used. The default is "sha1". THIS VARIABLE IS
 	EXPERIMENTAL! See `--object-format` in linkgit:git-init[1].
 
 Git Commits

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-26 10:51               ` Jeff King
@ 2023-04-26 15:42                 ` Junio C Hamano
  2023-04-26 20:40                 ` brian m. carlson
  1 sibling, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 15:42 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, Adam Majer, git

Jeff King <peff@peff.net> writes:

> We could send a capabilities^{} line, which Git has supported on the
> client side since eb398797cd (connect: advertized capability is not a
> ref, 2016-09-09). So sending it should not break even old clients
> (though we would have to check what alternate implementations like
> libgit2 or dulwich do; we know JGit supports it).

Ah, I forgot all about that JGit workaround.  Yes, we can exploit
it, and any implementation that does not understand it correctly
(including git before 13e67aa3 (v0 protocol: fix sha1/sha256
confusion for capabilities^{}, 2023-04-14)) does not work with JGit
when cloning an empty repository anyway, so it is not all that bad.

But I tend to agree with your conclusion that it may be an update
for the sake of completeness to retrofit v0/v1 on the serving side.
It would not help any real-world use cases all that much.

Thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-26 10:51               ` Jeff King
  2023-04-26 15:42                 ` Junio C Hamano
@ 2023-04-26 20:40                 ` brian m. carlson
  2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
                                     ` (3 more replies)
  1 sibling, 4 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-26 20:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Adam Majer, git

[-- Attachment #1: Type: text/plain, Size: 2072 bytes --]

On 2023-04-26 at 10:51:34, Jeff King wrote:
> We could send a capabilities^{} line, which Git has supported on the
> client side since eb398797cd (connect: advertized capability is not a
> ref, 2016-09-09). So sending it should not break even old clients
> (though we would have to check what alternate implementations like
> libgit2 or dulwich do; we know JGit supports it).

I have a patch which does exactly this, which I will be sending shortly.
I've confirmed that libgit2 and JGit support it, which is unsurprising,
since all of the implementations, Git included, share the same code.  In
addition, this is the behaviour we document as supporting, so all
implementations should support it.

> However, the object-format support here was broken until the very recent
> 13e67aa39b (v0 protocol: fix sha1/sha256 confusion for capabilities^{},
> 2023-04-14), so it would only be useful going forward (before then we'd
> die(), but maybe that is preferable to having the wrong object format?).

I think it's better to die than to silently have the wrong object
format, and it also prevents the problem if other clients using v0 or v1
(which effectively have to be supported for compatibility, while v2 is
optional) try to clone from a fixed server.

> I'm not sure it's worth the effort, though. If you want to use sha256
> everywhere and tell the other side about it, you need a modern client
> anyway, and that means the ability to speak v2. So this would only
> matter if for some reason the v2 probe was being ignored (e.g., proxies
> eating it, ssh refusing environment variable, etc), which itself are
> things that ideally would be fixed (and can maybe one day even go away
> if we optimistically default to v2).

Using v2 everywhere is difficult because many SSH servers still don't
pass GIT_PROTOCOL by default, meaning that we're stuck with v0 and v1.
In retrospect, sending an environment variable here was not a great
decision, but we're stuck with it now.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1
  2023-04-26 20:40                 ` brian m. carlson
@ 2023-04-26 20:53                   ` brian m. carlson
  2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
                                       ` (2 more replies)
  2023-04-27  4:56                   ` git clone of empty repositories doesn't preserve hash Jeff King
                                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-26 20:53 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

We recently fixed empty clones with SHA-256 over protocol v2 by
honouring the hash algorithm specified even when no refs are present.
However, in doing so, we made it impossible to set up a v0 or v1
repository by cloning from an empty SHA-256 repository.  In doing so, we
also broke the Git LFS testsuite for SHA-256 repositories.

This series consists of two patches.  The first introduces the dummy
`capabilities^{}` entry for fetches and clones from an empty repository
for v0 and v1, just as we do for clones.  This is already supported by
older versions of Git, as well as libgit2 and JGit.

The second introduces some backwards compatibility to avoid regressing
the old behaviour of using GIT_DEFAULT_HASH to initialize the proper
hash in this case.  We add a flag to see if we explicitly obtained a
hash algorithm from the remote side, and if not, we honour
GIT_DEFAULT_HASH, as before.

brian m. carlson (2):
  http: advertise capabilities when cloning empty repos
  Honor GIT_DEFAULT_HASH for empty clones without remote algo

 Documentation/git.txt       | 10 +++++++---
 builtin/clone.c             |  8 +++++---
 connect.c                   |  5 ++++-
 pkt-line.h                  |  2 ++
 t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
 t/t5700-protocol-v1.sh      | 32 ++++++++++++++++++++++++++++++--
 transport-helper.c          |  1 +
 transport.c                 | 14 ++++++++++++++
 transport.h                 | 14 ++++++++++++++
 upload-pack.c               |  4 ++++
 10 files changed, 108 insertions(+), 9 deletions(-)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 1/2] http: advertise capabilities when cloning empty repos
  2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
@ 2023-04-26 20:53                     ` brian m. carlson
  2023-04-26 21:14                       ` Junio C Hamano
  2023-04-27  5:30                       ` Jeff King
  2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
  2023-04-26 21:12                     ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
  2 siblings, 2 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-26 20:53 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

From: "brian m. carlson" <bk2204@github.com>

When cloning an empty repository, the HTTP protocol version 0 currently
offers nothing but the header and flush packets for the /info/refs
endpoint. This means that no capabilities are provided, so the client
side doesn't know what capabilities are present.

However, this does pose a problem when working with SHA-256
repositories, since we use the capabilities to know the remote side's
object format (hash algorithm).  It used to be possible to set the
correct algorithm with `GIT_DEFAULT_HASH` (which is what the Git LFS
testsuite did), but this no longer works as of 8b214c2e9d ("clone:
propagate object-format when cloning from void", 2023-04-05), since
there we always read the hash algorithm from the remote.  If there is no
hash algorithm provided, we default to SHA-1 for backwards
compatibility.

Fortunately, the push version of the protocol already indicates a clue
for how to solve this.  When the /info/refs endpoint is accessed for a
push and the remote is empty, we include a dummy "capabilities^{}" ref
pointing to the all-zeros object ID.  The protocol documentation already
indicates this should _always_ be sent, even for fetches and clones, so
let's just do that, which means we'll properly announce the hash
algorithm as part of the capabilities.  This just works with the
existing code because we share the same ref code for fetches and clones,
and libgit2 does as well.

Signed-off-by: brian m. carlson <bk2204@github.com>
---
 t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
 t/t5700-protocol-v1.sh      | 21 +++++++++++++++++++--
 upload-pack.c               |  4 ++++
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index 0908534f25..21b7767cbd 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -611,6 +611,33 @@ test_expect_success 'client falls back from v2 to v0 to match server' '
 	grep symref=HEAD:refs/heads/ trace
 '
 
+test_expect_success 'create empty http-accessible SHA-256 repository' '
+	mkdir "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+	(cd "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+	 git --bare init --object-format=sha256
+	)
+'
+
+test_expect_success 'clone empty SHA-256 repository with protocol v2' '
+	rm -fr sha256 &&
+	echo sha256 >expected &&
+	git -c protocol.version=2 clone "$HTTPD_URL/smart/sha256.git" &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp actual expected &&
+	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'clone empty SHA-256 repository with protocol v0' '
+	rm -fr sha256 &&
+	echo sha256 >expected &&
+	GIT_TRACE=1 GIT_TRACE_PACKET=1 git -c protocol.version=0 clone "$HTTPD_URL/smart/sha256.git" &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp actual expected &&
+	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
+	test_must_be_empty actual
+'
+
 test_expect_success 'passing hostname resolution information works' '
 	BOGUS_HOST=gitbogusexamplehost.invalid &&
 	BOGUS_HTTPD_URL=$HTTPD_PROTO://$BOGUS_HOST:$LIB_HTTPD_PORT &&
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 6c8d4c6cf1..3cd9db9012 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -249,10 +249,12 @@ test_expect_success 'push with ssh:// using protocol v1' '
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-test_expect_success 'create repo to be served by http:// transport' '
+test_expect_success 'create repos to be served by http:// transport' '
 	git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
 	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" config http.receivepack true &&
-	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one
+	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one &&
+	git init --object-format=sha256 "$HTTPD_DOCUMENT_ROOT_PATH/sha256" &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/sha256" config http.receivepack true
 '
 
 test_expect_success 'clone with http:// using protocol v1' '
@@ -269,6 +271,21 @@ test_expect_success 'clone with http:// using protocol v1' '
 	grep "git< version 1" log
 '
 
+test_expect_success 'clone with http:// using protocol v1 with empty SHA-256 repo' '
+	GIT_TRACE_PACKET=1 GIT_TRACE_CURL=1 git -c protocol.version=1 \
+		clone "$HTTPD_URL/smart/sha256" sha256 2>log &&
+
+	cat log &&
+	echo sha256 >expect &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp expect actual &&
+
+	# Client requested to use protocol v1
+	grep "Git-Protocol: version=1" log &&
+	# Server responded using protocol v1
+	grep "git< version 1" log
+'
+
 test_expect_success 'fetch with http:// using protocol v1' '
 	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" two &&
 
diff --git a/upload-pack.c b/upload-pack.c
index 08633dc121..5ef9b162b6 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -120,6 +120,7 @@ struct upload_pack_data {
 	unsigned allow_ref_in_want : 1;				/* v2 only */
 	unsigned allow_sideband_all : 1;			/* v2 only */
 	unsigned advertise_sid : 1;
+	unsigned sent_capabilities : 1;
 };
 
 static void upload_pack_data_init(struct upload_pack_data *data)
@@ -1240,6 +1241,7 @@ static int send_ref(const char *refname, const struct object_id *oid,
 			     git_user_agent_sanitized());
 		strbuf_release(&symref_info);
 		strbuf_release(&session_id);
+		data->sent_capabilities = 1;
 	} else {
 		packet_fwrite_fmt(stdout, "%s %s\n", oid_to_hex(oid), refname_nons);
 	}
@@ -1379,6 +1381,8 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
 			data.no_done = 1;
 		head_ref_namespaced(send_ref, &data);
 		for_each_namespaced_ref(send_ref, &data);
+		if (!data.sent_capabilities && advertise_refs)
+			send_ref("capabilities^{}", null_oid(), 0, &data);
 		/*
 		 * fflush stdout before calling advertise_shallow_grafts because send_ref
 		 * uses stdio.

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo
  2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
  2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
@ 2023-04-26 20:53                     ` brian m. carlson
  2023-04-26 21:18                       ` Junio C Hamano
  2023-04-26 21:33                       ` Junio C Hamano
  2023-04-26 21:12                     ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
  2 siblings, 2 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-26 20:53 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

From: "brian m. carlson" <bk2204@github.com>

The previous commit introduced a change that allows HTTP v0 and v1
operations to determine the hash of an empty remote repository by
sending capabilities.  However, there are still some cases, such as when
cloning locally, where the capabilities are not sent.  This is because
for local operations, we don't strip out the fake "capabilities^{}" ref,
and thus "git ls-remote" would produce incorrect values if we did.

However, up until 8b214c2e9d ("clone: propagate object-format when
cloning from void", 2023-04-05), we honored GIT_DEFAULT_HASH in this
case, so let's continue to do that.  Check whether the hash algorithm
was explicitly set, and if so, continue to use that value.  If not, use
the default value for GIT_DEFAULT_HASH to ensure that we can at least
properly configure an empty clone whose hash algorithm we know.

Note that without this patch, git clone cannot create a SHA-256
repository from an empty remote without protocol v2 (except over HTTP,
as in the previous patch).
---
 Documentation/git.txt  | 10 +++++++---
 builtin/clone.c        |  8 +++++---
 connect.c              |  5 ++++-
 pkt-line.h             |  2 ++
 t/t5700-protocol-v1.sh | 11 +++++++++++
 transport-helper.c     |  1 +
 transport.c            | 14 ++++++++++++++
 transport.h            | 14 ++++++++++++++
 8 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/Documentation/git.txt b/Documentation/git.txt
index 74973d3cc4..48eda9f883 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -547,9 +547,13 @@ double-quotes and respecting backslash escapes. E.g., the value
 `GIT_DEFAULT_HASH`::
 	If this variable is set, the default hash algorithm for new
 	repositories will be set to this value. This value is currently
-	ignored when cloning; the setting of the remote repository
-	is used instead. The default is "sha1". THIS VARIABLE IS
-	EXPERIMENTAL! See `--object-format` in linkgit:git-init[1].
+	ignored when cloning if the remote value can be definitively
+	determined; the setting of the remote repository is used
+	instead. The value is honored if the remote repository's
+	algorithm cannot be determined, such as some cases when
+	the remote repository is empty. The default is "sha1".
+	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
+	in linkgit:git-init[1].
 
 Git Commits
 ~~~~~~~~~~~
diff --git a/builtin/clone.c b/builtin/clone.c
index 186845ef0b..c207798de9 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1316,13 +1316,15 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (transport_get_hash_algo_explicit(transport)) {
 		/*
 		 * Now that we know what algorithm the remote side is using,
 		 * let's set ours to the same thing.
 		 */
-	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
-	initialize_repository_version(hash_algo, 1);
-	repo_set_hash_algo(the_repository, hash_algo);
+		hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
+		initialize_repository_version(hash_algo, 1);
+		repo_set_hash_algo(the_repository, hash_algo);
+	}
 
 	if (mapped_refs) {
 		/*
diff --git a/connect.c b/connect.c
index 3a0186280c..40cb9bf261 100644
--- a/connect.c
+++ b/connect.c
@@ -243,8 +243,10 @@ static void process_capabilities(struct packet_reader *reader, int *linelen)
 	if (feat_val) {
 		char *hash_name = xstrndup(feat_val, feat_len);
 		int hash_algo = hash_algo_by_name(hash_name);
-		if (hash_algo != GIT_HASH_UNKNOWN)
+		if (hash_algo != GIT_HASH_UNKNOWN) {
 			reader->hash_algo = &hash_algos[hash_algo];
+			reader->hash_algo_explicit = 1;
+		}
 		free(hash_name);
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
@@ -493,6 +495,7 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 		if (hash_algo == GIT_HASH_UNKNOWN)
 			die(_("unknown object format '%s' specified by server"), hash_name);
 		reader->hash_algo = &hash_algos[hash_algo];
+		reader->hash_algo_explicit = 1;
 		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
diff --git a/pkt-line.h b/pkt-line.h
index 8e9846f315..10700a9d8c 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -190,6 +190,8 @@ struct packet_reader {
 	int line_peeked;
 
 	unsigned use_sideband : 1;
+	/* indicates if we saw an explicit capability */
+	unsigned hash_algo_explicit : 1;
 	const char *me;
 
 	/* hash algorithm in use */
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 3cd9db9012..ad24c7fe64 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -244,6 +244,17 @@ test_expect_success 'push with ssh:// using protocol v1' '
 	grep "push< version 1" log
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	GIT_DEFAULT_HASH=sha256 git -c protocol.version=1 clone --no-local src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 # Test protocol v1 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
diff --git a/transport-helper.c b/transport-helper.c
index 6b816940dc..c65cf7c620 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1236,6 +1236,7 @@ static struct ref *get_refs_list_using_list(struct transport *transport,
 					die(_("unsupported object format '%s'"),
 					    value);
 				transport->hash_algo = &hash_algos[algo];
+				transport->hash_algo_explicit = 1;
 			}
 			continue;
 		}
diff --git a/transport.c b/transport.c
index 67afdae57c..7774487e8d 100644
--- a/transport.c
+++ b/transport.c
@@ -147,6 +147,12 @@ static void get_refs_from_bundle_inner(struct transport *transport)
 		die(_("could not read bundle '%s'"), transport->url);
 
 	transport->hash_algo = data->header.hash_algo;
+	/*
+	 * This is always set, even if we didn't get an explicit object-format
+	 * capability, since we know that a missing capability or a v2 bundle
+	 * definitively indicates SHA-1.
+	 */
+	transport->hash_algo_explicit = 1;
 }
 
 static struct ref *get_refs_from_bundle(struct transport *transport,
@@ -190,6 +196,7 @@ static int fetch_refs_from_bundle(struct transport *transport,
 	ret = unbundle(the_repository, &data->header, data->fd,
 		       &extra_index_pack_args, 0);
 	transport->hash_algo = data->header.hash_algo;
+	transport->hash_algo_explicit = 1;
 	return ret;
 }
 
@@ -360,6 +367,7 @@ static struct ref *handshake(struct transport *transport, int for_push,
 	}
 	data->finished_handshake = 1;
 	transport->hash_algo = reader.hash_algo;
+	transport->hash_algo_explicit = reader.hash_algo_explicit;
 
 	if (reader.line_peeked)
 		BUG("buffer must be empty at the end of handshake()");
@@ -1190,6 +1198,7 @@ struct transport *transport_get(struct remote *remote, const char *url)
 	}
 
 	ret->hash_algo = &hash_algos[GIT_HASH_SHA1];
+	ret->hash_algo_explicit = 0;
 
 	return ret;
 }
@@ -1199,6 +1208,11 @@ const struct git_hash_algo *transport_get_hash_algo(struct transport *transport)
 	return transport->hash_algo;
 }
 
+int transport_get_hash_algo_explicit(struct transport *transport)
+{
+	return transport->hash_algo_explicit;
+}
+
 int transport_set_option(struct transport *transport,
 			 const char *name, const char *value)
 {
diff --git a/transport.h b/transport.h
index 6393cd9823..ce67eefc58 100644
--- a/transport.h
+++ b/transport.h
@@ -128,6 +128,11 @@ struct transport {
 	 * in transport_set_verbosity().
 	 **/
 	unsigned progress : 1;
+	/*
+	 * Indicates whether the hash algorithm was initialized explicitly as
+	 * opposed to using a fallback.
+	 */
+	unsigned hash_algo_explicit : 1;
 	/*
 	 * If transport is at least potentially smart, this points to
 	 * git_transport_options structure to use in case transport
@@ -305,6 +310,15 @@ int transport_get_remote_bundle_uri(struct transport *transport);
  * This can only be called after fetching the remote refs.
  */
 const struct git_hash_algo *transport_get_hash_algo(struct transport *transport);
+/*
+ * Fetch whether the hash algorithm provided was explicitly set.
+ *
+ * If this value is false, "transport_get_hash_algo" will always return a value
+ * of SHA-1, which is the default algorithm if none is specified.
+ *
+ * This can only be called after fetching the remote refs.
+ */
+int transport_get_hash_algo_explicit(struct transport *transport);
 int transport_fetch_refs(struct transport *transport, struct ref *refs);
 
 /*

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone"
  2023-04-26 15:13                     ` [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone" Junio C Hamano
@ 2023-04-26 21:06                       ` brian m. carlson
  0 siblings, 0 replies; 58+ messages in thread
From: brian m. carlson @ 2023-04-26 21:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Adam Majer

[-- Attachment #1: Type: text/plain, Size: 1429 bytes --]

On 2023-04-26 at 15:13:55, Junio C Hamano wrote:
> The phrasing "is currently ignored" was prone to be misinterpreted
> as if we were wishing if it were honored.  Rephrase it to make it
> clear that the experimental variable will be ignored.
> 
> In the longer term, after/when we allow incremental/over-the-wire
> migration of the object-format, i.e. cloning from an SHA-1
> repository to create an SHA-256 repository (or vice versa) and
> fetching and pushing between them would bidirectionally convert the
> object format on the fly, it is likely that we would teach a new
> option "--object-format" to "git clone" to say "you would use
> whatever object format the origin uses by default, but this time, I
> am telling you to use this format on our side, doing on-the-fly
> object format conversion as needed".  So it is perfectly OK to
> ignore the settings of this experimental variable, even after such
> an extension happens that makes it necessary for us to have a way to
> create a new repository that uses different object format from the
> origin repository.

I have a different proposal which clarifies when it will and will not be
honoured in my series.  I think we would want to honour this variable
once we have SHA-1 and SHA-256 interop, and can convert on the fly, so I
think keeping the "currently" here is a good idea.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1
  2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
  2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
  2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
@ 2023-04-26 21:12                     ` Junio C Hamano
  2 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 21:12 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Adam Majer

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> The second introduces some backwards compatibility to avoid regressing
> the old behaviour of using GIT_DEFAULT_HASH to initialize the proper
> hash in this case.  We add a flag to see if we explicitly obtained a
> hash algorithm from the remote side, and if not, we honour
> GIT_DEFAULT_HASH, as before.

I am fairly negative on this half of the series.  The first one is
excellent, though.

THanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] http: advertise capabilities when cloning empty repos
  2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
@ 2023-04-26 21:14                       ` Junio C Hamano
  2023-04-26 21:28                         ` brian m. carlson
  2023-04-27  5:30                       ` Jeff King
  1 sibling, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 21:14 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Adam Majer

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> From: "brian m. carlson" <bk2204@github.com>
>
> When cloning an empty repository, the HTTP protocol version 0 currently
> offers nothing but the header and flush packets for the /info/refs
> endpoint. This means that no capabilities are provided, so the client
> side doesn't know what capabilities are present.
>
> However, this does pose a problem when working with SHA-256
> repositories, since we use the capabilities to know the remote side's
> object format (hash algorithm).  It used to be possible to set the
> correct algorithm with `GIT_DEFAULT_HASH` (which is what the Git LFS
> testsuite did), but this no longer works as of 8b214c2e9d ("clone:

"this no longer works as of" -> "this was a mistake and was fixed by".

> propagate object-format when cloning from void", 2023-04-05), since
> there we always read the hash algorithm from the remote.  If there is no
> hash algorithm provided, we default to SHA-1 for backwards
> compatibility.

Other than that, looks good to me.

Thanks.  Will queue.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo
  2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
@ 2023-04-26 21:18                       ` Junio C Hamano
  2023-04-26 21:33                       ` Junio C Hamano
  1 sibling, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 21:18 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Adam Majer

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> However, up until 8b214c2e9d ("clone: propagate object-format when
> cloning from void", 2023-04-05), we honored GIT_DEFAULT_HASH in this
> case, so let's continue to do that.

Let's not.  Once we identified the bug of mistakenly honoring a
wrong variable, let's fix it and keep it fixed.

The local optimization, if necessary, can be taught to peek the
source repository and propagating the object format selection to the
destination repository.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] http: advertise capabilities when cloning empty repos
  2023-04-26 21:14                       ` Junio C Hamano
@ 2023-04-26 21:28                         ` brian m. carlson
  2023-04-27  5:00                           ` Jeff King
  0 siblings, 1 reply; 58+ messages in thread
From: brian m. carlson @ 2023-04-26 21:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Adam Majer

[-- Attachment #1: Type: text/plain, Size: 1707 bytes --]

On 2023-04-26 at 21:14:37, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > From: "brian m. carlson" <bk2204@github.com>
> >
> > When cloning an empty repository, the HTTP protocol version 0 currently
> > offers nothing but the header and flush packets for the /info/refs
> > endpoint. This means that no capabilities are provided, so the client
> > side doesn't know what capabilities are present.
> >
> > However, this does pose a problem when working with SHA-256
> > repositories, since we use the capabilities to know the remote side's
> > object format (hash algorithm).  It used to be possible to set the
> > correct algorithm with `GIT_DEFAULT_HASH` (which is what the Git LFS
> > testsuite did), but this no longer works as of 8b214c2e9d ("clone:
> 
> "this no longer works as of" -> "this was a mistake and was fixed by".

I tend to disagree.  While I agree that change is valuable because it
fixes v2, which we want, it does cause a change in user-visible
behaviour, which broke the Git LFS testsuite.  Whether we like things
working that way or not, clearly there were people relying on it.

Fortunately, in that case, Git LFS can just enable protocol v2 and
things work again, but I think "this no longer works" is accurate and
more neutral, and addresses the issue.  We wouldn't have to deal with
that issue if we could gracefully handle git clone --local with older
versions of the protocol, but one of the tests fails when we do that.
I'll take some more time to see if I can come up with a nice way to
gracefully handle that, and if so, I'll send a v2.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo
  2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
  2023-04-26 21:18                       ` Junio C Hamano
@ 2023-04-26 21:33                       ` Junio C Hamano
  2023-04-27  5:43                         ` Jeff King
  1 sibling, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2023-04-26 21:33 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Adam Majer

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>  `GIT_DEFAULT_HASH`::
>  	If this variable is set, the default hash algorithm for new
>  	repositories will be set to this value. This value is currently
> +	ignored when cloning if the remote value can be definitively
> +	determined; the setting of the remote repository is used
> +	instead. The value is honored if the remote repository's
> +	algorithm cannot be determined, such as some cases when
> +	the remote repository is empty. The default is "sha1".
> +	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> +	in linkgit:git-init[1].

We'd need to evantually cover all the transports (and non-transport
like the "--local" optimization) so that the object-format and other
choices are communicated from the origin to a new clone anyway, so
this extra complexity "until X is fixed, it behaves this way, but
otherwise the variable is read in the meantime" may be a disservice
to the end users, even though it may make it easier in the shorter
term for maintainers of programs that rely on the buggy "git clone"
that partially honored this environment variable.

In short, I am still not convinced that the above is a good design
choice in the longer term.

Thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-26 15:08                   ` Junio C Hamano
  2023-04-26 15:13                     ` [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone" Junio C Hamano
@ 2023-04-27  4:46                     ` Jeff King
  1 sibling, 0 replies; 58+ messages in thread
From: Jeff King @ 2023-04-27  4:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Adam Majer, git

On Wed, Apr 26, 2023 at 08:08:28AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It sounds from your description that your test is running in a mode
> > where the client defaults to sha256 (though I'm not sure how, since we
> > explicitly document that GIT_DEFAULT_HASH should not affect clone), and
> > then you clone an empty sha256 repository via v0, expecting the result
> > to be sha256.
> 
> Thanks for coming up with an excellent guess that helped come up
> with a reproduction.
> 
> With Brian's patch a bit tweaked (attached below), the test does
> fail with the current 'master' and passes before the merge in
> question.  And the trace clearly shows that without being told
> anything about the object format via the capability, the client
> chooses to honor GIT_DEFAULT_HASH to initialize the new repository
> with sha256.

Ah, OK. The fact that we do respect GIT_DEFAULT_HASH there solves my
remaining confusion. And in retrospect it makes sense, because clone is
going to use the equivalent of "git init" under the hood. So we'll start
with _something_, and then adjust it based on what we read from the
other side. And that "something" respects GIT_DEFAULT_HASH.

Sort of a side note:

  This behavior is somewhat due to the fact that clone is implemented as
  "init, then fetch into the new repository". It could also conceptually
  be "talk to the remote, then init, then actually fetch". It never
  mattered before, but with options like object-format, the "init" step
  may be affected by things the remote said.

  So in our world, clone has to "fix up" parts of the repository that
  were initialized by tweaking the object-format config of the live
  repo. In a world where the very first thing it did was talk to the
  remote, then it would pass the option along to init in the first
  place.

  It's probably not worth trying to re-architect clone, though (not just
  in terms of work, but who knows what other subtle assumptions are
  baked into the current ordering). And it's not like it _solves_ this
  issue. It just might have made finding it a little less confusing.

> [Footnote]
> 
> * ... but I think it was misguided.  What it says there is this:
> 
>     And we also don't want to initialize the repository as SHA-1
>     initially, since that means if we're cloning an empty
>     repository, we'll have failed to honor the GIT_DEFAULT_HASH
>     variable and will end up with a SHA-1 repository, not a SHA-256
>     repository.
> 
> If this were "When we're cloning an empty repository, we'd have
> failed to honor the object format the other side has chosen and will
> end up with a SHA-1 repository, not a SHA-256 repository.", then it
> is very much in line with the reality before the patch under
> discussion and also in line with the official stance that "clone"
> should not honor GIT_DEFAULT_HASH.
> 
> Where the original description breaks down is when the other side is
> SHA-1 and this side has GIT_DEFAULT_HASH set to SHA-256.  If we
> honored the variable, we'd create a SHA-256 repository that will
> talk to SHA-1 repository before the rest of the system is ready.

Exactly. We are (and always have been) wrong 50% of the time in an empty
repo for v0. Your recent patch fixed v2 in an empty repo, but in doing
so flipped which halves of v0 were wrong.

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: git clone of empty repositories doesn't preserve hash
  2023-04-26 20:40                 ` brian m. carlson
  2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
@ 2023-04-27  4:56                   ` Jeff King
  2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
  2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
  3 siblings, 0 replies; 58+ messages in thread
From: Jeff King @ 2023-04-27  4:56 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, Adam Majer, git

On Wed, Apr 26, 2023 at 08:40:48PM +0000, brian m. carlson wrote:

> On 2023-04-26 at 10:51:34, Jeff King wrote:
> > We could send a capabilities^{} line, which Git has supported on the
> > client side since eb398797cd (connect: advertized capability is not a
> > ref, 2016-09-09). So sending it should not break even old clients
> > (though we would have to check what alternate implementations like
> > libgit2 or dulwich do; we know JGit supports it).
> 
> I have a patch which does exactly this, which I will be sending shortly.
> I've confirmed that libgit2 and JGit support it, which is unsurprising,
> since all of the implementations, Git included, share the same code.  In
> addition, this is the behaviour we document as supporting, so all
> implementations should support it.

Yeah, I was worried about how accurate that "should" is. :)

Since you checked the others, I peeked at dulwich's code, and it appears
to support it, too (I didn't actually run a test, but the code is pretty
clear). It doesn't support sha256 at all, but that's OK. What I'd be
concerned about is breaking clients when there is no useful capability
to advertise (though of course we could decide to send capabilities^{}
only when there is something useful to say).

> > However, the object-format support here was broken until the very recent
> > 13e67aa39b (v0 protocol: fix sha1/sha256 confusion for capabilities^{},
> > 2023-04-14), so it would only be useful going forward (before then we'd
> > die(), but maybe that is preferable to having the wrong object format?).
> 
> I think it's better to die than to silently have the wrong object
> format, and it also prevents the problem if other clients using v0 or v1
> (which effectively have to be supported for compatibility, while v2 is
> optional) try to clone from a fixed server.

OK, good. So we can ignore that recently fixed bug (which only affects
the case where older versions are cloning something whose hash format
does not match theirs).

> > I'm not sure it's worth the effort, though. If you want to use sha256
> > everywhere and tell the other side about it, you need a modern client
> > anyway, and that means the ability to speak v2. So this would only
> > matter if for some reason the v2 probe was being ignored (e.g., proxies
> > eating it, ssh refusing environment variable, etc), which itself are
> > things that ideally would be fixed (and can maybe one day even go away
> > if we optimistically default to v2).
> 
> Using v2 everywhere is difficult because many SSH servers still don't
> pass GIT_PROTOCOL by default, meaning that we're stuck with v0 and v1.
> In retrospect, sending an environment variable here was not a great
> decision, but we're stuck with it now.

True (I even got bit by this at one point, not realizing that I wasn't
using v2 with one of my personal servers). I certainly don't have an
objection if you're willing to do the work.

It would also allow us to fix the long-broken "unborn HEAD" problem for
v0 in an empty repo (though I am also fine if we don't go that far).

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] http: advertise capabilities when cloning empty repos
  2023-04-26 21:28                         ` brian m. carlson
@ 2023-04-27  5:00                           ` Jeff King
  0 siblings, 0 replies; 58+ messages in thread
From: Jeff King @ 2023-04-27  5:00 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, git, Adam Majer

On Wed, Apr 26, 2023 at 09:28:31PM +0000, brian m. carlson wrote:

> > > However, this does pose a problem when working with SHA-256
> > > repositories, since we use the capabilities to know the remote side's
> > > object format (hash algorithm).  It used to be possible to set the
> > > correct algorithm with `GIT_DEFAULT_HASH` (which is what the Git LFS
> > > testsuite did), but this no longer works as of 8b214c2e9d ("clone:
> > 
> > "this no longer works as of" -> "this was a mistake and was fixed by".
> 
> I tend to disagree.  While I agree that change is valuable because it
> fixes v2, which we want, it does cause a change in user-visible
> behaviour, which broke the Git LFS testsuite.  Whether we like things
> working that way or not, clearly there were people relying on it.
> 
> Fortunately, in that case, Git LFS can just enable protocol v2 and
> things work again, but I think "this no longer works" is accurate and
> more neutral, and addresses the issue.  We wouldn't have to deal with
> that issue if we could gracefully handle git clone --local with older
> versions of the protocol, but one of the tests fails when we do that.
> I'll take some more time to see if I can come up with a nice way to
> gracefully handle that, and if so, I'll send a v2.

Reiterating what I said upthread, I think it was always 50% broken.
Taking the local hash format over the remote one was always the wrong
thing to do, but it sometimes worked out (because we happened to match
the remote).

But the opposite case:

  git init --object-format=sha1 dst.git
  GIT_DEFAULT_HASH=sha256 git clone dst.git

would previously have done the wrong thing. We just flipped which half
was broken and which was not.

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] http: advertise capabilities when cloning empty repos
  2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
  2023-04-26 21:14                       ` Junio C Hamano
@ 2023-04-27  5:30                       ` Jeff King
  2023-04-27 20:40                         ` Junio C Hamano
  1 sibling, 1 reply; 58+ messages in thread
From: Jeff King @ 2023-04-27  5:30 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Junio C Hamano, Adam Majer

On Wed, Apr 26, 2023 at 08:53:23PM +0000, brian m. carlson wrote:

> From: "brian m. carlson" <bk2204@github.com>
> 
> When cloning an empty repository, the HTTP protocol version 0 currently
> offers nothing but the header and flush packets for the /info/refs
> endpoint. This means that no capabilities are provided, so the client
> side doesn't know what capabilities are present.

Is this really an HTTP problem?

If I do:

  git init --bare --object-format=sha256 remote.git
  git -c protocol.version=0 clone --bare remote.git local.git
  git -C local.git rev-parse --show-object-format

I will get sha1, which is wrong. Likewise with GIT_DEFAULT_HASH=sha256
on the clone (after Junio's recent patch), regardless of what the server
claims. This is really a git-protocol issue that affects all transports.

So I think in this hunk:

> @@ -1379,6 +1381,8 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
>  			data.no_done = 1;
>  		head_ref_namespaced(send_ref, &data);
>  		for_each_namespaced_ref(send_ref, &data);
> +		if (!data.sent_capabilities && advertise_refs)
> +			send_ref("capabilities^{}", null_oid(), 0, &data);
>  		/*
>  		 * fflush stdout before calling advertise_shallow_grafts because send_ref
>  		 * uses stdio.

you would want to drop the "&& advertise_refs" bit, after which both of
the cases above would yield a sha256 repository.

There is one other catch, though. Doing as I suggest results in a
failure in t5509, because the new code does not interact correctly with
namespaces. That is true of your version, as well; it's just that the
test suite does not cover the combination of namespaces, http, and empty
repos.

The issue is that send_ref() will try to strip the namespace, and end up
with NULL (which on my glibc system ends up with a ref named "(null)",
but obviously could segfault, too).

Something like this fixes it:

diff --git a/environment.c b/environment.c
index 8a96997539..37cd66b295 100644
--- a/environment.c
+++ b/environment.c
@@ -234,6 +234,8 @@ const char *get_git_namespace(void)
 const char *strip_namespace(const char *namespaced_ref)
 {
 	const char *out;
+	if (!strcmp(namespaced_ref, "capabilities^{}"))
+		return namespaced_ref; /* magic ref */
 	if (skip_prefix(namespaced_ref, get_git_namespace(), &out))
 		return out;
 	return NULL;

but I suspect it would be cleaner to refactor send_ref() to allow
sending a name more directly.

(As an aside, it feels like send_ref() is also wrong not to check for
NULL from strip_namespace(), but I guess in practice we do not feed
it names outside of the namespace. Might be a good candidate for a BUG()
check or other assertion).

> +test_expect_success 'clone empty SHA-256 repository with protocol v0' '
> +	rm -fr sha256 &&
> +	echo sha256 >expected &&
> +	GIT_TRACE=1 GIT_TRACE_PACKET=1 git -c protocol.version=0 clone "$HTTPD_URL/smart/sha256.git" &&
> +	git -C sha256 rev-parse --show-object-format >actual &&
> +	test_cmp actual expected &&
> +	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
> +	test_must_be_empty actual
> +'

This looks reasonable, though I think if we do not need HTTP to
demonstrate the issue (and I don't think we do), then we should probably
avoid it, just to get test coverage on platforms that don't support
HTTP.

-Peff

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo
  2023-04-26 21:33                       ` Junio C Hamano
@ 2023-04-27  5:43                         ` Jeff King
  2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
  0 siblings, 1 reply; 58+ messages in thread
From: Jeff King @ 2023-04-27  5:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, git, Adam Majer

On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >  `GIT_DEFAULT_HASH`::
> >  	If this variable is set, the default hash algorithm for new
> >  	repositories will be set to this value. This value is currently
> > +	ignored when cloning if the remote value can be definitively
> > +	determined; the setting of the remote repository is used
> > +	instead. The value is honored if the remote repository's
> > +	algorithm cannot be determined, such as some cases when
> > +	the remote repository is empty. The default is "sha1".
> > +	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > +	in linkgit:git-init[1].
> 
> We'd need to evantually cover all the transports (and non-transport
> like the "--local" optimization) so that the object-format and other
> choices are communicated from the origin to a new clone anyway, so
> this extra complexity "until X is fixed, it behaves this way, but
> otherwise the variable is read in the meantime" may be a disservice
> to the end users, even though it may make it easier in the shorter
> term for maintainers of programs that rely on the buggy "git clone"
> that partially honored this environment variable.
> 
> In short, I am still not convinced that the above is a good design
> choice in the longer term.

I also think it is working against the backwards-compatible design of
the hash function transition. If we do not see an object-format line
from the remote, then either:

  1. They sent us capabilities, but it did not include object-format. So
     if we are in GIT_DEFAULT_HASH=sha256 mode locally, but the other
     side is an older version of Git (or even a current version of other
     implementations, like Dulwich) that do not send object-format at
     all, then we will not correctly fall back to assuming they are
     sha1. In a non-empty repo, this means we'll fail to parse their ref
     advertisement (we'll expect sha256 hashes but get sha1), and
     cloning will be broken.

  2. They did not send us capabilities, because the repo is empty (and
     the server does not have brian's patch 1). The hash transition doc
     says we're supposed to assume they're sha1. It's _sort of_
     academic, in that they also are not telling us about any refs on
     their side. But we may end up mis-matched with the server (again,
     this is the 50/50 thing; we don't know what their format is).
     Presumably that bites us later when we try to push up new objects
     (but would eventually work when we support interop).

I think handling (2) is iffy as a goal, but the collateral damage of (1)
is a complete show-stopper for this patch. If we wanted to do (2) by
itself, we'd have to distinguish "did they even send us a capabilities
line" as a separate case (but I tend to agree with you that it is not
worth doing for now).

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] http: advertise capabilities when cloning empty repos
  2023-04-27  5:30                       ` Jeff King
@ 2023-04-27 20:40                         ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-04-27 20:40 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git, Adam Majer

Jeff King <peff@peff.net> writes:

> So I think in this hunk:
>
>> @@ -1379,6 +1381,8 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
>>  			data.no_done = 1;
>>  		head_ref_namespaced(send_ref, &data);
>>  		for_each_namespaced_ref(send_ref, &data);
>> +		if (!data.sent_capabilities && advertise_refs)
>> +			send_ref("capabilities^{}", null_oid(), 0, &data);
>>  		/*
>>  		 * fflush stdout before calling advertise_shallow_grafts because send_ref
>>  		 * uses stdio.
>
> you would want to drop the "&& advertise_refs" bit, after which both of
> the cases above would yield a sha256 repository.

Good suggestion.

>> +test_expect_success 'clone empty SHA-256 repository with protocol v0' '
>> +	rm -fr sha256 &&
>> +	echo sha256 >expected &&
>> +	GIT_TRACE=1 GIT_TRACE_PACKET=1 git -c protocol.version=0 clone "$HTTPD_URL/smart/sha256.git" &&
>> +	git -C sha256 rev-parse --show-object-format >actual &&
>> +	test_cmp actual expected &&
>> +	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
>> +	test_must_be_empty actual
>> +'
>
> This looks reasonable, though I think if we do not need HTTP to
> demonstrate the issue (and I don't think we do), then we should probably
> avoid it, just to get test coverage on platforms that don't support
> HTTP.

HTTP tests tend to be more cumbersome to set up and harder to debug
than the plain vanilla "over the pipe on the same machine"
transport, so I tend to agree with the statement.

They however represent a more common use case, so having HTTP tests
in addition to non-HTTP tests would be nicer, if we can afford to.

Thanks.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-04-26 20:40                 ` brian m. carlson
  2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
  2023-04-27  4:56                   ` git clone of empty repositories doesn't preserve hash Jeff King
@ 2023-05-01 17:00                   ` brian m. carlson
  2023-05-01 17:00                     ` [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
  2023-05-01 17:37                     ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
  2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
  3 siblings, 2 replies; 58+ messages in thread
From: brian m. carlson @ 2023-05-01 17:00 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

We recently fixed empty clones with SHA-256 over protocol v2 by
honouring the hash algorithm specified even when no refs are present.
However, in doing so, we made it impossible to set up a v0 or v1
repository by cloning from an empty SHA-256 repository.  In doing so, we
also broke the Git LFS testsuite for SHA-256 repositories.

This series introduces the dummy `capabilities^{}` entry for fetches and
clones from an empty repository for v0 and v1, just as we do for clones.
This is already supported by older versions of Git, as well as libgit2,
dulwich, and JGit.

Unlike in v1, we wire this up for all protocols and fix the NULL pointer
dereference that would occur in that case, as well as add some more
tests.  The second patch has been dropped, since it is no longer needed
and was not very popular.

Changes since v1:
* Drop patch to honour GIT_DEFAULT_HASH
* Support all requests, not just HTTP.
* Add more tests.
* Fix NULL pointer dereference.

brian m. carlson (1):
  upload-pack: advertise capabilities when cloning empty repos

 t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
 t/t5700-protocol-v1.sh      | 31 +++++++++++++++++++++++++++++--
 upload-pack.c               |  7 ++++++-
 3 files changed, 62 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos
  2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
@ 2023-05-01 17:00                     ` brian m. carlson
  2023-05-01 22:40                       ` Jeff King
  2023-05-01 17:37                     ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
  1 sibling, 1 reply; 58+ messages in thread
From: brian m. carlson @ 2023-05-01 17:00 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

From: "brian m. carlson" <bk2204@github.com>

When cloning an empty repository, protocol versions 0 and 1 currently
offer nothing but the header and flush packets for the /info/refs
endpoint. This means that no capabilities are provided, so the client
side doesn't know what capabilities are present.

However, this does pose a problem when working with SHA-256
repositories, since we use the capabilities to know the remote side's
object format (hash algorithm).  As of 8b214c2e9d ("clone: propagate
object-format when cloning from void", 2023-04-05), this has been fixed
for protocol v2, since there we always read the hash algorithm from the
remote.

Fortunately, the push version of the protocol already indicates a clue
for how to solve this.  When the /info/refs endpoint is accessed for a
push and the remote is empty, we include a dummy "capabilities^{}" ref
pointing to the all-zeros object ID.  The protocol documentation already
indicates this should _always_ be sent, even for fetches and clones, so
let's just do that, which means we'll properly announce the hash
algorithm as part of the capabilities.  This just works with the
existing code because we share the same ref code for fetches and clones,
and libgit2, JGit, and dulwich do as well.

There is one minor issue to fix, though.  When we call send_ref with
namespaces, we would return NULL with the capabilities entry, which
would cause a crash.  Instead, let's make sure we don't try to strip the
namespace if we're using our special capabilities entry.

Add several sets of tests for HTTP as well as for local clones.

Signed-off-by: brian m. carlson <bk2204@github.com>
---
 t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
 t/t5700-protocol-v1.sh      | 31 +++++++++++++++++++++++++++++--
 upload-pack.c               |  7 ++++++-
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index 0908534f25..21b7767cbd 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -611,6 +611,33 @@ test_expect_success 'client falls back from v2 to v0 to match server' '
 	grep symref=HEAD:refs/heads/ trace
 '
 
+test_expect_success 'create empty http-accessible SHA-256 repository' '
+	mkdir "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+	(cd "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+	 git --bare init --object-format=sha256
+	)
+'
+
+test_expect_success 'clone empty SHA-256 repository with protocol v2' '
+	rm -fr sha256 &&
+	echo sha256 >expected &&
+	git -c protocol.version=2 clone "$HTTPD_URL/smart/sha256.git" &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp actual expected &&
+	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'clone empty SHA-256 repository with protocol v0' '
+	rm -fr sha256 &&
+	echo sha256 >expected &&
+	GIT_TRACE=1 GIT_TRACE_PACKET=1 git -c protocol.version=0 clone "$HTTPD_URL/smart/sha256.git" &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp actual expected &&
+	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
+	test_must_be_empty actual
+'
+
 test_expect_success 'passing hostname resolution information works' '
 	BOGUS_HOST=gitbogusexamplehost.invalid &&
 	BOGUS_HTTPD_URL=$HTTPD_PROTO://$BOGUS_HOST:$LIB_HTTPD_PORT &&
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 6c8d4c6cf1..a73b4d4ff6 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -244,15 +244,28 @@ test_expect_success 'push with ssh:// using protocol v1' '
 	grep "push< version 1" log
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	git clone --no-local src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 # Test protocol v1 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-test_expect_success 'create repo to be served by http:// transport' '
+test_expect_success 'create repos to be served by http:// transport' '
 	git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
 	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" config http.receivepack true &&
-	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one
+	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one &&
+	git init --object-format=sha256 "$HTTPD_DOCUMENT_ROOT_PATH/sha256" &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/sha256" config http.receivepack true
 '
 
 test_expect_success 'clone with http:// using protocol v1' '
@@ -269,6 +282,20 @@ test_expect_success 'clone with http:// using protocol v1' '
 	grep "git< version 1" log
 '
 
+test_expect_success 'clone with http:// using protocol v1 with empty SHA-256 repo' '
+	GIT_TRACE_PACKET=1 GIT_TRACE_CURL=1 git -c protocol.version=1 \
+		clone "$HTTPD_URL/smart/sha256" sha256 2>log &&
+
+	echo sha256 >expect &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp expect actual &&
+
+	# Client requested to use protocol v1
+	grep "Git-Protocol: version=1" log &&
+	# Server responded using protocol v1
+	grep "git< version 1" log
+'
+
 test_expect_success 'fetch with http:// using protocol v1' '
 	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" two &&
 
diff --git a/upload-pack.c b/upload-pack.c
index 08633dc121..d7b31d0527 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -120,6 +120,7 @@ struct upload_pack_data {
 	unsigned allow_ref_in_want : 1;				/* v2 only */
 	unsigned allow_sideband_all : 1;			/* v2 only */
 	unsigned advertise_sid : 1;
+	unsigned sent_capabilities : 1;
 };
 
 static void upload_pack_data_init(struct upload_pack_data *data)
@@ -1212,7 +1213,8 @@ static int send_ref(const char *refname, const struct object_id *oid,
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow deepen-since deepen-not"
 		" deepen-relative no-progress include-tag multi_ack_detailed";
-	const char *refname_nons = strip_namespace(refname);
+	const char *refname_nons = !strcmp(refname, "capabilities^{}") ?
+				   refname : strip_namespace(refname);
 	struct object_id peeled;
 	struct upload_pack_data *data = cb_data;
 
@@ -1240,6 +1242,7 @@ static int send_ref(const char *refname, const struct object_id *oid,
 			     git_user_agent_sanitized());
 		strbuf_release(&symref_info);
 		strbuf_release(&session_id);
+		data->sent_capabilities = 1;
 	} else {
 		packet_fwrite_fmt(stdout, "%s %s\n", oid_to_hex(oid), refname_nons);
 	}
@@ -1379,6 +1382,8 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
 			data.no_done = 1;
 		head_ref_namespaced(send_ref, &data);
 		for_each_namespaced_ref(send_ref, &data);
+		if (!data.sent_capabilities)
+			send_ref("capabilities^{}", null_oid(), 0, &data);
 		/*
 		 * fflush stdout before calling advertise_shallow_grafts because send_ref
 		 * uses stdio.

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
  2023-05-01 17:00                     ` [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
@ 2023-05-01 17:37                     ` Junio C Hamano
  1 sibling, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-05-01 17:37 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Adam Majer

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> Changes since v1:
> * Support all requests, not just HTTP.
> * Add more tests.
> * Fix NULL pointer dereference.

Great.  Will queue.

Let's merge it down to 'next' soonish.

Thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos
  2023-05-01 17:00                     ` [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
@ 2023-05-01 22:40                       ` Jeff King
  2023-05-01 22:51                         ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Jeff King @ 2023-05-01 22:40 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Junio C Hamano, Adam Majer

On Mon, May 01, 2023 at 05:00:18PM +0000, brian m. carlson wrote:

> There is one minor issue to fix, though.  When we call send_ref with
> namespaces, we would return NULL with the capabilities entry, which
> would cause a crash.  Instead, let's make sure we don't try to strip the
> namespace if we're using our special capabilities entry.

Thanks, this hunk:

> @@ -1212,7 +1213,8 @@ static int send_ref(const char *refname, const struct object_id *oid,
>  	static const char *capabilities = "multi_ack thin-pack side-band"
>  		" side-band-64k ofs-delta shallow deepen-since deepen-not"
>  		" deepen-relative no-progress include-tag multi_ack_detailed";
> -	const char *refname_nons = strip_namespace(refname);
> +	const char *refname_nons = !strcmp(refname, "capabilities^{}") ?
> +				   refname : strip_namespace(refname);
>  	struct object_id peeled;
>  	struct upload_pack_data *data = cb_data;

looks much better than sticking it in strip_namespace() as I did
earlier. I did wonder about refactoring further:

diff --git a/upload-pack.c b/upload-pack.c
index d7b31d0527..e1d75d7c3c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1207,19 +1207,17 @@ static void format_session_id(struct strbuf *buf, struct upload_pack_data *d) {
 		strbuf_addf(buf, " session-id=%s", trace2_session_id());
 }
 
-static int send_ref(const char *refname, const struct object_id *oid,
-		    int flag UNUSED, void *cb_data)
+static void write_v0_ref(struct upload_pack_data *data,
+			 const char *refname, const char *refname_nons,
+			 const struct object_id *oid)
 {
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow deepen-since deepen-not"
 		" deepen-relative no-progress include-tag multi_ack_detailed";
-	const char *refname_nons = !strcmp(refname, "capabilities^{}") ?
-				   refname : strip_namespace(refname);
 	struct object_id peeled;
-	struct upload_pack_data *data = cb_data;
 
 	if (mark_our_ref(refname_nons, refname, oid, &data->hidden_refs))
-		return 0;
+		return;
 
 	if (capabilities) {
 		struct strbuf symref_info = STRBUF_INIT;
@@ -1249,6 +1247,12 @@ static int send_ref(const char *refname, const struct object_id *oid,
 	capabilities = NULL;
 	if (!peel_iterated_oid(oid, &peeled))
 		packet_fwrite_fmt(stdout, "%s %s^{}\n", oid_to_hex(&peeled), refname_nons);
+}
+
+static int send_ref(const char *refname, const struct object_id *oid,
+		    int flag UNUSED, void *cb_data)
+{
+	write_v0_ref(cb_data, refname, strip_namespace(refname), oid);
 	return 0;
 }
 
@@ -1382,8 +1386,10 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
 			data.no_done = 1;
 		head_ref_namespaced(send_ref, &data);
 		for_each_namespaced_ref(send_ref, &data);
-		if (!data.sent_capabilities)
-			send_ref("capabilities^{}", null_oid(), 0, &data);
+		if (!data.sent_capabilities) {
+			const char *ref = "capabilities^{}";
+			write_v0_ref(&data, ref, ref, null_oid());
+		}
 		/*
 		 * fflush stdout before calling advertise_shallow_grafts because send_ref
 		 * uses stdio.

which avoids doing an extra strcmp() on every ref. But probably it is
not that big a deal either way.

> Add several sets of tests for HTTP as well as for local clones.

This part puzzled me a bit. There's a local test in t5700, which is
good. But it also gets HTTP tests. What do they offer versus the ones in
t5551 (or vice versa)?

-Peff

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos
  2023-05-01 22:40                       ` Jeff King
@ 2023-05-01 22:51                         ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-05-01 22:51 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git, Adam Majer

Jeff King <peff@peff.net> writes:

> @@ -1382,8 +1386,10 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
>  			data.no_done = 1;
>  		head_ref_namespaced(send_ref, &data);
>  		for_each_namespaced_ref(send_ref, &data);
> -		if (!data.sent_capabilities)
> -			send_ref("capabilities^{}", null_oid(), 0, &data);
> +		if (!data.sent_capabilities) {
> +			const char *ref = "capabilities^{}";
> +			write_v0_ref(&data, ref, ref, null_oid());
> +		}

Ah, this separation of duties wrt the namespace stripping makes the
result easier to read.

The version brian posted looked good enough to me, but if we are to
have another iteration, incorporating this change would be nice.

Thanks.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Is GIT_DEFAULT_HASH flawed?
  2023-04-27  5:43                         ` Jeff King
@ 2023-05-02 23:46                           ` Felipe Contreras
  2023-05-03  9:03                             ` Adam Majer
                                               ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Felipe Contreras @ 2023-05-02 23:46 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: brian m. carlson, git, Adam Majer

Hi,

Changing the subject as this message seems like a different topic.

Jeff King wrote:
> On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > 
> > >  `GIT_DEFAULT_HASH`::
> > >  	If this variable is set, the default hash algorithm for new
> > >  	repositories will be set to this value. This value is currently
> > > +	ignored when cloning if the remote value can be definitively
> > > +	determined; the setting of the remote repository is used
> > > +	instead. The value is honored if the remote repository's
> > > +	algorithm cannot be determined, such as some cases when
> > > +	the remote repository is empty. The default is "sha1".
> > > +	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > +	in linkgit:git-init[1].
> > 
> > We'd need to evantually cover all the transports (and non-transport
> > like the "--local" optimization) so that the object-format and other
> > choices are communicated from the origin to a new clone anyway, so
> > this extra complexity "until X is fixed, it behaves this way, but
> > otherwise the variable is read in the meantime" may be a disservice
> > to the end users, even though it may make it easier in the shorter
> > term for maintainers of programs that rely on the buggy "git clone"
> > that partially honored this environment variable.
> > 
> > In short, I am still not convinced that the above is a good design
> > choice in the longer term.
> 
> I also think it is working against the backwards-compatible design of
> the hash function transition.

To be honest this whole approach seems to be completely flawed to me and
against the whole design of git in the first place.

In a recent email Linus Torvalds explained why object ids were
calculated based {type, size, data} [1], and he explained very clearly
that two objects with exactly the same data are not supposed to have the
same id if the type is different.

If even the tiniest change such as adding a period to a commit messange
changes the object id (and thus semantically makes it a different
object), then it makes sense that changing the type of an object also
changes the object id (and thus it's also a different object).

And because the id of the parent is included in the content of every
commit, the top-level id ensures the integrity of the whole graph.

But then comes this notion that the hash algorithm is a property of the
repository, and not part of the object storage, which means changing the
whole hash algorithm of a repository is considered less of a change than
adding a period to the commit message, worse: not a change at all.

I am reminded of the warning Sam Smith gave to the Git project [2] which
seemed to be unheard, but the notion of cryptographic algorithm agility
makes complete sense to me.

In my view one repository should be able to have part SHA-1 history,
part SHA3-256 history, and part BLAKE2b history.

Changing the hash algorithm of one commit should change the object id of
that commit, and thus make it semantically a different commit.

In other words: an object of type "blob" should never be confused with
an object of type "blob:sha-256", even if the content is exactly the
same.

The fact that apparently it's so easy to clone a repository with
the wrong hash algorithm should give developers pause, as it means the
whole point of using cryptographic hash algorithms to ensure the
integrity of the commit history is completely gone.

I have not been following the SHA-1 -> OID discussions, but I
distinctively recall Linus Torvalds mentioning that the choice of using
SHA-1 wasn't even for security purposes, it was to ensure integrity.
When I do a `git fetch` as long as the new commits have the same SHA-1
as parent as the SHA-1s I have in my repository I can be relatively
certain the repository has not been tampered with. Which means that if I
do a `git fetch` that suddenly brings SHA-256 commits, some of them must
have SHA-1 parents that match the ones I currently have. Otherwise how
do I know it's the same history?

Maybe that's one of the reasons people don't seem particularly eager to
move away from SHA-1:

Better the SHA-1 you know, than the SHA-256 you don't.

Cheers.

[1] https://lore.kernel.org/git/CAHk-=wjr-CMLX2Jo2++rwcv0VNr+HmZqXEVXNsJGiPRUwNxzBQ@mail.gmail.com/
[2] https://lore.kernel.org/git/D433038A-2643-4F63-8677-CA8AB6904AE1@samuelsmith.org/

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
@ 2023-05-03  9:03                             ` Adam Majer
  2023-05-03 15:44                               ` Felipe Contreras
  2023-05-03  9:09                             ` demerphq
  2023-05-03 22:54                             ` brian m. carlson
  2 siblings, 1 reply; 58+ messages in thread
From: Adam Majer @ 2023-05-03  9:03 UTC (permalink / raw)
  To: Felipe Contreras, Jeff King, Junio C Hamano; +Cc: brian m. carlson, git

On 5/3/23 01:46, Felipe Contreras wrote:
> To be honest this whole approach seems to be completely flawed to me and
> against the whole design of git in the first place.

The discussion above is mostly moot now since this has been fixed in 
later patches in this thread, AFAIK. It's also moot for other reasons, 
like the hash function transition plan is not really implemented, yet.

Also, this was about corner-case, like it often is.


> In a recent email Linus Torvalds explained why object ids were
> calculated based {type, size, data} [1], and he explained very clearly
> that two objects with exactly the same data are not supposed to have the
> same id if the type is different.

This is different. But aside, type + size + data are not really much 
different from just having data in a hash function. There are plenty of 
hash collisions where

     HASH(type + size + data) == HASH(type + size + data')

by definition of how these functions work. The problem is always in 
finding these collisions. But anyway...

> In my view one repository should be able to have part SHA-1 history,
> part SHA3-256 history, and part BLAKE2b history.

Yes, that would be great. Please provide patch series for this :-)

> I have not been following the SHA-1 -> OID discussions, but I
> distinctively recall Linus Torvalds mentioning that the choice of using
> SHA-1 wasn't even for security purposes, it was to ensure integrity.

These are different sides of the same coin. Hashes are used to provide 
integrity. Hashes like MD4, MD5, SHA1, SHA256 are there for integrity. 
Some of these are no longer recommended and some are completely broken.

> Better the SHA-1 you know, than the SHA-256 you don't.

Wrong conclusion ;) Also, we know SHA-256

The problem in git-core and virtually all clients and other 
implementations is/was that SHA1 was hardcoded and assumed to be THE ONE 
and ONLY hash. It will take quite a bit of work outside of git-core to 
remove this one assumption (remember two digit year and 2000? - yes I'm 
old). Once this hash assumption is removed, you can start talking about 
adding other hashes and interop.

Keep in mind -- hashes are there for object reference. They are the glue 
in git. But there is really nothing stopping us from recalculating them 
"on the fly". If you have SHA1 repo, you can calculate a SHA256 or 
whatever hash for any type object. That's not the problem, conceptually 
speaking.

Finally, let not have a "bike shed" discussion about this. The 
GIT_DEFAULT_HASH is meant to be used by `git init` in-lieu of 
--object-format parameter, so it's not flawed. When used in other 
applications, it probably indicates a bug. But we can't fix all the bugs 
at once :-)

Cheers,
- Adam

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
  2023-05-03  9:03                             ` Adam Majer
@ 2023-05-03  9:09                             ` demerphq
  2023-05-03 18:20                               ` Felipe Contreras
  2023-05-03 22:54                             ` brian m. carlson
  2 siblings, 1 reply; 58+ messages in thread
From: demerphq @ 2023-05-03  9:09 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Jeff King, Junio C Hamano, brian m. carlson, git, Adam Majer

On Wed, 3 May 2023 at 02:17, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
>
> Hi,
>
> Changing the subject as this message seems like a different topic.
>
> Jeff King wrote:
> > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > >
> > > >  `GIT_DEFAULT_HASH`::
> > > >   If this variable is set, the default hash algorithm for new
> > > >   repositories will be set to this value. This value is currently
> > > > + ignored when cloning if the remote value can be definitively
> > > > + determined; the setting of the remote repository is used
> > > > + instead. The value is honored if the remote repository's
> > > > + algorithm cannot be determined, such as some cases when
> > > > + the remote repository is empty. The default is "sha1".
> > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > > + in linkgit:git-init[1].
> > >
> > > We'd need to evantually cover all the transports (and non-transport
> > > like the "--local" optimization) so that the object-format and other
> > > choices are communicated from the origin to a new clone anyway, so
> > > this extra complexity "until X is fixed, it behaves this way, but
> > > otherwise the variable is read in the meantime" may be a disservice
> > > to the end users, even though it may make it easier in the shorter
> > > term for maintainers of programs that rely on the buggy "git clone"
> > > that partially honored this environment variable.
> > >
> > > In short, I am still not convinced that the above is a good design
> > > choice in the longer term.
> >
> > I also think it is working against the backwards-compatible design of
> > the hash function transition.
>
> To be honest this whole approach seems to be completely flawed to me and
> against the whole design of git in the first place.
>
> In a recent email Linus Torvalds explained why object ids were
> calculated based {type, size, data} [1], and he explained very clearly
> that two objects with exactly the same data are not supposed to have the
> same id if the type is different.

He said:

--- quote-begin ---
The "no aliasing" means that no two distinct pointers can point to the
same data. So a tagged pointer of type "commit" can not point to the
same object as a tagged pointer of type "blob". They are distinct
pointers, even if (maybe) the commit object encoding ends up then
being identical to a blob object.
--- quote-end ---

As far as I could tell he didn't really explain *why* he wanted this,
and IMO it is non-obvious why he would care if a blob and a commit had
the same text, and thus the same ID. He just said he didnt want it to
happen, not why. I can imagine some aesthetic reasons why you might
want to ensure that no blob has the same ID as a commit, and I can
imagine it might make debugging easier at certain points, but it seems
unnecessary given the data is write once.

> If even the tiniest change such as adding a period to a commit messange
> changes the object id (and thus semantically makes it a different
> object), then it makes sense that changing the type of an object also
> changes the object id (and thus it's also a different object).
>
> And because the id of the parent is included in the content of every
> commit, the top-level id ensures the integrity of the whole graph.
>
> But then comes this notion that the hash algorithm is a property of the
> repository, and not part of the object storage, which means changing the
> whole hash algorithm of a repository is considered less of a change than
> adding a period to the commit message, worse: not a change at all.

I really dont understand why you think having two hash functions
producing different results for the same data is comparable to a
single hash producing different results for different data. In one
case you have two different continuum of identifiers, with one ID per
continuum, and in the other you have two different identifiers in the
same continuum, and  if you a continuum you would have 4 different
identifiers right? Eg, the two cases are really quite different at a
fundamental level.

> I am reminded of the warning Sam Smith gave to the Git project [2] which
> seemed to be unheard, but the notion of cryptographic algorithm agility
> makes complete sense to me.
>
> In my view one repository should be able to have part SHA-1 history,
> part SHA3-256 history, and part BLAKE2b history.

Isn't this orthagonal to your other points?

> Changing the hash algorithm of one commit should change the object id of
> that commit, and thus make it semantically a different commit.
>
> In other words: an object of type "blob" should never be confused with
> an object of type "blob:sha-256", even if the content is exactly the
> same.

This doesn't make sense to me.  As long as we can distinguish the
hashes produced by the different hash functions in use we can create a
mapping of the data that is hashed such that we have a 1:1 mapping of
identifiers of each type at which point it really doesn't matter which
hash function is used.

> The fact that apparently it's so easy to clone a repository with
> the wrong hash algorithm should give developers pause, as it means the
> whole point of using cryptographic hash algorithms to ensure the
> integrity of the commit history is completely gone.

This is a leap too far. The fact that it is "so easy to clone a repo
with the wrong hash algorithm" is completely orthogonal to the
fundamental principles of hash identifiers from strong hash functions.
You seem to be deriving grand conclusions from what sounds to me like
a simple bug/design-oversight.

> I have not been following the SHA-1 -> OID discussions, but I
> distinctively recall Linus Torvalds mentioning that the choice of using
> SHA-1 wasn't even for security purposes, it was to ensure integrity.
> When I do a `git fetch` as long as the new commits have the same SHA-1
> as parent as the SHA-1s I have in my repository I can be relatively
> certain the repository has not been tampered with. Which means that if I
> do a `git fetch` that suddenly brings SHA-256 commits, some of them must
> have SHA-1 parents that match the ones I currently have. Otherwise how
> do I know it's the same history?

So consider what /could/ happen here. You fetch a commit which uses
SHA-256 into a repo where all of your local commits use SHA-1. The
commit you fetched says its parent is some SHA-256 ID you don't know
about as all your ID's are SHA-1. So git then could go and construct
an index, hashing each item using SHA-256 instead of SHA-1, and using
the result to build a bi-directional mapping from SHA-1 to SHA-256 and
back.  All it has to do then is look into the mapping to find if the
SHA-256 parent id is present in your repo. If it is then you know it's
the same history.

The key point here is that if you ignore SHAttered artifacts (which
seems reasonable as you can detect the attack during hashing)  you can
build a 1:1 map of SHA-1 and SHA-256 ids.  Once you have that mapping
it doesn't matter which ID is used.

> Maybe that's one of the reasons people don't seem particularly eager to
> move away from SHA-1:

Maybe, but it doesn't make sense to me.  You seem to be putting undue
weight on an unnecessary aspect of the git design: there doesn't seem
to be a reason for Linuses "no aliasing" policy, and it seems like one
could build a git-a-like without it without suffering any significant
penalties. Regardless, provided that the hash functions allow a 1:1
mapping of ID's (which is assumed by using "collision free hash
functions"), it seems like it really doesn't matter which hash is used
at any given time.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-03  9:03                             ` Adam Majer
@ 2023-05-03 15:44                               ` Felipe Contreras
  2023-05-03 17:21                                 ` Adam Majer
  0 siblings, 1 reply; 58+ messages in thread
From: Felipe Contreras @ 2023-05-03 15:44 UTC (permalink / raw)
  To: Adam Majer, Felipe Contreras, Jeff King, Junio C Hamano
  Cc: brian m. carlson, git

Adam Majer wrote:
> On 5/3/23 01:46, Felipe Contreras wrote:
> > To be honest this whole approach seems to be completely flawed to me and
> > against the whole design of git in the first place.
> 
> The discussion above is mostly moot now since this has been fixed in 
> later patches in this thread, AFAIK.

That particular isssue might be fixed, but that issue should never have
happened in the first place if the design was correct.

A bad design makes certain errors prone to happen, a good design makes the same
errors happen rarely, a great design makes those errors impossible.

Git was designed to make it *impossible* to confuse two commits with similar
data.

The symptom might have been fixed, that doesn't mean there's no underlying
problem.

> It's also moot for other reasons, like the hash function transition plan is
> not really implemented, yet.

The implemention of the plan isn't the problem, it's the plan itself.

> Also, this was about corner-case, like it often is.

A corner-case that should be impossible.

> > In a recent email Linus Torvalds explained why object ids were
> > calculated based {type, size, data} [1], and he explained very clearly
> > that two objects with exactly the same data are not supposed to have the
> > same id if the type is different.
> 
> This is different. But aside, type + size + data are not really much 
> different from just having data in a hash function.

It's completely different.

> There are plenty of hash collisions where
> 
>      HASH(type + size + data) == HASH(type + size + data')
> 
> by definition of how these functions work. The problem is always in 
> finding these collisions. But anyway...

I don't think you understand why Linus Torvalds chose to hash objects.

> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> Yes, that would be great. Please provide patch series for this :-)

I have hundreds of patches being ignored, why would I write yet another patch
series that will be ignored?

> > I have not been following the SHA-1 -> OID discussions, but I
> > distinctively recall Linus Torvalds mentioning that the choice of using
> > SHA-1 wasn't even for security purposes, it was to ensure integrity.
> 
> These are different sides of the same coin. Hashes are used to provide 
> integrity. Hashes like MD4, MD5, SHA1, SHA256 are there for integrity. 
> Some of these are no longer recommended and some are completely broken.

There are different philosophical views of what "security" means, and it seems
pretty clear to me that your view does not align with the view of Linus
Torvalds.

> > Better the SHA-1 you know, than the SHA-256 you don't.
> 
> Wrong conclusion ;) Also, we know SHA-256

You don't understand what is being said.

Which hash is more trustworthy?

 a. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
 b. d891b12414e1d9331f8cbb15acfe690671974f27ba76e2b423294cfb7a055f2f

If you answer b just beacuse it's SHA-256 you don't understand security.

b is a random commit I generated, a is the current git.git master.

A SHA-1 hash from a source you trust is inifinitely more trustworthy than a
random SHA-256 hash. Even a known MD5 hash is better in this respect.

> Keep in mind -- hashes are there for object reference.

No. I don't think you understand why Linus Torvalds used hashes.

> If you have SHA1 repo, you can calculate a SHA256 or whatever hash for any
> type object.

I know it *can* be done, I understand how hash algorithms work, but just
because something *can* be done doesn't mean it *should*.

You *can* generate a SHA-1 of a blob's data, instead of a SHA-1 of a blob's
`type + size + data`, does that mean we should? No.

> Finally, let not have a "bike shed" discussion about this.

Discussing the original design of git's object storage which has withstood the
test of time for 18 years is not "bike sheding".

I don't even think you understand what I'm trying to say.

---

Why do you think these commands generate different hashes?

  git hash-object -t blob /dev/null
  git hash-object -t tree /dev/null

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-03 15:44                               ` Felipe Contreras
@ 2023-05-03 17:21                                 ` Adam Majer
  2023-05-08  0:34                                   ` Felipe Contreras
  0 siblings, 1 reply; 58+ messages in thread
From: Adam Majer @ 2023-05-03 17:21 UTC (permalink / raw)
  To: Felipe Contreras, Jeff King, Junio C Hamano; +Cc: brian m. carlson, git



On May 3, 2023 5:44:24 p.m. GMT+02:00, Felipe Contreras <felipe.contreras@gmail.com> wrote:
>Git was designed to make it *impossible* to confuse two commits with similar
>data.

That was never ever the problem here.



>> This is different. But aside, type + size + data are not really much 
>> different from just having data in a hash function.
>
>It's completely different.

How so? Type and size are just about 2 and a dozen bits of data, respectfully.


>There are different philosophical views of what "security" means, and it seems
>pretty clear to me that your view does not align with the view of Linus
>Torvalds.


I'm not sure why you are name dropping Linus everywhere or assuming you know more than anyone here about hash functions.

Your explanation is quite clear to me (and probably everyone else here). But I'll just leave it at that.

Cheers,
Adam


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-03  9:09                             ` demerphq
@ 2023-05-03 18:20                               ` Felipe Contreras
  0 siblings, 0 replies; 58+ messages in thread
From: Felipe Contreras @ 2023-05-03 18:20 UTC (permalink / raw)
  To: demerphq, Felipe Contreras
  Cc: Jeff King, Junio C Hamano, brian m. carlson, git, Adam Majer

demerphq wrote:
> On Wed, 3 May 2023 at 02:17, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > Changing the subject as this message seems like a different topic.
> > Jeff King wrote:
> > > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > > > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > > >
> > > > >  `GIT_DEFAULT_HASH`::
> > > > >   If this variable is set, the default hash algorithm for new
> > > > >   repositories will be set to this value. This value is currently
> > > > > + ignored when cloning if the remote value can be definitively
> > > > > + determined; the setting of the remote repository is used
> > > > > + instead. The value is honored if the remote repository's
> > > > > + algorithm cannot be determined, such as some cases when
> > > > > + the remote repository is empty. The default is "sha1".
> > > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > > > + in linkgit:git-init[1].
> > > >
> > > > We'd need to evantually cover all the transports (and non-transport
> > > > like the "--local" optimization) so that the object-format and other
> > > > choices are communicated from the origin to a new clone anyway, so
> > > > this extra complexity "until X is fixed, it behaves this way, but
> > > > otherwise the variable is read in the meantime" may be a disservice
> > > > to the end users, even though it may make it easier in the shorter
> > > > term for maintainers of programs that rely on the buggy "git clone"
> > > > that partially honored this environment variable.
> > > >
> > > > In short, I am still not convinced that the above is a good design
> > > > choice in the longer term.
> > >
> > > I also think it is working against the backwards-compatible design of
> > > the hash function transition.
> >
> > To be honest this whole approach seems to be completely flawed to me and
> > against the whole design of git in the first place.
> >
> > In a recent email Linus Torvalds explained why object ids were
> > calculated based {type, size, data} [1], and he explained very clearly
> > that two objects with exactly the same data are not supposed to have the
> > same id if the type is different.
> 
> He said:
> 
> --- quote-begin ---
> The "no aliasing" means that no two distinct pointers can point to the
> same data. So a tagged pointer of type "commit" can not point to the
> same object as a tagged pointer of type "blob". They are distinct
> pointers, even if (maybe) the commit object encoding ends up then
> being identical to a blob object.
> --- quote-end ---
> 
> As far as I could tell he didn't really explain *why* he wanted this,
> and IMO it is non-obvious why he would care if a blob and a commit had
> the same text, and thus the same ID. He just said he didnt want it to
> happen, not why.

But we don't need to understand why to know it's part of the core design.

If something is part of the core design as a rule it's better to not
mess with it.

> I can imagine some aesthetic reasons why you might want to ensure that
> no blob has the same ID as a commit, and I can imagine it might make
> debugging easier at certain points, but it seems unnecessary given the
> data is write once.

I don't know, but to me separating objects makes sense not just conceptually,
but in practice there's a whole class of potential errors that could be
avoided.

For example, I can think of an implementation of `git prune` that would check
commits first, then trees, then blobs, and the blobs that not reachable from
any trees are removed. But if a commit can have the same id as a blob, you have
to think of a different implementation.

If that's not possible, then you just forget about those potential issues.

> > If even the tiniest change such as adding a period to a commit messange
> > changes the object id (and thus semantically makes it a different
> > object), then it makes sense that changing the type of an object also
> > changes the object id (and thus it's also a different object).
> >
> > And because the id of the parent is included in the content of every
> > commit, the top-level id ensures the integrity of the whole graph.
> >
> > But then comes this notion that the hash algorithm is a property of the
> > repository, and not part of the object storage, which means changing the
> > whole hash algorithm of a repository is considered less of a change than
> > adding a period to the commit message, worse: not a change at all.
> 
> I really dont understand why you think having two hash functions
> producing different results for the same data is comparable to a
> single hash producing different results for different data.

That depends on what you consider the "data" to be.

If you consider the content of a blob to be the data, then you wouldn't
have different results if a commit has the same data: it would be the
same id.

If instead you consider the data to be `type+content`, then you would
have different results.

> In one case you have two different continuum of identifiers, with one
> ID per continuum, and in the other you have two different identifiers
> in the same continuum, and  if you a continuum you would have 4
> different identifiers right? Eg, the two cases are really quite
> different at a fundamental level.

That entirely depends on what data you hash.

If you hash `algo+type+size+data` there's only one id per object.
Period.

> > I am reminded of the warning Sam Smith gave to the Git project [2] which
> > seemed to be unheard, but the notion of cryptographic algorithm agility
> > makes complete sense to me.
> >
> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> Isn't this orthagonal to your other points?

Not if you consider changing the hash algorithm of a repository to be an
important part of its history (more important than adding a period to a
commit).

> > Changing the hash algorithm of one commit should change the object id of
> > that commit, and thus make it semantically a different commit.
> >
> > In other words: an object of type "blob" should never be confused with
> > an object of type "blob:sha-256", even if the content is exactly the
> > same.
> 
> This doesn't make sense to me.  As long as we can distinguish the
> hashes produced by the different hash functions in use we can create a
> mapping of the data that is hashed such that we have a 1:1 mapping of
> identifiers of each type at which point it really doesn't matter which
> hash function is used.

Yes, we *can*, that doesn't mean we *should*.

If you do `git commit --ammend --signoff` to add your `Signed-off-by` to
a commit, there's a 1:1 mapping from the original commit, to the new
one, but conceptually in git they are different objects.

I recall Linus Torvalds mentioned he used Monotone as a guideline of
what *not* to do. In Monotone you could add the equivalent of
`Signed-off-by` without changing the hash of the commit, in fact, you
could add any metadata if I recall correctly. But this opens a whole can
of worms because now how do you know you have all the metadata relevant
to the commit?

Making all the metadata of a commit part of the commit solves the
integrity problem Monotone had at the cost of making git commits
essentially immutable: any change means it's a different commit.

If making *any* change in the object, makes it conceptually a different
object, including the type of the object, how on Earth is changing the
hash algorithm not considered a change?

This object:

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Is considered different from this object:

  ❯ git hash-object -t tree /dev/null
  4b825dc642cb6eb9a060e54bf8d69288fbee4904

That's why they have a different hash.

Why would these objects be considered the same?

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

  ❯ git hash-object -t blob /dev/null
  473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813

It makes *zero* sense that adding a period changes the object, adding a
s-o-b changes the object, changing the type changes the object, but
changing the hash algorithm does not.

> > The fact that apparently it's so easy to clone a repository with
> > the wrong hash algorithm should give developers pause, as it means the
> > whole point of using cryptographic hash algorithms to ensure the
> > integrity of the commit history is completely gone.
> 
> This is a leap too far. The fact that it is "so easy to clone a repo
> with the wrong hash algorithm" is completely orthogonal to the
> fundamental principles of hash identifiers from strong hash functions.

Only if you think changing the hash algorithm is a less important part
of an object than adding a period.

Do you honestly think these two should be considered the same object?

a)

  tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

b)

  tree 6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

> You seem to be deriving grand conclusions from what sounds to me like
> a simple bug/design-oversight.

I think you are dismissing the brilliant idea that made git's object
storage model so successful.

> > I have not been following the SHA-1 -> OID discussions, but I
> > distinctively recall Linus Torvalds mentioning that the choice of using
> > SHA-1 wasn't even for security purposes, it was to ensure integrity.
> > When I do a `git fetch` as long as the new commits have the same SHA-1
> > as parent as the SHA-1s I have in my repository I can be relatively
> > certain the repository has not been tampered with. Which means that if I
> > do a `git fetch` that suddenly brings SHA-256 commits, some of them must
> > have SHA-1 parents that match the ones I currently have. Otherwise how
> > do I know it's the same history?
> 
> So consider what /could/ happen here. You fetch a commit which uses
> SHA-256 into a repo where all of your local commits use SHA-1. The
> commit you fetched says its parent is some SHA-256 ID you don't know
> about as all your ID's are SHA-1. So git then could go and construct
> an index, hashing each item using SHA-256 instead of SHA-1, and using
> the result to build a bi-directional mapping from SHA-1 to SHA-256 and
> back.  All it has to do then is look into the mapping to find if the
> SHA-256 parent id is present in your repo. If it is then you know it's
> the same history.

Yeah, that *could* happen. Doesn't mean it *should*.

> The key point here is that if you ignore SHAttered artifacts (which
> seems reasonable as you can detect the attack during hashing)  you can
> build a 1:1 map of SHA-1 and SHA-256 ids.  Once you have that mapping
> it doesn't matter which ID is used.

It may not matter to you. It matters to me.

> > Maybe that's one of the reasons people don't seem particularly eager to
> > move away from SHA-1:
> 
> Maybe, but it doesn't make sense to me.  You seem to be putting undue
> weight on an unnecessary aspect of the git design: there doesn't seem
> to be a reason for Linuses "no aliasing" policy, and it seems like one
> could build a git-a-like without it without suffering any significant
> penalties.

The fact you don't see a reason doesn't mean there isn't one. This is an
argument from ignorance fallacy.

The appendix was considered a vestigial organ because nobody could see a
reason for it. Did that mean it served no puprpose? No.

> Regardless, provided that the hash functions allow a 1:1 mapping of
> ID's (which is assumed by using "collision free hash functions"), it
> seems like it really doesn't matter which hash is used at any given
> time.

People who designed CVS, Subversion, Monotone, and Mercurial didn't see
a reason for many of Git's design choices either.

I'd argue they were wrong.

I think changing the hash algorithm of a commit matters.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
  2023-05-03  9:03                             ` Adam Majer
  2023-05-03  9:09                             ` demerphq
@ 2023-05-03 22:54                             ` brian m. carlson
  2023-05-08  2:00                               ` Felipe Contreras
  2 siblings, 1 reply; 58+ messages in thread
From: brian m. carlson @ 2023-05-03 22:54 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: Jeff King, Junio C Hamano, git, Adam Majer

[-- Attachment #1: Type: text/plain, Size: 3096 bytes --]

On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
> In my view one repository should be able to have part SHA-1 history,
> part SHA3-256 history, and part BLAKE2b history.

That is practically very difficult and it means that it's hard to have
confidence in the later history because SHA-1 is weak and you have to
rely on it to verify the SHA-256 history later.  Since attacks always
get better, SHA-1 will eventually be so weak that collisions can be
computed in the amount of time we now take for MD4 or MD5 collisions
(i.e., seconds), and with your plan, we'd have to retain that history
forever with the resulting lack of confidence in part of the history.

This also doesn't work with various structures like trees, the index,
and pack and index formats, which have no indication of the algorithm
used and simply rely on fixed-size, often 4-byte aligned object IDs
without any metadata.  In addition, the internals of the code often
don't pass around enough data to make these values variable and thus
this approach would substantially complicate the code in many ways.

Also, we've already decided on the current design a long time ago with
the transition plan after extensive, thoughtful discussion by many
people.  Very few people other than me have worked on sending patches to
work on the hash function transition, and that work up to now has all
been done on my personal time, without compensation of any sort, out of
a desire to improve the project.  Lots of people have opined on how it
should have been different without sending any patches.

If you would like to propose patches for the extensive amount of work to
implement your solution, then we could consider them, although I will
warn you that your approach will likely require at least several hundred
patches.  However, I refer you to the list archives to determine why
your approach is not the one we chose and is not, in my view, the best
path forward.  I should also be clear that I have no intention of
submitting patches to change our approach now or in the future, or
redoing the patches I've already sent.

> The fact that apparently it's so easy to clone a repository with
> the wrong hash algorithm should give developers pause, as it means the
> whole point of using cryptographic hash algorithms to ensure the
> integrity of the commit history is completely gone.

No, it doesn't.  It means that our empty repositories until recently
lacked any indication of the algorithm or other capabilities, which was
a mistake in our original protocol design that has now been corrected.

If you interact with the repository later on when it has data, then if
you're using the wrong hash algorithm, you'll find that you get a
helpful error message that that's not yet supported.  If you patched Git
to ignore that check, you'd find that your repository would just be very
broken in many ways with lots of random crashing and seemingly unrelated
error messages instead of subtly using the wrong algorithm.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-03 17:21                                 ` Adam Majer
@ 2023-05-08  0:34                                   ` Felipe Contreras
  0 siblings, 0 replies; 58+ messages in thread
From: Felipe Contreras @ 2023-05-08  0:34 UTC (permalink / raw)
  To: Adam Majer, Felipe Contreras, Jeff King, Junio C Hamano
  Cc: brian m. carlson, git

Adam Majer wrote:
> On May 3, 2023 5:44:24 p.m. GMT+02:00, Felipe Contreras <felipe.contreras@gmail.com> wrote:
> >Git was designed to make it *impossible* to confuse two commits with similar
> >data.
> 
> That was never ever the problem here.

But it will be.

> >> This is different. But aside, type + size + data are not really much 
> >> different from just having data in a hash function.
> >
> >It's completely different.
> 
> How so? Type and size are just about 2 and a dozen bits of data, respectfully.

Do you understand how checksums work?

Compare these two objects:

 1. 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
 2. 6ba62a7c5e3e9a260c5a30adf2756882c02f12a6

Are they a) "not much different", or b) "completely different"?

Answer: doesn't matter, they are *different*. Period.

> >There are different philosophical views of what "security" means, and it seems
> >pretty clear to me that your view does not align with the view of Linus
> >Torvalds.
> 
> 
> I'm not sure why you are name dropping Linus everywhere

I don't know if you are aware, but Linus Torvalds is the author of git.

He also happens to be the author of the most successful software project
in history: Linux.

So generally his design choices are considered to be good.

> or assuming you know more than anyone here about hash functions.

I don't assume such a thing.

But I'm pretty certain not many people are aware of the integrity issues
VCSs presented circa 2004, that git hashes solved in 2005, because if
they did, they could have created an object model storage similar to
git's, and no one did (except Linus Torvalds).

> Your explanation is quite clear to me (and probably everyone else
> here). But I'll just leave it at that.

Is it? Then you would have no trouble steel manning my argument, which
you haven't done.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-03 22:54                             ` brian m. carlson
@ 2023-05-08  2:00                               ` Felipe Contreras
  2023-05-08 21:38                                 ` brian m. carlson
  0 siblings, 1 reply; 58+ messages in thread
From: Felipe Contreras @ 2023-05-08  2:00 UTC (permalink / raw)
  To: brian m. carlson, Felipe Contreras
  Cc: Jeff King, Junio C Hamano, git, Adam Majer

brian m. carlson wrote:
> On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> That is practically very difficult and it means that it's hard to have
> confidence in the later history because SHA-1 is weak and you have to
> rely on it to verify the SHA-256 history later.

Why would I have to rely on SHA-1 to verify the SHA-256 history later
on?

> Since attacks always get better, SHA-1 will eventually be so weak that
> collisions can be computed in the amount of time we now take for MD4
> or MD5 collisions (i.e., seconds), and with your plan, we'd have to
> retain that history forever with the resulting lack of confidence in
> part of the history.

We have to do the same with your plan as well.

Your plan relies on SHA-256 being interchangeable with SHA-1, so if the
Git project decided to switch *today* to SHA-256, we would have two
object ids:

 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116

This object would refer to a tree and a parent object with SHA-1 ids,
which would be OK, because they would be interchangeable with some
corresponding SHA-256 ids.

Isn't that your plan?

Therefore the SHA-1 of the parent of the commit, and the tree of the
commit would be trusted and retained forever.

> This also doesn't work with various structures like trees, the index,
> and pack and index formats, which have no indication of the algorithm
> used and simply rely on fixed-size, often 4-byte aligned object IDs
> without any metadata.

So? The index and pack objects can be regenerated, so at any point in
time they could be regenerated for SHA-1 or SHA-256.

The tree object is a no-brainer. For an object of type "commit:256" you
require a tree of type "tree:256". Easy.

> In addition, the internals of the code often don't pass around enough
> data to make these values variable and thus this approach would
> substantially complicate the code in many ways.

Really? `enum object_type` is not passed around?

> Also, we've already decided on the current design a long time ago with
> the transition plan after extensive, thoughtful discussion by many
> people.

Who is "we"?

I've participated in many discussions in the git mailing list where the
consensus is that 99% of people decide to do something, and that
something never happens.

The fact that "we" have decided something doesn't carry as much weight
as you seem to think it does.

Moreover, haven't "we" decided that this transitioning plan is
*tentantive*, and the SHA-256 feature is *experimental*?

> Very few people other than me have worked on sending patches to
> work on the hash function transition, and that work up to now has all
> been done on my personal time, without compensation of any sort, out of
> a desire to improve the project.

Which seems to suggest if there is a need, it's not very pressing.

Doesn't it?

> Lots of people have opined on how it should have been different
> without sending any patches.

As is typical.

> If you would like to propose patches for the extensive amount of work
> to implement your solution, then we could consider them, although I
> will warn you that your approach will likely require at least several
> hundred patches.

That's not an issue. I've started projects with several hundred patches
just to prove that something is possible.

> However, I refer you to the list archives to determine why
> your approach is not the one we chose and is not, in my view, the best
> path forward.

Yeah? Provide me with *one* mail proposing my approach.

> I should also be clear that I have no intention of submitting patches
> to change our approach now or in the future, or redoing the patches
> I've already sent.

You don't have to. (and it's not really necessary as it's typically the
case that people don't provide patches for designs that compete against
their own).

> > The fact that apparently it's so easy to clone a repository with
> > the wrong hash algorithm should give developers pause, as it means the
> > whole point of using cryptographic hash algorithms to ensure the
> > integrity of the commit history is completely gone.
> 
> No, it doesn't.  It means that our empty repositories until recently
> lacked any indication of the algorithm or other capabilities, which was
> a mistake in our original protocol design that has now been corrected.

Yes it does.

Can I clone a repository that already transitioned to SHA-256, and then
push a SHA-1 commit?

Well, of course I can't because that's not currently implemented, but if
we followed the current plan that apparently "we" have decided on, it
should be.

> If you interact with the repository later on when it has data, then if
> you're using the wrong hash algorithm, you'll find that you get a
> helpful error message that that's not yet supported.

That isn't true according to your plan.

SHA-1 would be interchangeable with SHA-256, would it not? So according
to the current plan, I would be able to push a SHA-1 commit on a SHA-256
repository.

---

Is it not the case that the current plan aims to have support for SHA-1
and SHA-256 object ids at the same time?

In other words: in your ideal world, the following object ids would
*both* refer to the same git object:

 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116

Would they not?

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-08  2:00                               ` Felipe Contreras
@ 2023-05-08 21:38                                 ` brian m. carlson
  2023-05-09 10:32                                   ` Oswald Buddenhagen
  0 siblings, 1 reply; 58+ messages in thread
From: brian m. carlson @ 2023-05-08 21:38 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: Jeff King, Junio C Hamano, git, Adam Majer

[-- Attachment #1: Type: text/plain, Size: 5737 bytes --]

On 2023-05-08 at 02:00:56, Felipe Contreras wrote:
> brian m. carlson wrote:
> > On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
> > > In my view one repository should be able to have part SHA-1 history,
> > > part SHA3-256 history, and part BLAKE2b history.
> > 
> > That is practically very difficult and it means that it's hard to have
> > confidence in the later history because SHA-1 is weak and you have to
> > rely on it to verify the SHA-256 history later.
> 
> Why would I have to rely on SHA-1 to verify the SHA-256 history later
> on?

If your history contains mixed and matched hash algorithms, you'll need
to be able to verify those commits to the root to have any confidence in
a signed commit or tag, which means trusting SHA-1 if you have any SHA-1
commits in the repository.

> > Since attacks always get better, SHA-1 will eventually be so weak that
> > collisions can be computed in the amount of time we now take for MD4
> > or MD5 collisions (i.e., seconds), and with your plan, we'd have to
> > retain that history forever with the resulting lack of confidence in
> > part of the history.
> 
> We have to do the same with your plan as well.
> 
> Your plan relies on SHA-256 being interchangeable with SHA-1, so if the
> Git project decided to switch *today* to SHA-256, we would have two
> object ids:
> 
>  1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
>  2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116
> 
> This object would refer to a tree and a parent object with SHA-1 ids,
> which would be OK, because they would be interchangeable with some
> corresponding SHA-256 ids.
> 
> Isn't that your plan?

For a period of time, yes.  At some point, people will abandon SHA-1 and
won't use it anymore for a particular repository, and then its security
doesn't matter.

> Therefore the SHA-1 of the parent of the commit, and the tree of the
> commit would be trusted and retained forever.

Nope.  At some point, we just turn off SHA-1.

> > This also doesn't work with various structures like trees, the index,
> > and pack and index formats, which have no indication of the algorithm
> > used and simply rely on fixed-size, often 4-byte aligned object IDs
> > without any metadata.
> 
> So? The index and pack objects can be regenerated, so at any point in
> time they could be regenerated for SHA-1 or SHA-256.

Right, and that point you've basically converted the repository over.

> The tree object is a no-brainer. For an object of type "commit:256" you
> require a tree of type "tree:256". Easy.

That doesn't work with the pack format because there are only seven
valid types of objects, and five of them are used.

> > Also, we've already decided on the current design a long time ago with
> > the transition plan after extensive, thoughtful discussion by many
> > people.
> 
> Who is "we"?

The list members.

> I've participated in many discussions in the git mailing list where the
> consensus is that 99% of people decide to do something, and that
> something never happens.
> 
> The fact that "we" have decided something doesn't carry as much weight
> as you seem to think it does.

Jonathan Nieder decided to propose an approach for how we'd go about
this, and it was discussed extensively on the list and the parts that
I've implemented have almost completely conformed to that documentation.

I think his approach was very thoughtful and addressed many questions
about how the project was to proceed, and I appreciate that he sent it
and that others contributed in a helpful way.

The project wouldn't have been possible unless we had a clear decision
on how to implement things.  There was an opportunity at the time to
comment on the approach and propose alternatives, and we didn't choose
to adopt any.

> Moreover, haven't "we" decided that this transitioning plan is
> *tentantive*, and the SHA-256 feature is *experimental*?

We've documented that it's experimental because it's not seeing wide
use.  I expect that will change, at which point it will no longer be
experimental.  The design is not tentative and there are no plans to
change it.

> > Very few people other than me have worked on sending patches to
> > work on the hash function transition, and that work up to now has all
> > been done on my personal time, without compensation of any sort, out of
> > a desire to improve the project.
> 
> Which seems to suggest if there is a need, it's not very pressing.
> 
> Doesn't it?

No, I don't agree.  An approach to continue to use SHA-1 indefinitely
means that Git will not be viable at many major organizations and
in many major governments which have restrictions on using insecure
cryptography.

I think this ends the point at which I'd like to respond to your
proposal further.  I continue to feel that it's argumentative,
unproductive, and not in the best interests of the project, and I don't
think continuing here is going to shed any more light on the topic.

Ultimately, I feel that my previous statement that we're not going to
see eye to eye on most things continues to be correct.  I also don't
think arguing with you is a productive use of my time or a positive
contribution to the community, and in general, I'm going to avoid
commenting on your patches or proposals because I can't see a way that
we can interact in a useful and helpful way on the list or elsewhere. As
a result, I'll kindly ask you to refrain from CCing me on further emails
to the list or otherwise emailing me for the indefinite future, and I'll
do the same for you after this email.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-08 21:38                                 ` brian m. carlson
@ 2023-05-09 10:32                                   ` Oswald Buddenhagen
  2023-05-09 16:47                                     ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Oswald Buddenhagen @ 2023-05-09 10:32 UTC (permalink / raw)
  To: brian m. carlson, Felipe Contreras, Jeff King, Junio C Hamano,
	git, Adam Majer

On Mon, May 08, 2023 at 09:38:56PM +0000, brian m. carlson wrote:
>On 2023-05-08 at 02:00:56, Felipe Contreras wrote:
>> brian m. carlson wrote:
>> > On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
>> > > In my view one repository should be able to have part SHA-1 history,
>> > > part SHA3-256 history, and part BLAKE2b history.
>> > 
>> > That is practically very difficult and it means that it's hard to have
>> > confidence in the later history because SHA-1 is weak and you have to
>> > rely on it to verify the SHA-256 history later.
>> 
>> Why would I have to rely on SHA-1 to verify the SHA-256 history later
>> on?
>
>If your history contains mixed and matched hash algorithms, you'll need
>to be able to verify those commits to the root to have any confidence in
>a signed commit or tag, which means trusting SHA-1 if you have any SHA-1
>commits in the repository.
>
the history is traversed from the end anyway, so having sha-1 in the 
history is entirely irrelevant for verifying sha-256 commits, assuming 
one may only upgrade the algorithm.

the transition plan implies the intent to ultimately get rid of old 
algos, but this is a non-starter, because old histories need to remain 
accessible indefinitely (you can't rewrite all external references, and 
even for in-history references this would be unreliable and would 
falsify historical builds).

i won't try making an argument for mixed histories, as i'm assuming i 
wouldn't add anything that hasn't already been written.

-- ossi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is GIT_DEFAULT_HASH flawed?
  2023-05-09 10:32                                   ` Oswald Buddenhagen
@ 2023-05-09 16:47                                     ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2023-05-09 16:47 UTC (permalink / raw)
  To: Oswald Buddenhagen
  Cc: brian m. carlson, Felipe Contreras, Jeff King, git, Adam Majer

Oswald Buddenhagen <oswald.buddenhagen@gmx.de> writes:

>>If your history contains mixed and matched hash algorithms, you'll need
>>to be able to verify those commits to the root to have any confidence in
>>a signed commit or tag, which means trusting SHA-1 if you have any SHA-1
>>commits in the repository.
>>
> the history is traversed from the end anyway, so having sha-1 in the
> history is entirely irrelevant for verifying sha-256 commits, assuming
> one may only upgrade the algorithm.

That depends on what is meant by "verify a commit".  If we are only
interested in the tree contents, then the newer commits whose trees
are hashed with a more secure hash algorithm recursively down to the
blobs would lack any weak link hashed by a less secure algorithm.
Most people do not care as deeply how their project tree came to the
shape it has today as they care about what is in the recent trees,
so this is an acceptable stance to take.

If the less secure algorithm becomes so weak that the history, up to
the last commit that was signed by it, can be rewritten arbitrarily,
however, an attacker can lie about why the code that survives to
this day looks the way it does by forging old parts of the history,
and mislead today's developers to do wrong things.  The only way to
prevent that kind of attack is to verify a commit by recursively
making sure not just the commit and its tree, but its parents are
all authentic, and less secure algorithm may make it impossible.

The old history hashed with the old algorithm needs to be kept to
help external references, but I think as a mitigation, those who
care about that part of history can create a copy of the history
hashed with the new algorithm and publish the correspondence between
the two parallel histories to assure the integrity of that part of
the old history.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-04-26 20:40                 ` brian m. carlson
                                     ` (2 preceding siblings ...)
  2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
@ 2023-05-17 19:24                   ` brian m. carlson
  2023-05-17 19:24                     ` [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
                                       ` (2 more replies)
  3 siblings, 3 replies; 58+ messages in thread
From: brian m. carlson @ 2023-05-17 19:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

We recently fixed empty clones with SHA-256 over protocol v2 by
honouring the hash algorithm specified even when no refs are present.
However, in doing so, we made it impossible to set up a v0 or v1
repository by cloning from an empty SHA-256 repository.  In doing so, we
also broke the Git LFS testsuite for SHA-256 repositories.

This series introduces the dummy `capabilities^{}` entry for fetches and
clones from an empty repository for v0 and v1, just as we do for clones.
This is already supported by older versions of Git, as well as libgit2,
dulwich, and JGit.

Changes since v2:
* Move advertisement of fake capabilities ref to a separate function to
  avoid an extra strcmp.

Changes since v1:
* Drop patch to honour GIT_DEFAULT_HASH
* Support all requests, not just HTTP.
* Add more tests.
* Fix NULL pointer dereference.

brian m. carlson (1):
  upload-pack: advertise capabilities when cloning empty repos

 t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
 t/t5700-protocol-v1.sh      | 31 +++++++++++++++++++++++++++++--
 upload-pack.c               | 22 +++++++++++++++++-----
 3 files changed, 73 insertions(+), 7 deletions(-)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos
  2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
@ 2023-05-17 19:24                     ` brian m. carlson
  2023-05-17 21:48                     ` [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
  2023-05-18 18:28                     ` Jeff King
  2 siblings, 0 replies; 58+ messages in thread
From: brian m. carlson @ 2023-05-17 19:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Adam Majer

From: "brian m. carlson" <bk2204@github.com>

When cloning an empty repository, protocol versions 0 and 1 currently
offer nothing but the header and flush packets for the /info/refs
endpoint. This means that no capabilities are provided, so the client
side doesn't know what capabilities are present.

However, this does pose a problem when working with SHA-256
repositories, since we use the capabilities to know the remote side's
object format (hash algorithm).  As of 8b214c2e9d ("clone: propagate
object-format when cloning from void", 2023-04-05), this has been fixed
for protocol v2, since there we always read the hash algorithm from the
remote.

Fortunately, the push version of the protocol already indicates a clue
for how to solve this.  When the /info/refs endpoint is accessed for a
push and the remote is empty, we include a dummy "capabilities^{}" ref
pointing to the all-zeros object ID.  The protocol documentation already
indicates this should _always_ be sent, even for fetches and clones, so
let's just do that, which means we'll properly announce the hash
algorithm as part of the capabilities.  This just works with the
existing code because we share the same ref code for fetches and clones,
and libgit2, JGit, and dulwich do as well.

There is one minor issue to fix, though.  If we called send_ref with
namespaces, we would return NULL with the capabilities entry, which
would cause a crash.  Instead, let's refactor out a function to print
just the ref itself without stripping the namespace and use it for our
special capabilities entry.

Add several sets of tests for HTTP as well as for local clones.  The
behavior can be slightly different for HTTP versus a local or SSH clone
because of the stateless-rpc functionality, so it's worth testing both.

Signed-off-by: brian m. carlson <bk2204@github.com>
---
 t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
 t/t5700-protocol-v1.sh      | 31 +++++++++++++++++++++++++++++--
 upload-pack.c               | 22 +++++++++++++++++-----
 3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index 0908534f25..21b7767cbd 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -611,6 +611,33 @@ test_expect_success 'client falls back from v2 to v0 to match server' '
 	grep symref=HEAD:refs/heads/ trace
 '
 
+test_expect_success 'create empty http-accessible SHA-256 repository' '
+	mkdir "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+	(cd "$HTTPD_DOCUMENT_ROOT_PATH/sha256.git" &&
+	 git --bare init --object-format=sha256
+	)
+'
+
+test_expect_success 'clone empty SHA-256 repository with protocol v2' '
+	rm -fr sha256 &&
+	echo sha256 >expected &&
+	git -c protocol.version=2 clone "$HTTPD_URL/smart/sha256.git" &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp actual expected &&
+	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'clone empty SHA-256 repository with protocol v0' '
+	rm -fr sha256 &&
+	echo sha256 >expected &&
+	GIT_TRACE=1 GIT_TRACE_PACKET=1 git -c protocol.version=0 clone "$HTTPD_URL/smart/sha256.git" &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp actual expected &&
+	git ls-remote "$HTTPD_URL/smart/sha256.git" >actual &&
+	test_must_be_empty actual
+'
+
 test_expect_success 'passing hostname resolution information works' '
 	BOGUS_HOST=gitbogusexamplehost.invalid &&
 	BOGUS_HTTPD_URL=$HTTPD_PROTO://$BOGUS_HOST:$LIB_HTTPD_PORT &&
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 6c8d4c6cf1..a73b4d4ff6 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -244,15 +244,28 @@ test_expect_success 'push with ssh:// using protocol v1' '
 	grep "push< version 1" log
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	git clone --no-local src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 # Test protocol v1 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-test_expect_success 'create repo to be served by http:// transport' '
+test_expect_success 'create repos to be served by http:// transport' '
 	git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
 	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" config http.receivepack true &&
-	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one
+	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one &&
+	git init --object-format=sha256 "$HTTPD_DOCUMENT_ROOT_PATH/sha256" &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/sha256" config http.receivepack true
 '
 
 test_expect_success 'clone with http:// using protocol v1' '
@@ -269,6 +282,20 @@ test_expect_success 'clone with http:// using protocol v1' '
 	grep "git< version 1" log
 '
 
+test_expect_success 'clone with http:// using protocol v1 with empty SHA-256 repo' '
+	GIT_TRACE_PACKET=1 GIT_TRACE_CURL=1 git -c protocol.version=1 \
+		clone "$HTTPD_URL/smart/sha256" sha256 2>log &&
+
+	echo sha256 >expect &&
+	git -C sha256 rev-parse --show-object-format >actual &&
+	test_cmp expect actual &&
+
+	# Client requested to use protocol v1
+	grep "Git-Protocol: version=1" log &&
+	# Server responded using protocol v1
+	grep "git< version 1" log
+'
+
 test_expect_success 'fetch with http:// using protocol v1' '
 	test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" two &&
 
diff --git a/upload-pack.c b/upload-pack.c
index 08633dc121..d3312006a3 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -120,6 +120,7 @@ struct upload_pack_data {
 	unsigned allow_ref_in_want : 1;				/* v2 only */
 	unsigned allow_sideband_all : 1;			/* v2 only */
 	unsigned advertise_sid : 1;
+	unsigned sent_capabilities : 1;
 };
 
 static void upload_pack_data_init(struct upload_pack_data *data)
@@ -1206,18 +1207,17 @@ static void format_session_id(struct strbuf *buf, struct upload_pack_data *d) {
 		strbuf_addf(buf, " session-id=%s", trace2_session_id());
 }
 
-static int send_ref(const char *refname, const struct object_id *oid,
-		    int flag UNUSED, void *cb_data)
+static void write_v0_ref(struct upload_pack_data *data,
+			const char *refname, const char *refname_nons,
+			const struct object_id *oid)
 {
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow deepen-since deepen-not"
 		" deepen-relative no-progress include-tag multi_ack_detailed";
-	const char *refname_nons = strip_namespace(refname);
 	struct object_id peeled;
-	struct upload_pack_data *data = cb_data;
 
 	if (mark_our_ref(refname_nons, refname, oid, &data->hidden_refs))
-		return 0;
+		return;
 
 	if (capabilities) {
 		struct strbuf symref_info = STRBUF_INIT;
@@ -1240,12 +1240,20 @@ static int send_ref(const char *refname, const struct object_id *oid,
 			     git_user_agent_sanitized());
 		strbuf_release(&symref_info);
 		strbuf_release(&session_id);
+		data->sent_capabilities = 1;
 	} else {
 		packet_fwrite_fmt(stdout, "%s %s\n", oid_to_hex(oid), refname_nons);
 	}
 	capabilities = NULL;
 	if (!peel_iterated_oid(oid, &peeled))
 		packet_fwrite_fmt(stdout, "%s %s^{}\n", oid_to_hex(&peeled), refname_nons);
+	return;
+}
+
+static int send_ref(const char *refname, const struct object_id *oid,
+		    int flag UNUSED, void *cb_data)
+{
+	write_v0_ref(cb_data, refname, strip_namespace(refname), oid);
 	return 0;
 }
 
@@ -1379,6 +1387,10 @@ void upload_pack(const int advertise_refs, const int stateless_rpc,
 			data.no_done = 1;
 		head_ref_namespaced(send_ref, &data);
 		for_each_namespaced_ref(send_ref, &data);
+		if (!data.sent_capabilities) {
+			const char *refname = "capabilities^{}";
+			write_v0_ref(&data, refname, refname, null_oid());
+		}
 		/*
 		 * fflush stdout before calling advertise_shallow_grafts because send_ref
 		 * uses stdio.

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
  2023-05-17 19:24                     ` [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
@ 2023-05-17 21:48                     ` Junio C Hamano
  2023-05-17 22:28                       ` brian m. carlson
  2023-05-18 18:28                     ` Jeff King
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2023-05-17 21:48 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Adam Majer

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> We recently fixed empty clones with SHA-256 over protocol v2 by
> honouring the hash algorithm specified even when no refs are present.
> However, in doing so, we made it impossible to set up a v0 or v1
> repository by cloning from an empty SHA-256 repository.  In doing so, we
> also broke the Git LFS testsuite for SHA-256 repositories.
>
> This series introduces the dummy `capabilities^{}` entry for fetches and
> clones from an empty repository for v0 and v1, just as we do for clones.
> This is already supported by older versions of Git, as well as libgit2,
> dulwich, and JGit.
>
> Changes since v2:
> * Move advertisement of fake capabilities ref to a separate function to
>   avoid an extra strcmp.

We want this in -rc2 if not -rc1 for the upcoming release, right?
I've read the patch again and it all looked sensible.

Thanks.


> Changes since v1:
> * Drop patch to honour GIT_DEFAULT_HASH
> * Support all requests, not just HTTP.
> * Add more tests.
> * Fix NULL pointer dereference.
>
> brian m. carlson (1):
>   upload-pack: advertise capabilities when cloning empty repos
>
>  t/t5551-http-fetch-smart.sh | 27 +++++++++++++++++++++++++++
>  t/t5700-protocol-v1.sh      | 31 +++++++++++++++++++++++++++++--
>  upload-pack.c               | 22 +++++++++++++++++-----
>  3 files changed, 73 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-05-17 21:48                     ` [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
@ 2023-05-17 22:28                       ` brian m. carlson
  0 siblings, 0 replies; 58+ messages in thread
From: brian m. carlson @ 2023-05-17 22:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Adam Majer

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

On 2023-05-17 at 21:48:39, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > We recently fixed empty clones with SHA-256 over protocol v2 by
> > honouring the hash algorithm specified even when no refs are present.
> > However, in doing so, we made it impossible to set up a v0 or v1
> > repository by cloning from an empty SHA-256 repository.  In doing so, we
> > also broke the Git LFS testsuite for SHA-256 repositories.
> >
> > This series introduces the dummy `capabilities^{}` entry for fetches and
> > clones from an empty repository for v0 and v1, just as we do for clones.
> > This is already supported by older versions of Git, as well as libgit2,
> > dulwich, and JGit.
> >
> > Changes since v2:
> > * Move advertisement of fake capabilities ref to a separate function to
> >   avoid an extra strcmp.
> 
> We want this in -rc2 if not -rc1 for the upcoming release, right?
> I've read the patch again and it all looked sensible.

If it's possible, that would be great.  I understand you just put out
-rc0, and I apologize for the delay in getting back to you, but it would
be ideal to avoid having the problem we're fixing in the release.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
  2023-05-17 19:24                     ` [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
  2023-05-17 21:48                     ` [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
@ 2023-05-18 18:28                     ` Jeff King
  2023-05-19 15:32                       ` brian m. carlson
  2 siblings, 1 reply; 58+ messages in thread
From: Jeff King @ 2023-05-18 18:28 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Junio C Hamano, Adam Majer

On Wed, May 17, 2023 at 07:24:42PM +0000, brian m. carlson wrote:

> Changes since v2:
> * Move advertisement of fake capabilities ref to a separate function to
>   avoid an extra strcmp.

Thanks, the code change looks good to me.

I'm still a little puzzled why we need http tests both in t5551 and
t5700. Not too big a deal to have redundant tests, obviously, but I
wonder if I am missing something.

-Peff

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1
  2023-05-18 18:28                     ` Jeff King
@ 2023-05-19 15:32                       ` brian m. carlson
  0 siblings, 0 replies; 58+ messages in thread
From: brian m. carlson @ 2023-05-19 15:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Adam Majer

[-- Attachment #1: Type: text/plain, Size: 487 bytes --]

On 2023-05-18 at 18:28:38, Jeff King wrote:
> I'm still a little puzzled why we need http tests both in t5551 and
> t5700. Not too big a deal to have redundant tests, obviously, but I
> wonder if I am missing something.

Technically, they work with v0 and v1, which are separate, and I wanted
to test both, not just v0.  I realize that practically, they are very
similar, but I prefer to be a bit more thorough.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2023-05-19 15:32 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-05 10:28 git clone of empty repositories doesn't preserve hash Adam Majer
2023-04-05 19:04 ` Junio C Hamano
2023-04-05 19:47   ` Adam Majer
2023-04-05 20:01     ` Jeff King
2023-04-05 20:40       ` Junio C Hamano
2023-04-05 21:15         ` Junio C Hamano
2023-04-05 21:26           ` Jeff King
2023-04-05 22:48           ` brian m. carlson
2023-04-06 13:11           ` Adam Majer
2023-04-25 21:35           ` brian m. carlson
2023-04-25 22:24             ` Junio C Hamano
2023-04-25 23:12             ` Junio C Hamano
2023-04-26  0:20               ` brian m. carlson
2023-04-26 11:25                 ` Jeff King
2023-04-26 15:08                   ` Junio C Hamano
2023-04-26 15:13                     ` [PATCH] doc: GIT_DEFAULT_HASH is and will be ignored during "clone" Junio C Hamano
2023-04-26 21:06                       ` brian m. carlson
2023-04-27  4:46                     ` git clone of empty repositories doesn't preserve hash Jeff King
2023-04-26 10:51               ` Jeff King
2023-04-26 15:42                 ` Junio C Hamano
2023-04-26 20:40                 ` brian m. carlson
2023-04-26 20:53                   ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
2023-04-26 20:53                     ` [PATCH 1/2] http: advertise capabilities when cloning empty repos brian m. carlson
2023-04-26 21:14                       ` Junio C Hamano
2023-04-26 21:28                         ` brian m. carlson
2023-04-27  5:00                           ` Jeff King
2023-04-27  5:30                       ` Jeff King
2023-04-27 20:40                         ` Junio C Hamano
2023-04-26 20:53                     ` [PATCH 2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo brian m. carlson
2023-04-26 21:18                       ` Junio C Hamano
2023-04-26 21:33                       ` Junio C Hamano
2023-04-27  5:43                         ` Jeff King
2023-05-02 23:46                           ` Is GIT_DEFAULT_HASH flawed? Felipe Contreras
2023-05-03  9:03                             ` Adam Majer
2023-05-03 15:44                               ` Felipe Contreras
2023-05-03 17:21                                 ` Adam Majer
2023-05-08  0:34                                   ` Felipe Contreras
2023-05-03  9:09                             ` demerphq
2023-05-03 18:20                               ` Felipe Contreras
2023-05-03 22:54                             ` brian m. carlson
2023-05-08  2:00                               ` Felipe Contreras
2023-05-08 21:38                                 ` brian m. carlson
2023-05-09 10:32                                   ` Oswald Buddenhagen
2023-05-09 16:47                                     ` Junio C Hamano
2023-04-26 21:12                     ` [PATCH 0/2] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-04-27  4:56                   ` git clone of empty repositories doesn't preserve hash Jeff King
2023-05-01 17:00                   ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 brian m. carlson
2023-05-01 17:00                     ` [PATCH v2 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
2023-05-01 22:40                       ` Jeff King
2023-05-01 22:51                         ` Junio C Hamano
2023-05-01 17:37                     ` [PATCH v2 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-05-17 19:24                   ` [PATCH v3 " brian m. carlson
2023-05-17 19:24                     ` [PATCH v3 1/1] upload-pack: advertise capabilities when cloning empty repos brian m. carlson
2023-05-17 21:48                     ` [PATCH v3 0/1] Fix empty SHA-256 clones with v0 and v1 Junio C Hamano
2023-05-17 22:28                       ` brian m. carlson
2023-05-18 18:28                     ` Jeff King
2023-05-19 15:32                       ` brian m. carlson
2023-04-05 21:23         ` git clone of empty repositories doesn't preserve hash Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).