Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Glen Choo <chooglen@google.com>
To: Taylor Blau <me@ttaylorr.com>, git@vger.kernel.org
Cc: Derrick Stolee <derrickstolee@github.com>,
	Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>,
	Chris Torek <chris.torek@gmail.com>, <me@waleedkhan.name>,
	<martinvonz@google.com>
Subject: Re: [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook`
Date: Wed, 24 May 2023 16:21:34 -0700	[thread overview]
Message-ID: <kl6l1qj5z56p.fsf@chooglen-macbookpro.roam.corp.google.com> (raw)
In-Reply-To: <18e50d2517ba4cc81d4bafb0b029ca7a770f23a7.1684196634.git.me@ttaylorr.com>

Hi Taylor! It was great seeing you at Review Club today :)

If you'd like, the notes are available at:

  https://docs.google.com/document/d/14L8BAumGTpsXpjDY8VzZ4rRtpAjuGrFSRqn3stCuS_w/edit

but that's optional, since a ground rule is that all important feedback
should be sent to the mailing list.

Taylor Blau <me@ttaylorr.com> writes:

> This patch introduces a new multi-valued configuration option,
> `gc.recentObjectsHook` as a means to mark certain objects as recent (and
> thus exempt from garbage collection), regardless of their age.
>
> However, we have no convenient way to keep certain precious, unreachable
> objects around in the repository, even if they have aged out and would
> be pruned. Our options today consist of:
>
>   - Point references at the reachability tips of any objects you
>     consider precious, which may be undesirable or infeasible.

What I like about the hook is that it matches the desired use case quite
well - if an object is precious but unreachable, it can't be because Git
thinks it's precious. Rather, something else thinks it's precious, so
the obvious fix is for Git to learn how to listen to that tool. Based
off only what's in this commit message though, this feature doesn't seem
sufficiently justified yet. In particular, it's not immediately clear
how this fits in with the alternatives, especially pointing refs at
precious objects (which is probably the most popular way people have
been doing this). It would be useful to flesh out why keeping these
extra refs are either "undesirable" or "infeasible". Presumably, you
already have some idea of why this is the case for the GitHub 'audit
log'.

Another potential use case I can think of is client-side third party Git
tools that implement custom workflows on top of a Git repo, e.g.
git-branchless (https://github.com/arxanas/git-branchless) and jj
(https://github.com/martinvonz/jj/, btw I contribute a little to jj
too). Both of those tools reference commits that Git can consider
unreachable (e.g. they support anonymous branches and "undo"
operations), and so they both create custom refs to keep those commits
alive. The number of refs isn't huge, though it is more than what a
typical repo would see (maybe in the 100s), and ISTR that some users
have reported that it's enough to noticeably slow down the "git fetch"
negotiation. I don't think managing these refs is "infeasible", and
there are workarounds to shrink the number of refs (e.g. creating an
octopus merge of the commits you want and pointing a ref at it), but it
would certainly be a lot easier to use the hook instead.

I've cc-ed the maintainers of both projects here in case they have more
to share.

> +gc.recentObjectsHook::

I have a small preference to use "command" instead of "hook" to avoid
confusion with Git hooks (I've already observed some of this confusion
in off-list conversations). There's some precedent for "hook" in
`uploadpack.packObjectsHook`, but that's a one-off (and it used to
confuse me a lot too :P).

> +	When considering the recency of an object (e.g., when generating
> +	a cruft pack or storing unreachable objects as loose), use the
> +	shell to execute the specified command(s). Interpret their
> +	output as object IDs which Git will consider as "recent",
> +	regardless of their age.

From a potential user's POV, two things seem unclear:

- What does "recenct" mean in this context? Does it just mean
  "precious"?
- What objects do I need to list? Is it every object I want to keep or
  just the reachable tips?

Documentation/git-gc.txt has some pointers:

  . Any object with modification time newer than the `--prune` date is kept,
    along with everything reachable from it.

but it seems quite hard to discover this and put the pieces together. I
wonder if we could make this easier by:

- Replacing "recent" with "precious" (if this is really what we mean),
  and the fact that we use the recency check is an implementation
  detail.
- Explicitly saying that everything reachable from the object is also
  kept.

In the code changes, I noticed a few out-of-date references to "cruft
tips", but everything else looked reasonable to me.

> diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
> index 303f7a5d84..3ae61ca995 100755
> --- a/t/t5329-pack-objects-cruft.sh
> +++ b/t/t5329-pack-objects-cruft.sh
> @@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
>  	)
>  '
>  
> +test_expect_success 'gc.recentObjectsHook' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		# Create a handful of objects.
> +		#
> +		#   - one reachable commit, "base", designated for the reachable
> +		#     pack
> +		#   - one unreachable commit, "cruft.discard", which is marked
> +		#     for deletion
> +		#   - one unreachable commit, "cruft.old", which would be marked
> +		#     for deletion, but is rescued as an extra cruft tip
> +		#   - one unreachable commit, "cruft.new", which is not marked
> +		#     for deletion
> +		test_commit base &&
> +		git branch -M main &&
> +
> +		git checkout --orphan discard &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.discard &&
> +
> +		git checkout --orphan old &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.old &&
> +		cruft_old="$(git rev-parse HEAD)" &&
> +
> +		git checkout --orphan new &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.new &&
> +		cruft_new="$(git rev-parse HEAD)" &&
> +
> +		git checkout main &&
> +		git branch -D discard old new &&
> +		git reflog expire --all --expire=all &&
> +
> +		# mark cruft.old with an mtime that is many minutes
> +		# older than the expiration period, and mark cruft.new
> +		# with an mtime that is in the future (and thus not
> +		# eligible for pruning).
> +		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
> +		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
> +
> +		# Write the list of cruft objects we expect to
> +		# accumulate, which is comprised of everything reachable
> +		# from cruft.old and cruft.new, but not cruft.discard.
> +		git rev-list --objects --no-object-names \
> +			$cruft_old $cruft_new >cruft.raw &&
> +		sort cruft.raw >cruft.expect &&
> +
> +		# Write the script to list extra tips, which are limited
> +		# to cruft.old, in this case.
> +		write_script extra-tips <<-EOF &&
> +		echo $cruft_old
> +		EOF
> +		git config gc.recentObjectsHook ./extra-tips &&
> +
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +		test_cmp cruft.expect cruft.actual &&
> +
> +		# Ensure that the "old" objects are removed after
> +		# dropping the gc.recentObjectsHook hook.
> +		git config --unset gc.recentObjectsHook &&
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +
> +		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
> +		cp cruft.expect cruft.old &&
> +		sort cruft.raw >cruft.expect &&
> +		test_cmp cruft.expect cruft.actual &&
> +
> +		# ensure objects which are no longer in the cruft pack were
> +		# removed from the repository
> +		for object in $(comm -13 cruft.expect cruft.old)
> +		do
> +			test_must_fail git cat-file -t $object || return 1
> +		done
> +	)

Thanks for the comments. The tests are quite verbose, and I wouldn't
be able to understand them otherwise. I thought it would be nice to have
a test helper to check that the cruft pack contains the expected objects
(replacing the "git show-index" + "cut | sort" + "test_cmp" recipe), but
this test fits in with the existing style, so I don't consider that a
blocker.

  reply	other threads:[~2023-05-24 23:21 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
2023-05-11 23:20 ` [PATCH v3 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
2023-05-11 23:20 ` [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook` Taylor Blau
2023-05-12  4:58   ` Jeff King
2023-05-15 20:15     ` Taylor Blau
2023-05-12 21:24   ` Jeff King
2023-05-12 21:36     ` Taylor Blau
2023-05-12 21:46       ` Jeff King
2023-05-12 21:45     ` Jeff King
2023-05-12 22:01       ` Jeff King
2023-05-12 23:21         ` Junio C Hamano
2023-05-13  0:11           ` Jeff King
2023-05-13  0:11             ` Jeff King
2023-05-15 20:49       ` Taylor Blau
2023-05-15 20:38     ` Taylor Blau
2023-05-11 23:23 ` [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
2023-05-11 23:39 ` Junio C Hamano
2023-05-11 23:48   ` Taylor Blau
2023-05-16  0:23 ` [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
2023-05-16  0:24   ` [PATCH v4 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
2023-05-16  0:24   ` [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
2023-05-24 23:21     ` Glen Choo [this message]
2023-06-07 22:56       ` Taylor Blau
2023-06-07 22:58 ` [PATCH v5 0/2] " Taylor Blau
2023-06-07 22:58   ` [PATCH v5 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
2023-06-07 22:58   ` [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
2023-06-09 23:33     ` Glen Choo
2023-06-12 21:14       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=kl6l1qj5z56p.fsf@chooglen-macbookpro.roam.corp.google.com \
    --to=chooglen@google.com \
    --cc=chris.torek@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=martinvonz@google.com \
    --cc=me@ttaylorr.com \
    --cc=me@waleedkhan.name \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).