Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook`
@ 2023-05-11 23:20 Taylor Blau
  2023-05-11 23:20 ` [PATCH v3 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-11 23:20 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano

Here is a reworked version of the patch which introduced a new
configuration `pack.extraCruftTips` to keep additional objects from
pruning during a cruft pack-generating GC.

The third round was significantly more complicated than necessary, and I
think this round represents a couple of improvements thanks to a very
helpful set of reviews from Junio and Peff:

  - The new code does not change the existing cruft pack implementation
    whatsoever. This is nice, since cruft packs will be the default in
    the next release, so changing that implementation carries additional
    risk.

  - The new code is also not specific to cruft packs, meaning that you
    can do things like:

        $ git -c pack.extraRecentObjectsHook=... \
          repack -Ad --unpack-unreachable=1.hour.ago

    and have it write out loose copies of any object(s) mentioned by one
    or more of the configured hooks.

I am hopeful that others think this version is in a good spot. As in
earlier rounds, I would appreciate an extra careful review on this
topic, because of the changing default I mentioned earlier.

Thanks in advance for your review.

Taylor Blau (2):
  reachable.c: extract `obj_is_recent()`
  builtin/pack-objects.c: introduce `pack.recentObjectsHook`

 Documentation/config/pack.txt        |  13 ++
 builtin/pack-objects.c               |   1 +
 reachable.c                          | 100 +++++++++++++++-
 t/t5329-pack-objects-cruft.sh        | 171 +++++++++++++++++++++++++++
 t/t7701-repack-unpack-unreachable.sh |  22 ++++
 5 files changed, 303 insertions(+), 4 deletions(-)

Range-diff against v2:
-:  ---------- > 1:  f5f3b0f334 reachable.c: extract `obj_is_recent()`
1:  da1711b13b ! 2:  2ce8a79fa4 builtin/pack-objects.c: introduce `pack.extraCruftTips`
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    builtin/pack-objects.c: introduce `pack.extraCruftTips`
    +    builtin/pack-objects.c: introduce `pack.recentObjectsHook`
     
         This patch introduces a new multi-valued configuration option,
    -    `pack.extraCruftTips` as an alternative means to mark certain objects in
    -    the cruft pack as rescued, even if they would otherwise be pruned.
    +    `pack.recentObjectsHook` as a means to mark certain objects as recent,
    +    regardless of their age.
     
    -    Traditionally, cruft packs are generated in one of two ways:
    +    Depending on whether or not we are generating a cruft pack, this allows
    +    the caller to do one of two things:
     
    -      - When not pruning, all packed and loose objects which do not appear
    -        in the any of the packs which wil remain (as indicated over stdin,
    -        with `--cruft`) are packed separately into a cruft pack.
    +      - If generating a cruft pack, the caller is able to retain additional
    +        objects via the cruft pack, even if they would have otherwise been
    +        pruned due to their age.
     
    -      - When pruning, the above process happens, with two additional tweaks.
    -        Instead of adding every object into the cruft pack, only objects
    -        which have mtime that is within the grace period are kept. In
    -        addition, a "rescuing" traversal is performed over the remaining
    -        cruft objects to keep around cruft objects which would have aged out
    -        of the repository, but are reachable from other cruft objects which
    -        have not yet aged out.
    +      - If not generating a cruft pack, the caller is likewise able to
    +        retain additional objects as loose.
     
    -    This works well for repositories which have all of their "interesting"
    -    objects worth keeping around reachable via at least one reference.
    -
    -    However, there is no option to be able to keep around certain objects
    +    There is currently no option to be able to keep around certain objects
         that have otherwise aged out of the grace period. The only way to retain
         those objects is:
     
           - to point a reference at them, which may be undesirable or
             infeasible,
    +
           - to track them via the reflog, which may be undesirable since the
             reflog's lifetime is limited to that of the reference it's tracking
             (and callers may want to keep those unreachable objects around for
             longer)
    +
           - to extend the grace period, which may keep around other objects that
             the caller *does* want to discard,
    +
           - or, to force the caller to construct the pack of objects they want
             to keep themselves, and then mark the pack as kept by adding a
             ".keep" file.
     
    -    This patch introduces a new configuration, `pack.extraCruftTips` which
    -    allows the caller to specify a program (or set of programs) whose output
    -    is treated as a set of objects to treat as "kept", even if they are
    -    unreachable and have aged out of the retention period.
    +    This patch introduces a new configuration, `pack.recentObjectsHook`
    +    which allows the caller to specify a program (or set of programs) whose
    +    output is treated as a set of objects to treat as recent, regardless of
    +    their true age.
     
    -    The implementation is straightforward: after determining the set of
    -    objects to pack into the cruft pack (either by calling
    -    `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
    -    we figure out if there are any program(s) we need to consult in order to
    -    determine if there are any objects which we need to "rescue". We then
    -    add those as tips to another reachability traversal, marking every
    -    object along the way as cruft (and thus retaining it in the cruft pack).
    +    The implementation is straightforward. In either case (cruft packs or
    +    not), Git enumerates recent objects via
    +    `add_unseen_recent_objects_to_traversal()`. That enumerates loose and
    +    packed objects, and eventually calls add_recent_object() on any objects
    +    for which `want_recent_object()`'s conditions are met.
    +
    +    This patch modifies the recency condition from simply "is the mtime of
    +    this object more recent than the cutoff?" to "[...] or, is this object
    +    mentioned by at least one `pack.recentObjectsHook`?".
    +
    +    We then add those as tips to another reachability traversal (along with
    +    any recent objects, if pruning), marking every object along the way
    +    (either adding it to the cruft pack, or writing it out as a loose
    +    object).
     
         A potential alternative here is to introduce a new mode to alter the
         contents of the reachable pack instead of the cruft one. One could
    -    imagine a new option to `pack-objects`, say `--extra-tips` that does the
    -    same thing as above, adding the visited set of objects along the
    -    traversal to the pack.
    +    imagine a new option to `pack-objects`, say `--extra-reachable-tips`
    +    that does the same thing as above, adding the visited set of objects
    +    along the traversal to the pack.
     
         But this has the unfortunate side-effect of altering the reachability
         closure of that pack. If parts of the unreachable object graph mentioned
    -    by one or more of the "extra tips" programs is not closed, then the
    -    resulting pack won't be either. This makes it impossible in the general
    -    case to write out reachability bitmaps for that pack, since closure is a
    -    requirement there.
    +    by one or more of the "extra reachable tips" programs is not closed,
    +    then the resulting pack won't be either. This makes it impossible in the
    +    general case to write out reachability bitmaps for that pack, since
    +    closure is a requirement there.
     
    -    Instead, keep these unreachable objects in the cruft pack instead, to
    -    ensure that we can continue to have a pack containing just reachable
    -    objects, which is always safe to write a bitmap over.
    +    Instead, keep these unreachable objects in the cruft pack (or set of
    +    unreachable, loose objects) instead, to ensure that we can continue to
    +    have a pack containing just reachable objects, which is always safe to
    +    write a bitmap over.
     
    +    Helped-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## Documentation/config/pack.txt ##
    @@ Documentation/config/pack.txt: pack.deltaCacheLimit::
      	result once the best match for all objects is found.
      	Defaults to 1000. Maximum value is 65535.
      
    -+pack.extraCruftTips::
    -+	When generating a cruft pack, use the shell to execute the
    -+	specified command(s), and interpret their output as additional
    -+	tips of objects to keep in the cruft pack, regardless of their
    -+	age.
    ++pack.recentObjectsHook::
    ++	When considering the recency of an object (e.g., when generating
    ++	a cruft pack or storing unreachable objects as loose), use the
    ++	shell to execute the specified command(s). Interpret their
    ++	output as object IDs which Git will consider as "recent",
    ++	regardless of their age.
     ++
     +Output must contain exactly one hex object ID per line, and nothing
     +else. Objects which cannot be found in the repository are ignored.
    -+Multiple hooks are supported, but all must exit successfully, else no
    -+cruft pack will be generated.
    ++Multiple hooks are supported, but all must exit successfully, else the
    ++operation (either generating a cruft pack or unpacking unreachable
    ++objects) will be halted.
     +
      pack.threads::
      	Specifies the number of threads to spawn when searching for best
    @@ builtin/pack-objects.c
      
      /*
       * Objects we are going to pack are collected in the `to_pack` structure.
    -@@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
    - 	stop_progress(&progress_state);
    - }
    +
    + ## reachable.c ##
    +@@
    + #include "object-store.h"
    + #include "pack-bitmap.h"
    + #include "pack-mtimes.h"
    ++#include "config.h"
    ++#include "run-command.h"
      
    -+static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
    + struct connectivity_progress {
    + 	struct progress *progress;
    +@@ reachable.c: struct recent_data {
    + 	timestamp_t timestamp;
    + 	report_recent_object_fn *cb;
    + 	int ignore_in_core_kept_packs;
    ++
    ++	struct oidset extra_recent_oids;
    ++	int extra_recent_oids_loaded;
    + };
    + 
    ++static int run_one_pack_recent_objects_hook(struct oidset *set,
    ++					    const char *args)
     +{
     +	struct child_process cmd = CHILD_PROCESS_INIT;
     +	struct strbuf buf = STRBUF_INIT;
    -+	FILE *out = NULL;
    ++	FILE *out;
     +	int ret = 0;
     +
     +	cmd.use_shell = 1;
     +	cmd.out = -1;
     +
     +	strvec_push(&cmd.args, args);
    -+	strvec_pushv(&cmd.env, (const char **)local_repo_env);
     +
    -+	if (start_command(&cmd)) {
    -+		ret = -1;
    -+		goto done;
    -+	}
    ++	if (start_command(&cmd))
    ++		return -1;
     +
     +	out = xfdopen(cmd.out, "r");
    -+	while (strbuf_getline_lf(&buf, out) != EOF) {
    ++	while (strbuf_getline(&buf, out) != EOF) {
     +		struct object_id oid;
    -+		struct object *obj;
     +		const char *rest;
     +
     +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +			goto done;
     +		}
     +
    -+		obj = parse_object(the_repository, &oid);
    -+		if (!obj)
    -+			continue;
    -+
    -+		display_progress(progress_state, ++nr_seen);
    -+		add_pending_object(revs, obj, "");
    ++		oidset_insert(set, &oid);
     +	}
     +
     +	ret = finish_command(&cmd);
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +	return ret;
     +}
     +
    -+static int enumerate_extra_cruft_tips(void)
    ++static void load_pack_recent_objects(struct recent_data *data)
     +{
    -+	struct rev_info revs;
     +	const struct string_list *programs;
     +	int ret = 0;
     +	size_t i;
     +
    -+	if (git_config_get_string_multi("pack.extracrufttips", &programs))
    -+		return ret;
    ++	data->extra_recent_oids_loaded = 1;
     +
    -+	repo_init_revisions(the_repository, &revs, NULL);
    ++	if (git_config_get_string_multi("pack.recentobjectshook", &programs))
    ++		return;
     +
    -+	revs.tag_objects = 1;
    -+	revs.tree_objects = 1;
    -+	revs.blob_objects = 1;
    -+
    -+	revs.include_check = cruft_include_check;
    -+	revs.include_check_obj = cruft_include_check_obj;
    -+
    -+	revs.ignore_missing_links = 1;
    -+
    -+	if (progress)
    -+		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
    -+	nr_seen = 0;
     +	for (i = 0; i < programs->nr; i++) {
    -+		ret = enumerate_extra_cruft_tips_1(&revs,
    -+						   programs->items[i].string);
    ++		ret = run_one_pack_recent_objects_hook(&data->extra_recent_oids,
    ++						       programs->items[i].string);
     +		if (ret)
     +			break;
     +	}
    -+	stop_progress(&progress_state);
     +
     +	if (ret)
     +		die(_("unable to enumerate additional cruft tips"));
    -+
    -+	if (prepare_revision_walk(&revs))
    -+		die(_("revision walk setup failed"));
    -+	if (progress)
    -+		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
    -+	nr_seen = 0;
    -+
    -+	/*
    -+	 * Retain all objects reachable from extra tips via
    -+	 * show_cruft_commit(), and show_cruft_object(), regardless of
    -+	 * their mtime.
    -+	 */
    -+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
    -+	stop_progress(&progress_state);
    -+
    -+	return ret;
     +}
     +
    - static void read_cruft_objects(void)
    + static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
    + 			 struct recent_data *data)
      {
    - 	struct strbuf buf = STRBUF_INIT;
    -@@ builtin/pack-objects.c: static void read_cruft_objects(void)
    - 	else
    - 		enumerate_cruft_objects();
    +-	return mtime > data->timestamp;
    ++	if (mtime > data->timestamp)
    ++		return 1;
    ++
    ++	if (!data->extra_recent_oids_loaded)
    ++		load_pack_recent_objects(data);
    ++	return oidset_contains(&data->extra_recent_oids, oid);
    + }
    + 
    + static void add_recent_object(const struct object_id *oid,
    +@@ reachable.c: static int want_recent_object(struct recent_data *data,
    + 			      const struct object_id *oid)
    + {
    + 	if (data->ignore_in_core_kept_packs &&
    +-	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
    ++	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS)) {
    ++		if (!data->extra_recent_oids_loaded)
    ++			load_pack_recent_objects(data);
    ++		if (oidset_contains(&data->extra_recent_oids, oid))
    ++			return 1;
    ++
    + 		return 0;
    ++	}
    + 	return 1;
    + }
    + 
    +@@ reachable.c: int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
    + 	data.cb = cb;
    + 	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
      
    -+	enumerate_extra_cruft_tips();
    ++	oidset_init(&data.extra_recent_oids, 0);
    ++	data.extra_recent_oids_loaded = 0;
     +
    - 	strbuf_release(&buf);
    - 	string_list_clear(&discard_packs, 0);
    - 	string_list_clear(&fresh_packs, 0);
    + 	r = for_each_loose_object(add_recent_loose, &data,
    + 				  FOR_EACH_OBJECT_LOCAL_ONLY);
    + 	if (r)
    +-		return r;
    ++		goto done;
    + 
    + 	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
    + 	if (ignore_in_core_kept_packs)
    + 		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
    + 
    +-	return for_each_packed_object(add_recent_packed, &data, flags);
    ++	r = for_each_packed_object(add_recent_packed, &data, flags);
    ++
    ++done:
    ++	oidset_clear(&data.extra_recent_oids);
    ++
    ++	return r;
    + }
    + 
    + static int mark_object_seen(const struct object_id *oid,
     
      ## t/t5329-pack-objects-cruft.sh ##
     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend via loose' '
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		write_script extra-tips <<-EOF &&
     +		echo $cruft_old
     +		EOF
    -+		git config pack.extraCruftTips ./extra-tips &&
    ++		git config pack.recentObjectsHook ./extra-tips &&
     +
     +		git repack --cruft --cruft-expiration=now -d &&
     +
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +
     +		# Ensure that the "old" objects are removed after
     +		# dropping the pack.extraCruftTips hook.
    -+		git config --unset pack.extraCruftTips &&
    ++		git config --unset pack.recentObjectsHook &&
     +		git repack --cruft --cruft-expiration=now -d &&
     +
     +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +
     +		# ensure that each extra cruft tip is saved by its
     +		# respective hook
    -+		git config --add pack.extraCruftTips ./extra-tips.a &&
    -+		git config --add pack.extraCruftTips ./extra-tips.b &&
    ++		git config --add pack.recentObjectsHook ./extra-tips.a &&
    ++		git config --add pack.recentObjectsHook ./extra-tips.b &&
     +		git repack --cruft --cruft-expiration=now -d &&
     +
     +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		test_cmp cruft.expect cruft.actual &&
     +
     +		# ensure that a dirty exit halts cruft pack generation
    -+		git config --add pack.extraCruftTips ./extra-tips.c &&
    -+		test_must_fail git repack --cruft -d 2>err &&
    ++		git config --add pack.recentObjectsHook ./extra-tips.c &&
    ++		test_must_fail git repack --cruft --cruft-expiration=now -d 2>err &&
     +		grep "unable to enumerate additional cruft tips" err &&
     +
     +		# and that the existing cruft pack is left alone
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		write_script extra-tips <<-EOF &&
     +		echo $blob
     +		EOF
    -+		git config pack.extraCruftTips ./extra-tips &&
    ++		git config pack.recentObjectsHook ./extra-tips &&
     +
     +		git repack --cruft --cruft-expiration=now -d &&
     +
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +'
     +
      test_done
    +
    + ## t/t7701-repack-unpack-unreachable.sh ##
    +@@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' '
    + 	test_must_fail git cat-file -p $obj2
    + '
    + 
    ++test_expect_success 'extra recent tips are kept regardless of age' '
    ++	obj1=$(echo one | git hash-object -w --stdin) &&
    ++	obj2=$(echo two | git hash-object -w --stdin) &&
    ++	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
    ++	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
    ++	git prune-packed &&
    ++
    ++	git cat-file -p $obj1 &&
    ++	git cat-file -p $obj2 &&
    ++
    ++	write_script extra-tips <<-EOF &&
    ++	echo $obj2
    ++	EOF
    ++	git config pack.recentObjectsHook ./extra-tips &&
    ++
    ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
    ++	git repack -A -d --unpack-unreachable=1.hour.ago &&
    ++
    ++	git cat-file -p $obj1 &&
    ++	git cat-file -p $obj2
    ++'
    ++
    + test_expect_success 'keep packed objects found only in index' '
    + 	echo my-unique-content >file &&
    + 	git add file &&
-- 
2.40.1.554.g2ce8a79fa4

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 1/2] reachable.c: extract `obj_is_recent()`
  2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
@ 2023-05-11 23:20 ` Taylor Blau
  2023-05-11 23:20 ` [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook` Taylor Blau
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-11 23:20 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano

When enumerating objects in order to add recent ones (i.e. those whose
mtime is strictly newer than the cutoff) as tips of a reachability
traversal, `add_recent_object()` discards objects which do not meet the
recency criteria.

The subsequent commit will make checking whether or not an object is
recent also consult the list of hooks in `pack.recentHook`. Isolate this
check in its own function to keep the additional complexity outside of
`add_recent_object()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 55bb114353..7a42da5d39 100644
--- a/reachable.c
+++ b/reachable.c
@@ -69,6 +69,12 @@ struct recent_data {
 	int ignore_in_core_kept_packs;
 };
 
+static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
+			 struct recent_data *data)
+{
+	return mtime > data->timestamp;
+}
+
 static void add_recent_object(const struct object_id *oid,
 			      struct packed_git *pack,
 			      off_t offset,
@@ -78,7 +84,7 @@ static void add_recent_object(const struct object_id *oid,
 	struct object *obj;
 	enum object_type type;
 
-	if (mtime <= data->timestamp)
+	if (!obj_is_recent(oid, mtime, data))
 		return;
 
 	/*
-- 
2.40.1.554.g2ce8a79fa4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
  2023-05-11 23:20 ` [PATCH v3 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
@ 2023-05-11 23:20 ` Taylor Blau
  2023-05-12  4:58   ` Jeff King
  2023-05-12 21:24   ` Jeff King
  2023-05-11 23:23 ` [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-11 23:20 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano

This patch introduces a new multi-valued configuration option,
`pack.recentObjectsHook` as a means to mark certain objects as recent,
regardless of their age.

Depending on whether or not we are generating a cruft pack, this allows
the caller to do one of two things:

  - If generating a cruft pack, the caller is able to retain additional
    objects via the cruft pack, even if they would have otherwise been
    pruned due to their age.

  - If not generating a cruft pack, the caller is likewise able to
    retain additional objects as loose.

There is currently no option to be able to keep around certain objects
that have otherwise aged out of the grace period. The only way to retain
those objects is:

  - to point a reference at them, which may be undesirable or
    infeasible,

  - to track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer)

  - to extend the grace period, which may keep around other objects that
    the caller *does* want to discard,

  - or, to force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file.

This patch introduces a new configuration, `pack.recentObjectsHook`
which allows the caller to specify a program (or set of programs) whose
output is treated as a set of objects to treat as recent, regardless of
their true age.

The implementation is straightforward. In either case (cruft packs or
not), Git enumerates recent objects via
`add_unseen_recent_objects_to_traversal()`. That enumerates loose and
packed objects, and eventually calls add_recent_object() on any objects
for which `want_recent_object()`'s conditions are met.

This patch modifies the recency condition from simply "is the mtime of
this object more recent than the cutoff?" to "[...] or, is this object
mentioned by at least one `pack.recentObjectsHook`?".

We then add those as tips to another reachability traversal (along with
any recent objects, if pruning), marking every object along the way
(either adding it to the cruft pack, or writing it out as a loose
object).

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-reachable-tips`
that does the same thing as above, adding the visited set of objects
along the traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra reachable tips" programs is not closed,
then the resulting pack won't be either. This makes it impossible in the
general case to write out reachability bitmaps for that pack, since
closure is a requirement there.

Instead, keep these unreachable objects in the cruft pack (or set of
unreachable, loose objects) instead, to ensure that we can continue to
have a pack containing just reachable objects, which is always safe to
write a bitmap over.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt        |  13 ++
 builtin/pack-objects.c               |   1 +
 reachable.c                          |  94 ++++++++++++++-
 t/t5329-pack-objects-cruft.sh        | 171 +++++++++++++++++++++++++++
 t/t7701-repack-unpack-unreachable.sh |  22 ++++
 5 files changed, 297 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index d4c7c9d4e4..560cb42223 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -67,6 +67,19 @@ pack.deltaCacheLimit::
 	result once the best match for all objects is found.
 	Defaults to 1000. Maximum value is 65535.
 
+pack.recentObjectsHook::
+	When considering the recency of an object (e.g., when generating
+	a cruft pack or storing unreachable objects as loose), use the
+	shell to execute the specified command(s). Interpret their
+	output as object IDs which Git will consider as "recent",
+	regardless of their age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+Multiple hooks are supported, but all must exit successfully, else the
+operation (either generating a cruft pack or unpacking unreachable
+objects) will be halted.
+
 pack.threads::
 	Specifies the number of threads to spawn when searching for best
 	delta matches.  This requires that linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a5b466839b..bd6ad016d6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,6 +44,7 @@
 #include "pack-mtimes.h"
 #include "parse-options.h"
 #include "wrapper.h"
+#include "run-command.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
diff --git a/reachable.c b/reachable.c
index 7a42da5d39..4cca6fe93f 100644
--- a/reachable.c
+++ b/reachable.c
@@ -16,6 +16,8 @@
 #include "object-store.h"
 #include "pack-bitmap.h"
 #include "pack-mtimes.h"
+#include "config.h"
+#include "run-command.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -67,12 +69,82 @@ struct recent_data {
 	timestamp_t timestamp;
 	report_recent_object_fn *cb;
 	int ignore_in_core_kept_packs;
+
+	struct oidset extra_recent_oids;
+	int extra_recent_oids_loaded;
 };
 
+static int run_one_pack_recent_objects_hook(struct oidset *set,
+					    const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+
+	if (start_command(&cmd))
+		return -1;
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline(&buf, out) != EOF) {
+		struct object_id oid;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			goto done;
+		}
+
+		oidset_insert(set, &oid);
+	}
+
+	ret = finish_command(&cmd);
+
+done:
+	if (out)
+		fclose(out);
+	strbuf_release(&buf);
+	child_process_clear(&cmd);
+
+	return ret;
+}
+
+static void load_pack_recent_objects(struct recent_data *data)
+{
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	data->extra_recent_oids_loaded = 1;
+
+	if (git_config_get_string_multi("pack.recentobjectshook", &programs))
+		return;
+
+	for (i = 0; i < programs->nr; i++) {
+		ret = run_one_pack_recent_objects_hook(&data->extra_recent_oids,
+						       programs->items[i].string);
+		if (ret)
+			break;
+	}
+
+	if (ret)
+		die(_("unable to enumerate additional cruft tips"));
+}
+
 static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
 			 struct recent_data *data)
 {
-	return mtime > data->timestamp;
+	if (mtime > data->timestamp)
+		return 1;
+
+	if (!data->extra_recent_oids_loaded)
+		load_pack_recent_objects(data);
+	return oidset_contains(&data->extra_recent_oids, oid);
 }
 
 static void add_recent_object(const struct object_id *oid,
@@ -126,8 +198,14 @@ static int want_recent_object(struct recent_data *data,
 			      const struct object_id *oid)
 {
 	if (data->ignore_in_core_kept_packs &&
-	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS)) {
+		if (!data->extra_recent_oids_loaded)
+			load_pack_recent_objects(data);
+		if (oidset_contains(&data->extra_recent_oids, oid))
+			return 1;
+
 		return 0;
+	}
 	return 1;
 }
 
@@ -199,16 +277,24 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
 	data.cb = cb;
 	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
+	oidset_init(&data.extra_recent_oids, 0);
+	data.extra_recent_oids_loaded = 0;
+
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
-		return r;
+		goto done;
 
 	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
 	if (ignore_in_core_kept_packs)
 		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
 
-	return for_each_packed_object(add_recent_packed, &data, flags);
+	r = for_each_packed_object(add_recent_packed, &data, flags);
+
+done:
+	oidset_clear(&data.extra_recent_oids);
+
+	return r;
 }
 
 static int mark_object_seen(const struct object_id *oid,
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..ffd6c2eeb6 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
 	)
 '
 
+test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D discard old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config pack.recentObjectsHook ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the pack.extraCruftTips hook.
+		git config --unset pack.recentObjectsHook &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
+		sort cruft.raw >cruft.expect &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
+	)
+'
+
+test_expect_success 'multi-valued pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan cruft.a &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.a &&
+		cruft_a="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan cruft.b &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.b &&
+		cruft_b="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D cruft.a cruft.b &&
+		git reflog expire --all --expire=all &&
+
+		echo "echo $cruft_a" | write_script extra-tips.a &&
+		echo "echo $cruft_b" | write_script extra-tips.b &&
+		echo "false" | write_script extra-tips.c &&
+
+		git rev-list --objects --no-object-names $cruft_a $cruft_b \
+			>cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# ensure that each extra cruft tip is saved by its
+		# respective hook
+		git config --add pack.recentObjectsHook ./extra-tips.a &&
+		git config --add pack.recentObjectsHook ./extra-tips.b &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure that a dirty exit halts cruft pack generation
+		git config --add pack.recentObjectsHook ./extra-tips.c &&
+		test_must_fail git repack --cruft --cruft-expiration=now -d 2>err &&
+		grep "unable to enumerate additional cruft tips" err &&
+
+		# and that the existing cruft pack is left alone
+		test_path_is_file "$mtimes"
+	)
+'
+
+test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config pack.recentObjectsHook ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index ebb267855f..d2eea6e754 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -113,6 +113,28 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'extra recent tips are kept regardless of age' '
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	git prune-packed &&
+
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+
+	write_script extra-tips <<-EOF &&
+	echo $obj2
+	EOF
+	git config pack.recentObjectsHook ./extra-tips &&
+
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git repack -A -d --unpack-unreachable=1.hour.ago &&
+
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&
-- 
2.40.1.554.g2ce8a79fa4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook`
  2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
  2023-05-11 23:20 ` [PATCH v3 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
  2023-05-11 23:20 ` [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook` Taylor Blau
@ 2023-05-11 23:23 ` Taylor Blau
  2023-05-11 23:39 ` Junio C Hamano
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-11 23:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano

On Thu, May 11, 2023 at 07:20:31PM -0400, Taylor Blau wrote:
> Here is a reworked version of the patch which introduced a new
> configuration `pack.extraCruftTips` to keep additional objects from
> pruning during a cruft pack-generating GC.

I'm starting a new thread disconnected from the old one since the
approach is significantly different (but mostly because I forgot to set
`--in-reply-to` when preparing these patches ;-)).

Strictly speaking, I think this is technically "v4" of the original
topic, so v3 here is a mislabel.

Those interested in following along with the earlier round(s) of
discussion can see the original thread at [1].

Thanks,
Taylor

[1]: https://lore.kernel.org/git/8af478ebe34539b68ffb9b353bbb1372dfca3871.1682011600.git.me@ttaylorr.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook`
  2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
                   ` (2 preceding siblings ...)
  2023-05-11 23:23 ` [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
@ 2023-05-11 23:39 ` Junio C Hamano
  2023-05-11 23:48   ` Taylor Blau
  2023-05-16  0:23 ` [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
  2023-06-07 22:58 ` [PATCH v5 0/2] " Taylor Blau
  5 siblings, 1 reply; 28+ messages in thread
From: Junio C Hamano @ 2023-05-11 23:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Chris Torek, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Here is a reworked version of the patch which introduced a new
> configuration `pack.extraCruftTips` to keep additional objects from
> pruning during a cruft pack-generating GC.
>
> The third round was significantly more complicated than necessary, and I

Hopefully you just meant "the previous round"; in a follow-up you
mention that this is v4 and not v3, which is also a good explanation
for that "third round was too complex".

>   - The new code does not change the existing cruft pack implementation
>     whatsoever. This is nice, since cruft packs will be the default in
>     the next release, so changing that implementation carries additional
>     risk.
>
>   - The new code is also not specific to cruft packs, meaning that you
>     can do things like:
>
>         $ git -c pack.extraRecentObjectsHook=... \
>           repack -Ad --unpack-unreachable=1.hour.ago
>
>     and have it write out loose copies of any object(s) mentioned by one
>     or more of the configured hooks.

OK.

> I am hopeful that others think this version is in a good spot. As in
> earlier rounds, I would appreciate an extra careful review on this
> topic, because of the changing default I mentioned earlier.
>
> Thanks in advance for your review.

Thanks for working on it.  I am in the process of pushing out
today's integration cycle already, so this topic will have to wait
until tomorrow's, though.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook`
  2023-05-11 23:39 ` Junio C Hamano
@ 2023-05-11 23:48   ` Taylor Blau
  0 siblings, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-11 23:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Chris Torek, Derrick Stolee

On Thu, May 11, 2023 at 04:39:26PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Here is a reworked version of the patch which introduced a new
> > configuration `pack.extraCruftTips` to keep additional objects from
> > pruning during a cruft pack-generating GC.
> >
> > The third round was significantly more complicated than necessary, and I
>
> Hopefully you just meant "the previous round"; in a follow-up you
> mention that this is v4 and not v3, which is also a good explanation
> for that "third round was too complex".

Yes, this series being the true fourth round, I was referring to the
version I sent in:

  https://lore.kernel.org/git/27a7f16aab35b5cac391d9831aadb0f2e2146313.1683151485.git.me@ttaylorr.com/

...which this version is significantly less complicated than.

> > I am hopeful that others think this version is in a good spot. As in
> > earlier rounds, I would appreciate an extra careful review on this
> > topic, because of the changing default I mentioned earlier.
> >
> > Thanks in advance for your review.
>
> Thanks for working on it.  I am in the process of pushing out
> today's integration cycle already, so this topic will have to wait
> until tomorrow's, though.

No problem. Happy merging ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-11 23:20 ` [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook` Taylor Blau
@ 2023-05-12  4:58   ` Jeff King
  2023-05-15 20:15     ` Taylor Blau
  2023-05-12 21:24   ` Jeff King
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff King @ 2023-05-12  4:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Thu, May 11, 2023 at 07:20:37PM -0400, Taylor Blau wrote:

> +static int run_one_pack_recent_objects_hook(struct oidset *set,
> +					    const char *args)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +
> +	if (start_command(&cmd))
> +		return -1;
> +
> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline(&buf, out) != EOF) {
> +		struct object_id oid;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}
> +
> +		oidset_insert(set, &oid);
> +	}
> +
> +	ret = finish_command(&cmd);
> +
> +done:

I haven't looked closely at this whole patch yet (and I especially want
to look at the new tests since this approach covers more cases), but I
did notice that this version of the function still has the "we don't
reap the child on parse failure" problem I described in:

  https://lore.kernel.org/git/20230505221921.GE3321533@coredump.intra.peff.net/

I'll try to give the whole patch a more thorough review tomorrow.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-11 23:20 ` [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook` Taylor Blau
  2023-05-12  4:58   ` Jeff King
@ 2023-05-12 21:24   ` Jeff King
  2023-05-12 21:36     ` Taylor Blau
                       ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Jeff King @ 2023-05-12 21:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Thu, May 11, 2023 at 07:20:37PM -0400, Taylor Blau wrote:

> This patch introduces a new multi-valued configuration option,
> `pack.recentObjectsHook` as a means to mark certain objects as recent,
> regardless of their age.
> 
> Depending on whether or not we are generating a cruft pack, this allows
> the caller to do one of two things:
> 
>   - If generating a cruft pack, the caller is able to retain additional
>     objects via the cruft pack, even if they would have otherwise been
>     pruned due to their age.
> 
>   - If not generating a cruft pack, the caller is likewise able to
>     retain additional objects as loose.

I think this description suffers a bit from being adapted from the
original patch which was targeting cruft packs. It's not clear to me
what "the caller" means here. And really, I think this is getting into
the details before giving an overview and motivation.

I'd expect something the rationale to be something like:

  1. Git considers the recent-ness of objects when deciding how to handle
     unreachable objects (discarding those that are too old, and keeping
     new-enough ones)

  2. you might want to consider more objects as "recent" because you know
     something Git doesn't (that certain objects are more interesting
     than others; specific examples may help here)

  3. so this patch gives you a way to override the recent-ness check for
     certain objects

And then get into the details of the implementation, why other solutions
don't work, etc.

> There is currently no option to be able to keep around certain objects
> that have otherwise aged out of the grace period. The only way to retain
> those objects is:
> 
>   - to point a reference at them, which may be undesirable or
>     infeasible,
> 
>   - to track them via the reflog, which may be undesirable since the
>     reflog's lifetime is limited to that of the reference it's tracking
>     (and callers may want to keep those unreachable objects around for
>     longer)
> 
>   - to extend the grace period, which may keep around other objects that
>     the caller *does* want to discard,
> 
>   - or, to force the caller to construct the pack of objects they want
>     to keep themselves, and then mark the pack as kept by adding a
>     ".keep" file.

One option I don't see here is: update the mtime on the objects you want
to salvage.

Why would we want this patch instead of just having the caller update
the mtimes of objects (or in a cruft-pack world, call a command that
rewrites the .mtimes file with new values)?

I can think of some possible arguments against it (you might want to
retain the old mtimes, or you might find it a hassle to have to
continually update them before gc kills them). But I think the commit
message should probably make those arguments.

Likewise, I think you'd want to argue why the ".keep" approach isn't as
good (and I guess it is "having extra packs is awkward especially as you
roll forward your cruft packs").

> This patch introduces a new configuration, `pack.recentObjectsHook`
> which allows the caller to specify a program (or set of programs) whose
> output is treated as a set of objects to treat as recent, regardless of
> their true age.

I was going to complain about putting this in the "pack" section,
because I thought by touching reachable.c, we'd also affect git-prune.
But I don't think we do, because it does its own direct mtime check on
the loose objects.

But I'm not sure that's the right behavior.

It feels like even before your patch, this is a huge gap in our
object-retention strategy.  During repacking, we try to avoid dropping
objects which are reachable from recent-but-unreachable things we're
keeping (since otherwise it effectively corrupts those recent objects,
making them less valuable to keep). But git-prune will happily drop them
anyway!

And I think the same thing would apply to your hook. If the hook says
"object XYZ is precious even if unreachable, keep it", then git-prune
ignoring that seems like it would be a source of errors.

I suspect both could be fixed by having git-prune trigger the same
add_unseen_recent_objects_to_traversal() call either as part of
the perform_reachability_traversal() walk, or maybe in its own walk (I
think maybe it has to be its own because the second walk should avoid
complaining about missing objects).

> We then add those as tips to another reachability traversal (along with
> any recent objects, if pruning), marking every object along the way
> (either adding it to the cruft pack, or writing it out as a loose
> object).

I didn't understand this "if pruning" comment. If we are not pruning at
all, wouldn't we skip the extra traversal entirely, since we know we are
saving everything?

> A potential alternative here is to introduce a new mode to alter the
> contents of the reachable pack instead of the cruft one. One could
> imagine a new option to `pack-objects`, say `--extra-reachable-tips`
> that does the same thing as above, adding the visited set of objects
> along the traversal to the pack.
> 
> But this has the unfortunate side-effect of altering the reachability
> closure of that pack. If parts of the unreachable object graph mentioned
> by one or more of the "extra reachable tips" programs is not closed,
> then the resulting pack won't be either. This makes it impossible in the
> general case to write out reachability bitmaps for that pack, since
> closure is a requirement there.

Makes sense.

> +	while (strbuf_getline(&buf, out) != EOF) {
> +		struct object_id oid;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}
> +
> +		oidset_insert(set, &oid);
> +	}
> +
> +	ret = finish_command(&cmd);
> +
> +done:

I mentioned the "goto done" issue in another email.

>  static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
>  			 struct recent_data *data)
>  {
> -	return mtime > data->timestamp;
> +	if (mtime > data->timestamp)
> +		return 1;
> +
> +	if (!data->extra_recent_oids_loaded)
> +		load_pack_recent_objects(data);
> +	return oidset_contains(&data->extra_recent_oids, oid);
>  }

OK, so this is where we override the timestamp check. Makes sense.

> @@ -126,8 +198,14 @@ static int want_recent_object(struct recent_data *data,
>  			      const struct object_id *oid)
>  {
>  	if (data->ignore_in_core_kept_packs &&
> -	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
> +	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS)) {
> +		if (!data->extra_recent_oids_loaded)
> +			load_pack_recent_objects(data);
> +		if (oidset_contains(&data->extra_recent_oids, oid))
> +			return 1;
> +
>  		return 0;
> +	}
>  	return 1;
>  }

This hunk I'm less sure about. The purpose of this function is that the
caller has told us about some packs which are "special", and we avoid
adding their objects to the traversal.

This kicks in for cruft packs, when the git-repack caller says "I just
made pack xyz.pack; do not bother saving anything in it to the cruft
pack, since xyz.pack is here to stay". So if a hook says "you should
keep object X", why would we want to override that check? It is already
a reachable object that has been packed into xyz.pack, so we know there
is no point in even considering its recency.

> @@ -199,16 +277,24 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
>  	data.cb = cb;
>  	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
>  
> +	oidset_init(&data.extra_recent_oids, 0);
> +	data.extra_recent_oids_loaded = 0;
> +
>  	r = for_each_loose_object(add_recent_loose, &data,
>  				  FOR_EACH_OBJECT_LOCAL_ONLY);
>  	if (r)
> -		return r;
> +		goto done;
>  
>  	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
>  	if (ignore_in_core_kept_packs)
>  		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
>  
> -	return for_each_packed_object(add_recent_packed, &data, flags);
> +	r = for_each_packed_object(add_recent_packed, &data, flags);
> +
> +done:
> +	oidset_clear(&data.extra_recent_oids);
> +
> +	return r;
>  }

This handles the cleanup that I was too lazy to do in my earlier patch.
Good. :)

> --- a/t/t5329-pack-objects-cruft.sh
> +++ b/t/t5329-pack-objects-cruft.sh
> @@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
>  	)
>  '
>  
> +test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '

This title (and others below) seems out of date. :)

> diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
> index ebb267855f..d2eea6e754 100755
> --- a/t/t7701-repack-unpack-unreachable.sh
> +++ b/t/t7701-repack-unpack-unreachable.sh
> @@ -113,6 +113,28 @@ test_expect_success 'do not bother loosening old objects' '
>  	test_must_fail git cat-file -p $obj2
>  '
>  
> +test_expect_success 'extra recent tips are kept regardless of age' '
> +	obj1=$(echo one | git hash-object -w --stdin) &&
> +	obj2=$(echo two | git hash-object -w --stdin) &&
> +	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
> +	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
> +	git prune-packed &&
> +
> +	git cat-file -p $obj1 &&
> +	git cat-file -p $obj2 &&
> +
> +	write_script extra-tips <<-EOF &&
> +	echo $obj2
> +	EOF
> +	git config pack.recentObjectsHook ./extra-tips &&
> +
> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
> +	git repack -A -d --unpack-unreachable=1.hour.ago &&
> +
> +	git cat-file -p $obj1 &&
> +	git cat-file -p $obj2
> +'

And this is the new test in this iteration covering the "repack -A"
case.

It is checking that $obj2, which our hook mentions, is saved. It also
checks that $obj1 is saved because it is still recent. But there are two
other possibly interesting cases:

  - an object that is too old and is _not_ saved. It seems useful to
    confirm that the new patch does not simply break the ability to drop
    objects. ;)

  - an object that is reachable from $obj2 is also saved. From a
    white-box perspective this is less interesting, because we should
    already test elsewhere that this works for recent objects, and we
    know the new feature is implemented by faking recency. But it might
    be worth it for completeness, and because it's easy to do (making
    $obj2 a tag pointing to a blob should work).

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 21:24   ` Jeff King
@ 2023-05-12 21:36     ` Taylor Blau
  2023-05-12 21:46       ` Jeff King
  2023-05-12 21:45     ` Jeff King
  2023-05-15 20:38     ` Taylor Blau
  2 siblings, 1 reply; 28+ messages in thread
From: Taylor Blau @ 2023-05-12 21:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 05:24:56PM -0400, Jeff King wrote:
> > This patch introduces a new configuration, `pack.recentObjectsHook`
> > which allows the caller to specify a program (or set of programs) whose
> > output is treated as a set of objects to treat as recent, regardless of
> > their true age.
>
> I was going to complain about putting this in the "pack" section,
> because I thought by touching reachable.c, we'd also affect git-prune.
> But I don't think we do, because it does its own direct mtime check on
> the loose objects.
>
> But I'm not sure that's the right behavior.
>
> It feels like even before your patch, this is a huge gap in our
> object-retention strategy.  During repacking, we try to avoid dropping
> objects which are reachable from recent-but-unreachable things we're
> keeping (since otherwise it effectively corrupts those recent objects,
> making them less valuable to keep). But git-prune will happily drop them
> anyway!
>
> And I think the same thing would apply to your hook. If the hook says
> "object XYZ is precious even if unreachable, keep it", then git-prune
> ignoring that seems like it would be a source of errors.
>
> I suspect both could be fixed by having git-prune trigger the same
> add_unseen_recent_objects_to_traversal() call either as part of
> the perform_reachability_traversal() walk, or maybe in its own walk (I
> think maybe it has to be its own because the second walk should avoid
> complaining about missing objects).

I might be missing something, but I think we already (kind of) do the
right thing here.

AFAICT, the path is:

  - cmd_prune()
  - for_each_loose_file_in_objdir()
  - prune_object() (as a callback to the above)
  - is_object_reachable()
  - perform_reachability_traversal()
  - mark_reachable_objects()
  - add_unseen_recent_objects_to_traversal()

That only happens when `mark_recent != 0`, though.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 21:24   ` Jeff King
  2023-05-12 21:36     ` Taylor Blau
@ 2023-05-12 21:45     ` Jeff King
  2023-05-12 22:01       ` Jeff King
  2023-05-15 20:49       ` Taylor Blau
  2023-05-15 20:38     ` Taylor Blau
  2 siblings, 2 replies; 28+ messages in thread
From: Jeff King @ 2023-05-12 21:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 05:24:56PM -0400, Jeff King wrote:

> > This patch introduces a new configuration, `pack.recentObjectsHook`
> > which allows the caller to specify a program (or set of programs) whose
> > output is treated as a set of objects to treat as recent, regardless of
> > their true age.
> 
> I was going to complain about putting this in the "pack" section,
> because I thought by touching reachable.c, we'd also affect git-prune.
> But I don't think we do, because it does its own direct mtime check on
> the loose objects.
> 
> But I'm not sure that's the right behavior.
> 
> It feels like even before your patch, this is a huge gap in our
> object-retention strategy.  During repacking, we try to avoid dropping
> objects which are reachable from recent-but-unreachable things we're
> keeping (since otherwise it effectively corrupts those recent objects,
> making them less valuable to keep). But git-prune will happily drop them
> anyway!
> 
> And I think the same thing would apply to your hook. If the hook says
> "object XYZ is precious even if unreachable, keep it", then git-prune
> ignoring that seems like it would be a source of errors.
> 
> I suspect both could be fixed by having git-prune trigger the same
> add_unseen_recent_objects_to_traversal() call either as part of
> the perform_reachability_traversal() walk, or maybe in its own walk (I
> think maybe it has to be its own because the second walk should avoid
> complaining about missing objects).

<phew> I am happy to say that I was wrong here, and git-prune behaves as
it should, courtesy of d3038d22f9 (prune: keep objects reachable from
recent objects, 2014-10-15). The magic happens in mark_reachable_objects(),
which handles walking the recent objects by calling...you guessed it,
add_unseen_recent_objects_to_traversal().

So it does the right thing now, and your patch should kick in
automatically for git-prune, too. But I think we'd want two things:

  1. Should the config variable name be made more generic to match?
     Maybe "core" is too broad (though certainly I'd expect it to apply
     anywhere in Git where we check recent-ness of objects), but perhaps
     "gc" would make sense (even though it is not strictly part of the
     gc command, it is within that realm of concepts).

  2. We probably want a test covering git-prune in this situation.

The thing that confused me when looking at the code earlier is that
git-prune itself checks the mtime of the objects, too, even if they were
not mentioned in the recent-but-reachable walk. That seems redundant to
me, but somehow it isn't. If I do:

diff --git a/builtin/prune.c b/builtin/prune.c
index 5dc9b20720..22b7ce4b10 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -92,7 +92,7 @@ static int prune_object(const struct object_id *oid, const char *fullpath,
 		return 0;
 	}
 	if (st.st_mtime > expire)
-		return 0;
+		BUG("object should have been saved via is_object_reachable!");
 	if (show_only || verbose) {
 		enum object_type type = oid_object_info(the_repository, oid,
 							NULL);

then t5304 shows some failures. I'm not quite sure what is going on, but
_if_ that mtime check in prune_object() is important, then it probably
should also respect the hook mechanism. So there may be a (3) to add to
your patch there.

-Peff

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 21:36     ` Taylor Blau
@ 2023-05-12 21:46       ` Jeff King
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff King @ 2023-05-12 21:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 05:36:15PM -0400, Taylor Blau wrote:

> > I suspect both could be fixed by having git-prune trigger the same
> > add_unseen_recent_objects_to_traversal() call either as part of
> > the perform_reachability_traversal() walk, or maybe in its own walk (I
> > think maybe it has to be its own because the second walk should avoid
> > complaining about missing objects).
> 
> I might be missing something, but I think we already (kind of) do the
> right thing here.
> 
> AFAICT, the path is:
> 
>   - cmd_prune()
>   - for_each_loose_file_in_objdir()
>   - prune_object() (as a callback to the above)
>   - is_object_reachable()
>   - perform_reachability_traversal()
>   - mark_reachable_objects()
>   - add_unseen_recent_objects_to_traversal()
> 
> That only happens when `mark_recent != 0`, though.

Yeah, I'm sorry, I was totally wrong here. See the mail I just sent that
crossed paths with yours.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 21:45     ` Jeff King
@ 2023-05-12 22:01       ` Jeff King
  2023-05-12 23:21         ` Junio C Hamano
  2023-05-15 20:49       ` Taylor Blau
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff King @ 2023-05-12 22:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 05:45:43PM -0400, Jeff King wrote:

> The thing that confused me when looking at the code earlier is that
> git-prune itself checks the mtime of the objects, too, even if they were
> not mentioned in the recent-but-reachable walk. That seems redundant to
> me, but somehow it isn't. If I do:
> 
> diff --git a/builtin/prune.c b/builtin/prune.c
> index 5dc9b20720..22b7ce4b10 100644
> --- a/builtin/prune.c
> +++ b/builtin/prune.c
> @@ -92,7 +92,7 @@ static int prune_object(const struct object_id *oid, const char *fullpath,
>  		return 0;
>  	}
>  	if (st.st_mtime > expire)
> -		return 0;
> +		BUG("object should have been saved via is_object_reachable!");
>  	if (show_only || verbose) {
>  		enum object_type type = oid_object_info(the_repository, oid,
>  							NULL);
> 
> then t5304 shows some failures. I'm not quite sure what is going on, but
> _if_ that mtime check in prune_object() is important, then it probably
> should also respect the hook mechanism. So there may be a (3) to add to
> your patch there.

Ah, I see. The problem is that "git prune --expire=never" relies on this
check. In mark_reachable_objects(), we take a "0" expiration as "do not
bother adding these objects to the traversal", so it's here that we
catch them and say "ah, we are not expiring anything, so keep this".

If my hack above is switched to:

diff --git a/builtin/prune.c b/builtin/prune.c
index 5dc9b20720..108305cd26 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -91,8 +91,11 @@ static int prune_object(const struct object_id *oid, const char *fullpath,
 		error("Could not stat '%s'", fullpath);
 		return 0;
 	}
-	if (st.st_mtime > expire)
-		return 0;
+	if (st.st_mtime > expire) {
+		if (!expire)
+			return 0; /* we are not expiring anything! */
+		BUG("object should have been saved via is_object_reachable!");
+	}
 	if (show_only || verbose) {
 		enum object_type type = oid_object_info(the_repository, oid,
 							NULL);

then all of the tests pass. I do suspect this code could trigger racily
in the real world (we do our traversal, and then a new object is added,
which is of course recent). So we would not want to drop the check.

Is it important there for your patch to kick in and say "we were
specifically asked to consider this object recent, so do so"? I think it
shouldn't matter, because any such object is by definition recent,
having an mtime greater than or equal to when we started (it cannot even
get racily old while we traverse, because we decide the cutoff time as
an absolute value even before starting the traversal). The strict
greater-than in the check makes it feel like an off-by-one problem, but
unless you are saying "now" it must be at least 1 second in the past
anyway.

It does make me wonder why "git prune --expire=never" does not simply
exit immediately. Isn't there by definition nothing useful to do?

-Peff

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 22:01       ` Jeff King
@ 2023-05-12 23:21         ` Junio C Hamano
  2023-05-13  0:11           ` Jeff King
  0 siblings, 1 reply; 28+ messages in thread
From: Junio C Hamano @ 2023-05-12 23:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Chris Torek, Derrick Stolee

Jeff King <peff@peff.net> writes:

> It does make me wonder why "git prune --expire=never" does not simply
> exit immediately. Isn't there by definition nothing useful to do?

I think the answer to the question is "no", but if I have to guess
why such a low-hanging fruit optimization possibility has not been
exploited so far is because it does not optimize a useful case,
perhaps?  IOW, falling into "if it hurts, then don't do it" category?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 23:21         ` Junio C Hamano
@ 2023-05-13  0:11           ` Jeff King
  2023-05-13  0:11             ` Jeff King
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2023-05-13  0:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, Chris Torek, Derrick Stolee

On Fri, May 12, 2023 at 04:21:44PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It does make me wonder why "git prune --expire=never" does not simply
> > exit immediately. Isn't there by definition nothing useful to do?
> 
> I think the answer to the question is "no", but if I have to guess
> why such a low-hanging fruit optimization possibility has not been
> exploited so far is because it does not optimize a useful case,
> perhaps?  IOW, falling into "if it hurts, then don't do it" category?

I think it would come up any time you run "git gc", if you've set
gc.pruneExpire to "never". And then it wastes time running a full object
walk (which is 30+ seconds for linux.git) even though it won't do
anything useful.

Curiously, git-gc is sprinkled with "if (prune_expire)" logic, including
skipping the call to git-prune entirely. But that only kicks in if you
run "git gc --no-prune". If "never" is truly the same thing (and I
cannot off the top of my head think of a reason that it isn't), then
this:

diff --git a/builtin/gc.c b/builtin/gc.c
index f3942188a6..5118535a4d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -610,6 +610,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	if (prune_expire && parse_expiry_date(prune_expire, &dummy))
 		die(_("failed to parse prune expiry value %s"), prune_expire);
+	if (prune_expire && !dummy)
+		prune_expire = NULL; /* "never" same as --no-prune */
 
 	if (aggressive) {
 		strvec_push(&repack, "-f");

would bring the two cases in sync (I tried to minimize the diff for
illustration; probably some light refactoring would be in order).

I guess nobody cared because it is not that common to set pruneExpire to
"never". We did something like that at GitHub, but we always drove
repack and prune through our own scripts, not through git-gc. I don't
have access to those scripts anymore, but I'm pretty sure they just
skipped calling git-prune entirely for this case.

So yeah, I think it may just be a curiosity that nobody noticed and
bothered to optimize it. I am tempted to work the above into a proper
patch. We even do something similar already for the reflog expiration
variables.

-Peff

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-13  0:11           ` Jeff King
@ 2023-05-13  0:11             ` Jeff King
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff King @ 2023-05-13  0:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, Chris Torek, Derrick Stolee

On Fri, May 12, 2023 at 08:11:18PM -0400, Jeff King wrote:

> So yeah, I think it may just be a curiosity that nobody noticed and
> bothered to optimize it. I am tempted to work the above into a proper
> patch. We even do something similar already for the reflog expiration
> variables.

Oh, and in case it was not clear: this is all well off Taylor's original
topic. Whatever we do (or don't do) should be its own series.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12  4:58   ` Jeff King
@ 2023-05-15 20:15     ` Taylor Blau
  0 siblings, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-15 20:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 12:58:53AM -0400, Jeff King wrote:
> I haven't looked closely at this whole patch yet (and I especially want
> to look at the new tests since this approach covers more cases), but I
> did notice that this version of the function still has the "we don't
> reap the child on parse failure" problem I described in:
>
>   https://lore.kernel.org/git/20230505221921.GE3321533@coredump.intra.peff.net/

Hmmph. I could have sworn that I remember including that feedback in the
new round, but I must have dropped it on the floor somewhere.

Thanks for pointing it out, I'm 99% sure that it'll be in the next round
;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 21:24   ` Jeff King
  2023-05-12 21:36     ` Taylor Blau
  2023-05-12 21:45     ` Jeff King
@ 2023-05-15 20:38     ` Taylor Blau
  2 siblings, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-15 20:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 05:24:56PM -0400, Jeff King wrote:
> I think this description suffers a bit from being adapted from the
> original patch which was targeting cruft packs. It's not clear to me
> what "the caller" means here. And really, I think this is getting into
> the details before giving an overview and motivation.
>
> I'd expect something the rationale to be something like:

Re-reading it myself, I tend to agree with you. I modified it quite a
bit, and I'm much happier with the result. Thanks for mentioning it.

> One option I don't see here is: update the mtime on the objects you want
> to salvage.
>
> Why would we want this patch instead of just having the caller update
> the mtimes of objects (or in a cruft-pack world, call a command that
> rewrites the .mtimes file with new values)?
>
> I can think of some possible arguments against it (you might want to
> retain the old mtimes, or you might find it a hassle to have to
> continually update them before gc kills them). But I think the commit
> message should probably make those arguments.

I agree with everything you wrote here.

> > We then add those as tips to another reachability traversal (along with
> > any recent objects, if pruning), marking every object along the way
> > (either adding it to the cruft pack, or writing it out as a loose
> > object).
>
> I didn't understand this "if pruning" comment. If we are not pruning at
> all, wouldn't we skip the extra traversal entirely, since we know we are
> saving everything?

I was talking about the rescuing traversal for generating a cruft pack.
But I ended up dropping this whole paragraph anyway, since I don't think
it's adding anything in the context of the new patch message.

> > @@ -126,8 +198,14 @@ static int want_recent_object(struct recent_data *data,
> >  			      const struct object_id *oid)
> >  {
> >  	if (data->ignore_in_core_kept_packs &&
> > -	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
> > +	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS)) {
> > +		if (!data->extra_recent_oids_loaded)
> > +			load_pack_recent_objects(data);
> > +		if (oidset_contains(&data->extra_recent_oids, oid))
> > +			return 1;
> > +
> >  		return 0;
> > +	}
> >  	return 1;
> >  }
>
> This hunk I'm less sure about. The purpose of this function is that the
> caller has told us about some packs which are "special", and we avoid
> adding their objects to the traversal.
>
> This kicks in for cruft packs, when the git-repack caller says "I just
> made pack xyz.pack; do not bother saving anything in it to the cruft
> pack, since xyz.pack is here to stay". So if a hook says "you should
> keep object X", why would we want to override that check? It is already
> a reachable object that has been packed into xyz.pack, so we know there
> is no point in even considering its recency.

Yup, you're absolutely right here. Thanks for catching it.

> > --- a/t/t5329-pack-objects-cruft.sh
> > +++ b/t/t5329-pack-objects-cruft.sh
> > @@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
> >  	)
> >  '
> >
> > +test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
>
> This title (and others below) seems out of date. :)

Thanks for noticing, fixed.

> > diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
> > index ebb267855f..d2eea6e754 100755
> > --- a/t/t7701-repack-unpack-unreachable.sh
> > +++ b/t/t7701-repack-unpack-unreachable.sh
> > @@ -113,6 +113,28 @@ test_expect_success 'do not bother loosening old objects' '
> >  	test_must_fail git cat-file -p $obj2
> >  '
> >
> > +test_expect_success 'extra recent tips are kept regardless of age' '
> > +	obj1=$(echo one | git hash-object -w --stdin) &&
> > +	obj2=$(echo two | git hash-object -w --stdin) &&
> > +	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
> > +	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
> > +	git prune-packed &&
> > +
> > +	git cat-file -p $obj1 &&
> > +	git cat-file -p $obj2 &&
> > +
> > +	write_script extra-tips <<-EOF &&
> > +	echo $obj2
> > +	EOF
> > +	git config pack.recentObjectsHook ./extra-tips &&
> > +
> > +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
> > +	git repack -A -d --unpack-unreachable=1.hour.ago &&
> > +
> > +	git cat-file -p $obj1 &&
> > +	git cat-file -p $obj2
> > +'
>
> And this is the new test in this iteration covering the "repack -A"
> case.
>
> It is checking that $obj2, which our hook mentions, is saved. It also
> checks that $obj1 is saved because it is still recent. But there are two
> other possibly interesting cases:
>
>   - an object that is too old and is _not_ saved. It seems useful to
>     confirm that the new patch does not simply break the ability to drop
>     objects. ;)
>
>   - an object that is reachable from $obj2 is also saved. From a
>     white-box perspective this is less interesting, because we should
>     already test elsewhere that this works for recent objects, and we
>     know the new feature is implemented by faking recency. But it might
>     be worth it for completeness, and because it's easy to do (making
>     $obj2 a tag pointing to a blob should work).

All very good cases to check for. Here's a patch on top (which I'll
obviously squash into my new version, but figured I'd send it as a
response to you directly, too):

--- 8< ---
S
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index d2eea6e754..fa2df6016b 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -116,23 +116,33 @@ test_expect_success 'do not bother loosening old objects' '
 test_expect_success 'extra recent tips are kept regardless of age' '
 	obj1=$(echo one | git hash-object -w --stdin) &&
 	obj2=$(echo two | git hash-object -w --stdin) &&
+	obj3=$(echo three | git hash-object -w --stdin) &&
 	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
 	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	pack3=$(echo $obj3 | git pack-objects .git/objects/pack/pack) &&
 	git prune-packed &&

 	git cat-file -p $obj1 &&
 	git cat-file -p $obj2 &&
+	git cat-file -p $obj3 &&

-	write_script extra-tips <<-EOF &&
-	echo $obj2
+	git tag -a -m tag obj2-tag $obj2 &&
+	obj2_tag="$(git rev-parse obj2-tag)" &&
+
+
+	write_script precious-objects <<-EOF &&
+	echo $obj2_tag
 	EOF
-	git config pack.recentObjectsHook ./extra-tips &&
+	git config pack.recentObjectsHook ./precious-objects &&

 	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack3.pack &&
 	git repack -A -d --unpack-unreachable=1.hour.ago &&

 	git cat-file -p $obj1 &&
-	git cat-file -p $obj2
+	git cat-file -p $obj2 &&
+	git cat-file -p $obj2_tag &&
+	test_must_fail git cat-file -p $obj3
 '

 test_expect_success 'keep packed objects found only in index' '
--- >8 ---

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook`
  2023-05-12 21:45     ` Jeff King
  2023-05-12 22:01       ` Jeff King
@ 2023-05-15 20:49       ` Taylor Blau
  1 sibling, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-15 20:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Derrick Stolee, Junio C Hamano

On Fri, May 12, 2023 at 05:45:42PM -0400, Jeff King wrote:
> On Fri, May 12, 2023 at 05:24:56PM -0400, Jeff King wrote:
>
> > > This patch introduces a new configuration, `pack.recentObjectsHook`
> > > which allows the caller to specify a program (or set of programs) whose
> > > output is treated as a set of objects to treat as recent, regardless of
> > > their true age.
> >
> > I was going to complain about putting this in the "pack" section,
> > because I thought by touching reachable.c, we'd also affect git-prune.
> > But I don't think we do, because it does its own direct mtime check on
> > the loose objects.
> >
> > But I'm not sure that's the right behavior.
> >
> > It feels like even before your patch, this is a huge gap in our
> > object-retention strategy.  During repacking, we try to avoid dropping
> > objects which are reachable from recent-but-unreachable things we're
> > keeping (since otherwise it effectively corrupts those recent objects,
> > making them less valuable to keep). But git-prune will happily drop them
> > anyway!
> >
> > And I think the same thing would apply to your hook. If the hook says
> > "object XYZ is precious even if unreachable, keep it", then git-prune
> > ignoring that seems like it would be a source of errors.
> >
> > I suspect both could be fixed by having git-prune trigger the same
> > add_unseen_recent_objects_to_traversal() call either as part of
> > the perform_reachability_traversal() walk, or maybe in its own walk (I
> > think maybe it has to be its own because the second walk should avoid
> > complaining about missing objects).
>
> <phew> I am happy to say that I was wrong here, and git-prune behaves as
> it should, courtesy of d3038d22f9 (prune: keep objects reachable from
> recent objects, 2014-10-15). The magic happens in mark_reachable_objects(),
> which handles walking the recent objects by calling...you guessed it,
> add_unseen_recent_objects_to_traversal().

Phew. Thanks for digging into it before I was able to respond. I'm glad
that this works (though I agree that we should add a test).

> So it does the right thing now, and your patch should kick in
> automatically for git-prune, too. But I think we'd want two things:
>
>   1. Should the config variable name be made more generic to match?
>      Maybe "core" is too broad (though certainly I'd expect it to apply
>      anywhere in Git where we check recent-ness of objects), but perhaps
>      "gc" would make sense (even though it is not strictly part of the
>      gc command, it is within that realm of concepts).

"core" does feel pretty broad. There's some precedence for adding
hook-like configuration there, at least with
`core.alternateRefsCommand`. But I think that was an appropriate choice
given the scope of that feature.

I think that calling it "gc.recentObjectsHook" makes the most sense.

>   2. We probably want a test covering git-prune in this situation.

Yup.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook`
  2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
                   ` (3 preceding siblings ...)
  2023-05-11 23:39 ` Junio C Hamano
@ 2023-05-16  0:23 ` Taylor Blau
  2023-05-16  0:24   ` [PATCH v4 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
  2023-05-16  0:24   ` [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
  2023-06-07 22:58 ` [PATCH v5 0/2] " Taylor Blau
  5 siblings, 2 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-16  0:23 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano, Chris Torek

Here is another reworked version of the patch which introduced a new
configuration `gc.extraCruftTips` to keep additional objects when
pruning which might have otherwise aged out of the repository.

Notable changes since last time include:

  - updating the name of this configuration to "gc.recentObjectsHook",
  - significantly reworking the substantive patch's message
  - improved test coverage to cover more cases of loose object pruning,
    as well as `git prune`'s interaction with `gc.recentObjectsHook`.

(Again, since we're expecting -rc0, I don't expect a ton of movement on
this series before 2.41 is shipped, but I figured I'd share it out
anyways to get it off of my laptop).

Thanks in advance for your review.

Taylor Blau (2):
  reachable.c: extract `obj_is_recent()`
  gc: introduce `gc.recentObjectsHook`

 Documentation/config/gc.txt          |  13 ++
 reachable.c                          |  85 ++++++++++++-
 t/t5304-prune.sh                     |  14 +++
 t/t5329-pack-objects-cruft.sh        | 171 +++++++++++++++++++++++++++
 t/t7701-repack-unpack-unreachable.sh |  31 +++++
 5 files changed, 311 insertions(+), 3 deletions(-)

Range-diff against v3:
1:  f5f3b0f334 = 1:  9c1b59c8cf reachable.c: extract `obj_is_recent()`
2:  2ce8a79fa4 ! 2:  18e50d2517 builtin/pack-objects.c: introduce `pack.recentObjectsHook`
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    builtin/pack-objects.c: introduce `pack.recentObjectsHook`
    +    gc: introduce `gc.recentObjectsHook`
     
         This patch introduces a new multi-valued configuration option,
    -    `pack.recentObjectsHook` as a means to mark certain objects as recent,
    -    regardless of their age.
    +    `gc.recentObjectsHook` as a means to mark certain objects as recent (and
    +    thus exempt from garbage collection), regardless of their age.
     
    -    Depending on whether or not we are generating a cruft pack, this allows
    -    the caller to do one of two things:
    +    When performing a garbage collection operation on a repository with
    +    unreachable objects, Git makes its decision on what to do with those
    +    object(s) bed on how recent the objects are or not. Generally speaking,
    +    unreachable-but-recent objects stay in the repository, and older objects
    +    are discarded.
     
    -      - If generating a cruft pack, the caller is able to retain additional
    -        objects via the cruft pack, even if they would have otherwise been
    -        pruned due to their age.
    +    However, we have no convenient way to keep certain precious, unreachable
    +    objects around in the repository, even if they have aged out and would
    +    be pruned. Our options today consist of:
     
    -      - If not generating a cruft pack, the caller is likewise able to
    -        retain additional objects as loose.
    +      - Point references at the reachability tips of any objects you
    +        consider precious, which may be undesirable or infeasible.
     
    -    There is currently no option to be able to keep around certain objects
    -    that have otherwise aged out of the grace period. The only way to retain
    -    those objects is:
    -
    -      - to point a reference at them, which may be undesirable or
    -        infeasible,
    -
    -      - to track them via the reflog, which may be undesirable since the
    +      - Track them via the reflog, which may be undesirable since the
             reflog's lifetime is limited to that of the reference it's tracking
             (and callers may want to keep those unreachable objects around for
    -        longer)
    +        longer).
     
    -      - to extend the grace period, which may keep around other objects that
    -        the caller *does* want to discard,
    +      - Extend the grace period, which may keep around other objects that
    +        the caller *does* want to discard.
     
    -      - or, to force the caller to construct the pack of objects they want
    +      - Manually modify the mtimes of objects you want to keep. If those
    +        objects are already loose, this is easy enough to do (you can just
    +        enumerate and `touch -m` each one).
    +
    +        But if they are packed, you will either end up modifying the mtimes
    +        of *all* objects in that pack, or be forced to write out a loose
    +        copy of that object, both of which may be undesirable. Even worse,
    +        if they are in a cruft pack, that requires modifying its `*.mtimes`
    +        file by hand, since there is no exposed plumbing for this.
    +
    +      - Force the caller to construct the pack of objects they want
             to keep themselves, and then mark the pack as kept by adding a
    -        ".keep" file.
    +        ".keep" file. This works, but is burdensome for the caller, and
    +        having extra packs is awkward as you roll forward your cruft pack.
     
    -    This patch introduces a new configuration, `pack.recentObjectsHook`
    -    which allows the caller to specify a program (or set of programs) whose
    -    output is treated as a set of objects to treat as recent, regardless of
    -    their true age.
    +    This patch introduces a new option to the above list via the
    +    `gc.recentObjectsHook` configuration, which allows the caller to
    +    specify a program (or set of programs) whose output is treated as a set
    +    of objects to treat as recent, regardless of their true age.
     
    -    The implementation is straightforward. In either case (cruft packs or
    -    not), Git enumerates recent objects via
    -    `add_unseen_recent_objects_to_traversal()`. That enumerates loose and
    +    The implementation is straightforward. Git enumerates recent objects via
    +    `add_unseen_recent_objects_to_traversal()`, which enumerates loose and
         packed objects, and eventually calls add_recent_object() on any objects
         for which `want_recent_object()`'s conditions are met.
     
         This patch modifies the recency condition from simply "is the mtime of
         this object more recent than the cutoff?" to "[...] or, is this object
    -    mentioned by at least one `pack.recentObjectsHook`?".
    +    mentioned by at least one `gc.recentObjectsHook`?".
     
    -    We then add those as tips to another reachability traversal (along with
    -    any recent objects, if pruning), marking every object along the way
    -    (either adding it to the cruft pack, or writing it out as a loose
    -    object).
    +    Depending on whether or not we are generating a cruft pack, this allows
    +    the caller to do one of two things:
    +
    +      - If generating a cruft pack, the caller is able to retain additional
    +        objects via the cruft pack, even if they would have otherwise been
    +        pruned due to their age.
    +
    +      - If not generating a cruft pack, the caller is likewise able to
    +        retain additional objects as loose.
     
         A potential alternative here is to introduce a new mode to alter the
         contents of the reachable pack instead of the cruft one. One could
    @@ Commit message
         Helped-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    - ## Documentation/config/pack.txt ##
    -@@ Documentation/config/pack.txt: pack.deltaCacheLimit::
    - 	result once the best match for all objects is found.
    - 	Defaults to 1000. Maximum value is 65535.
    + ## Documentation/config/gc.txt ##
    +@@ Documentation/config/gc.txt: or rebase occurring.  Since these changes are not part of the current
    + project most users will want to expire them sooner, which is why the
    + default is more aggressive than `gc.reflogExpire`.
      
    -+pack.recentObjectsHook::
    ++gc.recentObjectsHook::
     +	When considering the recency of an object (e.g., when generating
     +	a cruft pack or storing unreachable objects as loose), use the
     +	shell to execute the specified command(s). Interpret their
    @@ Documentation/config/pack.txt: pack.deltaCacheLimit::
     +operation (either generating a cruft pack or unpacking unreachable
     +objects) will be halted.
     +
    - pack.threads::
    - 	Specifies the number of threads to spawn when searching for best
    - 	delta matches.  This requires that linkgit:git-pack-objects[1]
    -
    - ## builtin/pack-objects.c ##
    -@@
    - #include "pack-mtimes.h"
    - #include "parse-options.h"
    - #include "wrapper.h"
    -+#include "run-command.h"
    - 
    - /*
    -  * Objects we are going to pack are collected in the `to_pack` structure.
    + gc.rerereResolved::
    + 	Records of conflicted merge you resolved earlier are
    + 	kept for this many days when 'git rerere gc' is run.
     
      ## reachable.c ##
     @@
    @@ reachable.c: struct recent_data {
     +	int extra_recent_oids_loaded;
      };
      
    -+static int run_one_pack_recent_objects_hook(struct oidset *set,
    ++static int run_one_gc_recent_objects_hook(struct oidset *set,
     +					    const char *args)
     +{
     +	struct child_process cmd = CHILD_PROCESS_INIT;
    @@ reachable.c: struct recent_data {
     +
     +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
     +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
    -+			goto done;
    ++			break;
     +		}
     +
     +		oidset_insert(set, &oid);
     +	}
     +
    -+	ret = finish_command(&cmd);
    ++	fclose(out);
    ++	ret |= finish_command(&cmd);
     +
    -+done:
    -+	if (out)
    -+		fclose(out);
     +	strbuf_release(&buf);
    -+	child_process_clear(&cmd);
    -+
     +	return ret;
     +}
     +
    -+static void load_pack_recent_objects(struct recent_data *data)
    ++static void load_gc_recent_objects(struct recent_data *data)
     +{
     +	const struct string_list *programs;
     +	int ret = 0;
    @@ reachable.c: struct recent_data {
     +
     +	data->extra_recent_oids_loaded = 1;
     +
    -+	if (git_config_get_string_multi("pack.recentobjectshook", &programs))
    ++	if (git_config_get_string_multi("gc.recentobjectshook", &programs))
     +		return;
     +
     +	for (i = 0; i < programs->nr; i++) {
    -+		ret = run_one_pack_recent_objects_hook(&data->extra_recent_oids,
    ++		ret = run_one_gc_recent_objects_hook(&data->extra_recent_oids,
     +						       programs->items[i].string);
     +		if (ret)
    -+			break;
    ++			die(_("unable to enumerate additional cruft tips"));
     +	}
    -+
    -+	if (ret)
    -+		die(_("unable to enumerate additional cruft tips"));
     +}
     +
      static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
    @@ reachable.c: struct recent_data {
     +		return 1;
     +
     +	if (!data->extra_recent_oids_loaded)
    -+		load_pack_recent_objects(data);
    ++		load_gc_recent_objects(data);
     +	return oidset_contains(&data->extra_recent_oids, oid);
      }
      
      static void add_recent_object(const struct object_id *oid,
    -@@ reachable.c: static int want_recent_object(struct recent_data *data,
    - 			      const struct object_id *oid)
    - {
    - 	if (data->ignore_in_core_kept_packs &&
    --	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
    -+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS)) {
    -+		if (!data->extra_recent_oids_loaded)
    -+			load_pack_recent_objects(data);
    -+		if (oidset_contains(&data->extra_recent_oids, oid))
    -+			return 1;
    -+
    - 		return 0;
    -+	}
    - 	return 1;
    - }
    - 
     @@ reachable.c: int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
      	data.cb = cb;
      	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
    @@ reachable.c: int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
      
      static int mark_object_seen(const struct object_id *oid,
     
    + ## t/t5304-prune.sh ##
    +@@ t/t5304-prune.sh: test_expect_success 'old reachable-from-recent retained with bitmaps' '
    + 	test_must_fail git cat-file -e $to_drop
    + '
    + 
    ++test_expect_success 'gc.recentObjectsHook' '
    ++	add_blob &&
    ++	test-tool chmtime =-86500 $BLOB_FILE &&
    ++
    ++	write_script precious-objects <<-EOF &&
    ++	echo $BLOB
    ++	EOF
    ++	test_config gc.recentObjectsHook ./precious-objects &&
    ++
    ++	git prune --expire=now &&
    ++
    ++	git cat-file -p $BLOB
    ++'
    ++
    + test_done
    +
      ## t/t5329-pack-objects-cruft.sh ##
     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend via loose' '
      	)
      '
      
    -+test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
    ++test_expect_success 'gc.recentObjectsHook' '
     +	git init repo &&
     +	test_when_finished "rm -fr repo" &&
     +	(
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		write_script extra-tips <<-EOF &&
     +		echo $cruft_old
     +		EOF
    -+		git config pack.recentObjectsHook ./extra-tips &&
    ++		git config gc.recentObjectsHook ./extra-tips &&
     +
     +		git repack --cruft --cruft-expiration=now -d &&
     +
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		test_cmp cruft.expect cruft.actual &&
     +
     +		# Ensure that the "old" objects are removed after
    -+		# dropping the pack.extraCruftTips hook.
    -+		git config --unset pack.recentObjectsHook &&
    ++		# dropping the gc.recentObjectsHook hook.
    ++		git config --unset gc.recentObjectsHook &&
     +		git repack --cruft --cruft-expiration=now -d &&
     +
     +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +	)
     +'
     +
    -+test_expect_success 'multi-valued pack.extraCruftTips' '
    ++test_expect_success 'multi-valued gc.recentObjectsHook' '
     +	git init repo &&
     +	test_when_finished "rm -fr repo" &&
     +	(
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +
     +		# ensure that each extra cruft tip is saved by its
     +		# respective hook
    -+		git config --add pack.recentObjectsHook ./extra-tips.a &&
    -+		git config --add pack.recentObjectsHook ./extra-tips.b &&
    ++		git config --add gc.recentObjectsHook ./extra-tips.a &&
    ++		git config --add gc.recentObjectsHook ./extra-tips.b &&
     +		git repack --cruft --cruft-expiration=now -d &&
     +
     +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		test_cmp cruft.expect cruft.actual &&
     +
     +		# ensure that a dirty exit halts cruft pack generation
    -+		git config --add pack.recentObjectsHook ./extra-tips.c &&
    ++		git config --add gc.recentObjectsHook ./extra-tips.c &&
     +		test_must_fail git repack --cruft --cruft-expiration=now -d 2>err &&
     +		grep "unable to enumerate additional cruft tips" err &&
     +
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +	)
     +'
     +
    -+test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
    ++test_expect_success 'additional cruft blobs via gc.recentObjectsHook' '
     +	git init repo &&
     +	test_when_finished "rm -fr repo" &&
     +	(
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		write_script extra-tips <<-EOF &&
     +		echo $blob
     +		EOF
    -+		git config pack.recentObjectsHook ./extra-tips &&
    ++		git config gc.recentObjectsHook ./extra-tips &&
     +
     +		git repack --cruft --cruft-expiration=now -d &&
     +
    @@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother looseni
      	test_must_fail git cat-file -p $obj2
      '
      
    -+test_expect_success 'extra recent tips are kept regardless of age' '
    ++test_expect_success 'gc.recentObjectsHook' '
     +	obj1=$(echo one | git hash-object -w --stdin) &&
     +	obj2=$(echo two | git hash-object -w --stdin) &&
    ++	obj3=$(echo three | git hash-object -w --stdin) &&
     +	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
     +	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
    ++	pack3=$(echo $obj3 | git pack-objects .git/objects/pack/pack) &&
     +	git prune-packed &&
     +
     +	git cat-file -p $obj1 &&
     +	git cat-file -p $obj2 &&
    ++	git cat-file -p $obj3 &&
     +
    -+	write_script extra-tips <<-EOF &&
    -+	echo $obj2
    ++	git tag -a -m tag obj2-tag $obj2 &&
    ++	obj2_tag="$(git rev-parse obj2-tag)" &&
    ++
    ++	write_script precious-objects <<-EOF &&
    ++	echo $obj2_tag
     +	EOF
    -+	git config pack.recentObjectsHook ./extra-tips &&
    ++	git config gc.recentObjectsHook ./precious-objects &&
     +
     +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
    ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack3.pack &&
     +	git repack -A -d --unpack-unreachable=1.hour.ago &&
     +
     +	git cat-file -p $obj1 &&
    -+	git cat-file -p $obj2
    ++	git cat-file -p $obj2 &&
    ++	git cat-file -p $obj2_tag &&
    ++	test_must_fail git cat-file -p $obj3
     +'
     +
      test_expect_success 'keep packed objects found only in index' '
-- 
2.40.1.558.g18e50d2517

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4 1/2] reachable.c: extract `obj_is_recent()`
  2023-05-16  0:23 ` [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
@ 2023-05-16  0:24   ` Taylor Blau
  2023-05-16  0:24   ` [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
  1 sibling, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-05-16  0:24 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano, Chris Torek

When enumerating objects in order to add recent ones (i.e. those whose
mtime is strictly newer than the cutoff) as tips of a reachability
traversal, `add_recent_object()` discards objects which do not meet the
recency criteria.

The subsequent commit will make checking whether or not an object is
recent also consult the list of hooks in `pack.recentHook`. Isolate this
check in its own function to keep the additional complexity outside of
`add_recent_object()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 55bb114353..7a42da5d39 100644
--- a/reachable.c
+++ b/reachable.c
@@ -69,6 +69,12 @@ struct recent_data {
 	int ignore_in_core_kept_packs;
 };
 
+static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
+			 struct recent_data *data)
+{
+	return mtime > data->timestamp;
+}
+
 static void add_recent_object(const struct object_id *oid,
 			      struct packed_git *pack,
 			      off_t offset,
@@ -78,7 +84,7 @@ static void add_recent_object(const struct object_id *oid,
 	struct object *obj;
 	enum object_type type;
 
-	if (mtime <= data->timestamp)
+	if (!obj_is_recent(oid, mtime, data))
 		return;
 
 	/*
-- 
2.40.1.558.g18e50d2517


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook`
  2023-05-16  0:23 ` [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
  2023-05-16  0:24   ` [PATCH v4 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
@ 2023-05-16  0:24   ` Taylor Blau
  2023-05-24 23:21     ` Glen Choo
  1 sibling, 1 reply; 28+ messages in thread
From: Taylor Blau @ 2023-05-16  0:24 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano, Chris Torek

This patch introduces a new multi-valued configuration option,
`gc.recentObjectsHook` as a means to mark certain objects as recent (and
thus exempt from garbage collection), regardless of their age.

When performing a garbage collection operation on a repository with
unreachable objects, Git makes its decision on what to do with those
object(s) bed on how recent the objects are or not. Generally speaking,
unreachable-but-recent objects stay in the repository, and older objects
are discarded.

However, we have no convenient way to keep certain precious, unreachable
objects around in the repository, even if they have aged out and would
be pruned. Our options today consist of:

  - Point references at the reachability tips of any objects you
    consider precious, which may be undesirable or infeasible.

  - Track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer).

  - Extend the grace period, which may keep around other objects that
    the caller *does* want to discard.

  - Manually modify the mtimes of objects you want to keep. If those
    objects are already loose, this is easy enough to do (you can just
    enumerate and `touch -m` each one).

    But if they are packed, you will either end up modifying the mtimes
    of *all* objects in that pack, or be forced to write out a loose
    copy of that object, both of which may be undesirable. Even worse,
    if they are in a cruft pack, that requires modifying its `*.mtimes`
    file by hand, since there is no exposed plumbing for this.

  - Force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file. This works, but is burdensome for the caller, and
    having extra packs is awkward as you roll forward your cruft pack.

This patch introduces a new option to the above list via the
`gc.recentObjectsHook` configuration, which allows the caller to
specify a program (or set of programs) whose output is treated as a set
of objects to treat as recent, regardless of their true age.

The implementation is straightforward. Git enumerates recent objects via
`add_unseen_recent_objects_to_traversal()`, which enumerates loose and
packed objects, and eventually calls add_recent_object() on any objects
for which `want_recent_object()`'s conditions are met.

This patch modifies the recency condition from simply "is the mtime of
this object more recent than the cutoff?" to "[...] or, is this object
mentioned by at least one `gc.recentObjectsHook`?".

Depending on whether or not we are generating a cruft pack, this allows
the caller to do one of two things:

  - If generating a cruft pack, the caller is able to retain additional
    objects via the cruft pack, even if they would have otherwise been
    pruned due to their age.

  - If not generating a cruft pack, the caller is likewise able to
    retain additional objects as loose.

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-reachable-tips`
that does the same thing as above, adding the visited set of objects
along the traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra reachable tips" programs is not closed,
then the resulting pack won't be either. This makes it impossible in the
general case to write out reachability bitmaps for that pack, since
closure is a requirement there.

Instead, keep these unreachable objects in the cruft pack (or set of
unreachable, loose objects) instead, to ensure that we can continue to
have a pack containing just reachable objects, which is always safe to
write a bitmap over.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt          |  13 ++
 reachable.c                          |  79 ++++++++++++-
 t/t5304-prune.sh                     |  14 +++
 t/t5329-pack-objects-cruft.sh        | 171 +++++++++++++++++++++++++++
 t/t7701-repack-unpack-unreachable.sh |  31 +++++
 5 files changed, 305 insertions(+), 3 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 7f95c866e1..16190be877 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -130,6 +130,19 @@ or rebase occurring.  Since these changes are not part of the current
 project most users will want to expire them sooner, which is why the
 default is more aggressive than `gc.reflogExpire`.
 
+gc.recentObjectsHook::
+	When considering the recency of an object (e.g., when generating
+	a cruft pack or storing unreachable objects as loose), use the
+	shell to execute the specified command(s). Interpret their
+	output as object IDs which Git will consider as "recent",
+	regardless of their age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+Multiple hooks are supported, but all must exit successfully, else the
+operation (either generating a cruft pack or unpacking unreachable
+objects) will be halted.
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/reachable.c b/reachable.c
index 7a42da5d39..e82ab33444 100644
--- a/reachable.c
+++ b/reachable.c
@@ -16,6 +16,8 @@
 #include "object-store.h"
 #include "pack-bitmap.h"
 #include "pack-mtimes.h"
+#include "config.h"
+#include "run-command.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -67,12 +69,75 @@ struct recent_data {
 	timestamp_t timestamp;
 	report_recent_object_fn *cb;
 	int ignore_in_core_kept_packs;
+
+	struct oidset extra_recent_oids;
+	int extra_recent_oids_loaded;
 };
 
+static int run_one_gc_recent_objects_hook(struct oidset *set,
+					    const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+
+	if (start_command(&cmd))
+		return -1;
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline(&buf, out) != EOF) {
+		struct object_id oid;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			break;
+		}
+
+		oidset_insert(set, &oid);
+	}
+
+	fclose(out);
+	ret |= finish_command(&cmd);
+
+	strbuf_release(&buf);
+	return ret;
+}
+
+static void load_gc_recent_objects(struct recent_data *data)
+{
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	data->extra_recent_oids_loaded = 1;
+
+	if (git_config_get_string_multi("gc.recentobjectshook", &programs))
+		return;
+
+	for (i = 0; i < programs->nr; i++) {
+		ret = run_one_gc_recent_objects_hook(&data->extra_recent_oids,
+						       programs->items[i].string);
+		if (ret)
+			die(_("unable to enumerate additional cruft tips"));
+	}
+}
+
 static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
 			 struct recent_data *data)
 {
-	return mtime > data->timestamp;
+	if (mtime > data->timestamp)
+		return 1;
+
+	if (!data->extra_recent_oids_loaded)
+		load_gc_recent_objects(data);
+	return oidset_contains(&data->extra_recent_oids, oid);
 }
 
 static void add_recent_object(const struct object_id *oid,
@@ -199,16 +264,24 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
 	data.cb = cb;
 	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
+	oidset_init(&data.extra_recent_oids, 0);
+	data.extra_recent_oids_loaded = 0;
+
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
-		return r;
+		goto done;
 
 	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
 	if (ignore_in_core_kept_packs)
 		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
 
-	return for_each_packed_object(add_recent_packed, &data, flags);
+	r = for_each_packed_object(add_recent_packed, &data, flags);
+
+done:
+	oidset_clear(&data.extra_recent_oids);
+
+	return r;
 }
 
 static int mark_object_seen(const struct object_id *oid,
diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh
index 662ae9b152..a635fe98f8 100755
--- a/t/t5304-prune.sh
+++ b/t/t5304-prune.sh
@@ -350,4 +350,18 @@ test_expect_success 'old reachable-from-recent retained with bitmaps' '
 	test_must_fail git cat-file -e $to_drop
 '
 
+test_expect_success 'gc.recentObjectsHook' '
+	add_blob &&
+	test-tool chmtime =-86500 $BLOB_FILE &&
+
+	write_script precious-objects <<-EOF &&
+	echo $BLOB
+	EOF
+	test_config gc.recentObjectsHook ./precious-objects &&
+
+	git prune --expire=now &&
+
+	git cat-file -p $BLOB
+'
+
 test_done
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..3ae61ca995 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
 	)
 '
 
+test_expect_success 'gc.recentObjectsHook' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D discard old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config gc.recentObjectsHook ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the gc.recentObjectsHook hook.
+		git config --unset gc.recentObjectsHook &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
+		sort cruft.raw >cruft.expect &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
+	)
+'
+
+test_expect_success 'multi-valued gc.recentObjectsHook' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan cruft.a &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.a &&
+		cruft_a="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan cruft.b &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.b &&
+		cruft_b="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D cruft.a cruft.b &&
+		git reflog expire --all --expire=all &&
+
+		echo "echo $cruft_a" | write_script extra-tips.a &&
+		echo "echo $cruft_b" | write_script extra-tips.b &&
+		echo "false" | write_script extra-tips.c &&
+
+		git rev-list --objects --no-object-names $cruft_a $cruft_b \
+			>cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# ensure that each extra cruft tip is saved by its
+		# respective hook
+		git config --add gc.recentObjectsHook ./extra-tips.a &&
+		git config --add gc.recentObjectsHook ./extra-tips.b &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure that a dirty exit halts cruft pack generation
+		git config --add gc.recentObjectsHook ./extra-tips.c &&
+		test_must_fail git repack --cruft --cruft-expiration=now -d 2>err &&
+		grep "unable to enumerate additional cruft tips" err &&
+
+		# and that the existing cruft pack is left alone
+		test_path_is_file "$mtimes"
+	)
+'
+
+test_expect_success 'additional cruft blobs via gc.recentObjectsHook' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config gc.recentObjectsHook ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index ebb267855f..ba428c18a8 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -113,6 +113,37 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'gc.recentObjectsHook' '
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	obj3=$(echo three | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	pack3=$(echo $obj3 | git pack-objects .git/objects/pack/pack) &&
+	git prune-packed &&
+
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	git cat-file -p $obj3 &&
+
+	git tag -a -m tag obj2-tag $obj2 &&
+	obj2_tag="$(git rev-parse obj2-tag)" &&
+
+	write_script precious-objects <<-EOF &&
+	echo $obj2_tag
+	EOF
+	git config gc.recentObjectsHook ./precious-objects &&
+
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack3.pack &&
+	git repack -A -d --unpack-unreachable=1.hour.ago &&
+
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	git cat-file -p $obj2_tag &&
+	test_must_fail git cat-file -p $obj3
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&
-- 
2.40.1.558.g18e50d2517

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook`
  2023-05-16  0:24   ` [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
@ 2023-05-24 23:21     ` Glen Choo
  2023-06-07 22:56       ` Taylor Blau
  0 siblings, 1 reply; 28+ messages in thread
From: Glen Choo @ 2023-05-24 23:21 UTC (permalink / raw)
  To: Taylor Blau, git
  Cc: Derrick Stolee, Jeff King, Junio C Hamano, Chris Torek, me,
	martinvonz

Hi Taylor! It was great seeing you at Review Club today :)

If you'd like, the notes are available at:

  https://docs.google.com/document/d/14L8BAumGTpsXpjDY8VzZ4rRtpAjuGrFSRqn3stCuS_w/edit

but that's optional, since a ground rule is that all important feedback
should be sent to the mailing list.

Taylor Blau <me@ttaylorr.com> writes:

> This patch introduces a new multi-valued configuration option,
> `gc.recentObjectsHook` as a means to mark certain objects as recent (and
> thus exempt from garbage collection), regardless of their age.
>
> However, we have no convenient way to keep certain precious, unreachable
> objects around in the repository, even if they have aged out and would
> be pruned. Our options today consist of:
>
>   - Point references at the reachability tips of any objects you
>     consider precious, which may be undesirable or infeasible.

What I like about the hook is that it matches the desired use case quite
well - if an object is precious but unreachable, it can't be because Git
thinks it's precious. Rather, something else thinks it's precious, so
the obvious fix is for Git to learn how to listen to that tool. Based
off only what's in this commit message though, this feature doesn't seem
sufficiently justified yet. In particular, it's not immediately clear
how this fits in with the alternatives, especially pointing refs at
precious objects (which is probably the most popular way people have
been doing this). It would be useful to flesh out why keeping these
extra refs are either "undesirable" or "infeasible". Presumably, you
already have some idea of why this is the case for the GitHub 'audit
log'.

Another potential use case I can think of is client-side third party Git
tools that implement custom workflows on top of a Git repo, e.g.
git-branchless (https://github.com/arxanas/git-branchless) and jj
(https://github.com/martinvonz/jj/, btw I contribute a little to jj
too). Both of those tools reference commits that Git can consider
unreachable (e.g. they support anonymous branches and "undo"
operations), and so they both create custom refs to keep those commits
alive. The number of refs isn't huge, though it is more than what a
typical repo would see (maybe in the 100s), and ISTR that some users
have reported that it's enough to noticeably slow down the "git fetch"
negotiation. I don't think managing these refs is "infeasible", and
there are workarounds to shrink the number of refs (e.g. creating an
octopus merge of the commits you want and pointing a ref at it), but it
would certainly be a lot easier to use the hook instead.

I've cc-ed the maintainers of both projects here in case they have more
to share.

> +gc.recentObjectsHook::

I have a small preference to use "command" instead of "hook" to avoid
confusion with Git hooks (I've already observed some of this confusion
in off-list conversations). There's some precedent for "hook" in
`uploadpack.packObjectsHook`, but that's a one-off (and it used to
confuse me a lot too :P).

> +	When considering the recency of an object (e.g., when generating
> +	a cruft pack or storing unreachable objects as loose), use the
> +	shell to execute the specified command(s). Interpret their
> +	output as object IDs which Git will consider as "recent",
> +	regardless of their age.

From a potential user's POV, two things seem unclear:

- What does "recenct" mean in this context? Does it just mean
  "precious"?
- What objects do I need to list? Is it every object I want to keep or
  just the reachable tips?

Documentation/git-gc.txt has some pointers:

  . Any object with modification time newer than the `--prune` date is kept,
    along with everything reachable from it.

but it seems quite hard to discover this and put the pieces together. I
wonder if we could make this easier by:

- Replacing "recent" with "precious" (if this is really what we mean),
  and the fact that we use the recency check is an implementation
  detail.
- Explicitly saying that everything reachable from the object is also
  kept.

In the code changes, I noticed a few out-of-date references to "cruft
tips", but everything else looked reasonable to me.

> diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
> index 303f7a5d84..3ae61ca995 100755
> --- a/t/t5329-pack-objects-cruft.sh
> +++ b/t/t5329-pack-objects-cruft.sh
> @@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
>  	)
>  '
>  
> +test_expect_success 'gc.recentObjectsHook' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		# Create a handful of objects.
> +		#
> +		#   - one reachable commit, "base", designated for the reachable
> +		#     pack
> +		#   - one unreachable commit, "cruft.discard", which is marked
> +		#     for deletion
> +		#   - one unreachable commit, "cruft.old", which would be marked
> +		#     for deletion, but is rescued as an extra cruft tip
> +		#   - one unreachable commit, "cruft.new", which is not marked
> +		#     for deletion
> +		test_commit base &&
> +		git branch -M main &&
> +
> +		git checkout --orphan discard &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.discard &&
> +
> +		git checkout --orphan old &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.old &&
> +		cruft_old="$(git rev-parse HEAD)" &&
> +
> +		git checkout --orphan new &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.new &&
> +		cruft_new="$(git rev-parse HEAD)" &&
> +
> +		git checkout main &&
> +		git branch -D discard old new &&
> +		git reflog expire --all --expire=all &&
> +
> +		# mark cruft.old with an mtime that is many minutes
> +		# older than the expiration period, and mark cruft.new
> +		# with an mtime that is in the future (and thus not
> +		# eligible for pruning).
> +		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
> +		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
> +
> +		# Write the list of cruft objects we expect to
> +		# accumulate, which is comprised of everything reachable
> +		# from cruft.old and cruft.new, but not cruft.discard.
> +		git rev-list --objects --no-object-names \
> +			$cruft_old $cruft_new >cruft.raw &&
> +		sort cruft.raw >cruft.expect &&
> +
> +		# Write the script to list extra tips, which are limited
> +		# to cruft.old, in this case.
> +		write_script extra-tips <<-EOF &&
> +		echo $cruft_old
> +		EOF
> +		git config gc.recentObjectsHook ./extra-tips &&
> +
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +		test_cmp cruft.expect cruft.actual &&
> +
> +		# Ensure that the "old" objects are removed after
> +		# dropping the gc.recentObjectsHook hook.
> +		git config --unset gc.recentObjectsHook &&
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +
> +		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
> +		cp cruft.expect cruft.old &&
> +		sort cruft.raw >cruft.expect &&
> +		test_cmp cruft.expect cruft.actual &&
> +
> +		# ensure objects which are no longer in the cruft pack were
> +		# removed from the repository
> +		for object in $(comm -13 cruft.expect cruft.old)
> +		do
> +			test_must_fail git cat-file -t $object || return 1
> +		done
> +	)

Thanks for the comments. The tests are quite verbose, and I wouldn't
be able to understand them otherwise. I thought it would be nice to have
a test helper to check that the cruft pack contains the expected objects
(replacing the "git show-index" + "cut | sort" + "test_cmp" recipe), but
this test fits in with the existing style, so I don't consider that a
blocker.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook`
  2023-05-24 23:21     ` Glen Choo
@ 2023-06-07 22:56       ` Taylor Blau
  0 siblings, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-06-07 22:56 UTC (permalink / raw)
  To: Glen Choo
  Cc: git, Derrick Stolee, Jeff King, Junio C Hamano, Chris Torek, me,
	martinvonz

Hi Glen,

Apologies for my delayed response. Now that we are on the other side of
2.41 I think that it's the right time to pick this back up again.

On Wed, May 24, 2023 at 04:21:34PM -0700, Glen Choo wrote:
> Hi Taylor! It was great seeing you at Review Club today :)

It was fun :-).

> It would be useful to flesh out why keeping these extra refs are
> either "undesirable" or "infeasible". Presumably, you already have
> some idea of why this is the case for the GitHub 'audit log'.

Yes, thanks for suggesting it. I think the gap in my explanation is
"[...] if there are many such objects". Hopefully that clarifies it.

> Another potential use case I can think of is client-side third party Git
> tools that implement custom workflows on top of a Git repo, e.g.
> git-branchless (https://github.com/arxanas/git-branchless) and jj
> (https://github.com/martinvonz/jj/, btw I contribute a little to jj
> too).

I thought that this was an interesting part of the discussion. I hadn't
thought of it when writing up these patches, but I think that it could
be potentially useful for those tools if they want to keep around some
precious set of metadata objects without having to point refs at them.

It also introduces a little bit of a higher barrier between the tool and
user to destroy those objects. Without pinning them with this hook, all
a user has to do to remove them is drop the reference(s) which points at
them, and then GC. Now they'd have to modify the hook, etc.

> > +gc.recentObjectsHook::
>
> I have a small preference to use "command" instead of "hook" to avoid
> confusion with Git hooks (I've already observed some of this confusion
> in off-list conversations). There's some precedent for "hook" in
> `uploadpack.packObjectsHook`, but that's a one-off (and it used to
> confuse me a lot too :P).

Unless you feel strongly, let's leave it as-is. "gc.recentObjectsHook"
is the third iteration of this name, and I'd like to avoid spending much
more time on naming if we can help it.

> > +	When considering the recency of an object (e.g., when generating
> > +	a cruft pack or storing unreachable objects as loose), use the
> > +	shell to execute the specified command(s). Interpret their
> > +	output as object IDs which Git will consider as "recent",
> > +	regardless of their age.
>
> >From a potential user's POV, two things seem unclear:
>
> - What does "recenct" mean in this context? Does it just mean
>   "precious"?
> - What objects do I need to list? Is it every object I want to keep or
>   just the reachable tips?

To answer your questions: recency is referring to the "mtime" of an
object [^1], not whether or not it is precious. I clarified this by
removing the term "recent" from this sentence altogether, to instead
read:

    "When considering whether or not to remove an object [...]"

You only need to list the tips, since Git will treat the output of the
hook as input to a traversal which allows for missing objects. Any
object visited along that traversal will be kept in the repository and
rescued from deletion. I tried to clarify this by adding a final
sentence (emphasis mine):

    "By treating their mtimes as "now", any objects **(and their
    descendants)** mentioned in the output will be kept regardless of
    their true age."

> In the code changes, I noticed a few out-of-date references to "cruft
> tips", but everything else looked reasonable to me.

Thanks, I'll clean those up and resubmit it with the above fixes
squashed in.

Thanks,
Taylor

[^1]: an object's mtime is the most recent of (1) the st_mtime of a its
  loose object file, if it exists, (2) the st_mtime of any non-cruft
  pack(s) that contain that object, or (3) the value listed in the .mtimes
  file of the cruft pack corresponding to that object's entry.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v5 0/2] gc: introduce `gc.recentObjectsHook`
  2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
                   ` (4 preceding siblings ...)
  2023-05-16  0:23 ` [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
@ 2023-06-07 22:58 ` Taylor Blau
  2023-06-07 22:58   ` [PATCH v5 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
  2023-06-07 22:58   ` [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
  5 siblings, 2 replies; 28+ messages in thread
From: Taylor Blau @ 2023-06-07 22:58 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano, Glen Choo

Here is a version of my series to implement the mutli-valued
"gc.recentObjectHook" configuration, which allows users to define custom
programs whose output determines an additional set of objects to retain
(along with their descendants) when GCing, regardless of their true age.

It is largely the same as the previous round, with the following
changes:

  - a small handful of wordsmithing changes in the second patch's
    message thanks to a very helpful review from the Google team's
    "Review Club" meeting.

  - removed a pair of stray references to "cruft tips" in favor of
    "recent objects".

  - rebased onto v2.41.0

Based on previous rounds of review, I think that this version should be
pretty much ready to go. But I'd appreciate some extra eyes on it just
to make sure.

Thanks in advance for your (hopefully final!) review.

Taylor Blau (2):
  reachable.c: extract `obj_is_recent()`
  gc: introduce `gc.recentObjectsHook`

 Documentation/config/gc.txt          |  15 +++
 reachable.c                          |  85 ++++++++++++-
 t/t5304-prune.sh                     |  14 +++
 t/t5329-pack-objects-cruft.sh        | 171 +++++++++++++++++++++++++++
 t/t7701-repack-unpack-unreachable.sh |  31 +++++
 5 files changed, 313 insertions(+), 3 deletions(-)

Range-diff against v4:
1:  9c1b59c8cf = 1:  38c4c4a17f reachable.c: extract `obj_is_recent()`
2:  18e50d2517 ! 2:  f661b54941 gc: introduce `gc.recentObjectsHook`
    @@ Commit message
         be pruned. Our options today consist of:
     
           - Point references at the reachability tips of any objects you
    -        consider precious, which may be undesirable or infeasible.
    +        consider precious, which may be undesirable or infeasible if there
    +        are many such objects.
     
           - Track them via the reflog, which may be undesirable since the
             reflog's lifetime is limited to that of the reference it's tracking
    @@ Documentation/config/gc.txt: or rebase occurring.  Since these changes are not p
      default is more aggressive than `gc.reflogExpire`.
      
     +gc.recentObjectsHook::
    -+	When considering the recency of an object (e.g., when generating
    -+	a cruft pack or storing unreachable objects as loose), use the
    -+	shell to execute the specified command(s). Interpret their
    -+	output as object IDs which Git will consider as "recent",
    -+	regardless of their age.
    ++	When considering whether or not to remove an object (either when
    ++	generating a cruft pack or storing unreachable objects as
    ++	loose), use the shell to execute the specified command(s).
    ++	Interpret their output as object IDs which Git will consider as
    ++	"recent", regardless of their age. By treating their mtimes as
    ++	"now", any objects (and their descendants) mentioned in the
    ++	output will be kept regardless of their true age.
     ++
     +Output must contain exactly one hex object ID per line, and nothing
     +else. Objects which cannot be found in the repository are ignored.
    @@ reachable.c: struct recent_data {
     +		ret = run_one_gc_recent_objects_hook(&data->extra_recent_oids,
     +						       programs->items[i].string);
     +		if (ret)
    -+			die(_("unable to enumerate additional cruft tips"));
    ++			die(_("unable to enumerate additional recent objects"));
     +	}
     +}
     +
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend v
     +		# ensure that a dirty exit halts cruft pack generation
     +		git config --add gc.recentObjectsHook ./extra-tips.c &&
     +		test_must_fail git repack --cruft --cruft-expiration=now -d 2>err &&
    -+		grep "unable to enumerate additional cruft tips" err &&
    ++		grep "unable to enumerate additional recent objects" err &&
     +
     +		# and that the existing cruft pack is left alone
     +		test_path_is_file "$mtimes"
-- 
2.41.0.2.gaaae24b3a6.dirty

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v5 1/2] reachable.c: extract `obj_is_recent()`
  2023-06-07 22:58 ` [PATCH v5 0/2] " Taylor Blau
@ 2023-06-07 22:58   ` Taylor Blau
  2023-06-07 22:58   ` [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
  1 sibling, 0 replies; 28+ messages in thread
From: Taylor Blau @ 2023-06-07 22:58 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano, Glen Choo

When enumerating objects in order to add recent ones (i.e. those whose
mtime is strictly newer than the cutoff) as tips of a reachability
traversal, `add_recent_object()` discards objects which do not meet the
recency criteria.

The subsequent commit will make checking whether or not an object is
recent also consult the list of hooks in `pack.recentHook`. Isolate this
check in its own function to keep the additional complexity outside of
`add_recent_object()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 55bb114353..7a42da5d39 100644
--- a/reachable.c
+++ b/reachable.c
@@ -69,6 +69,12 @@ struct recent_data {
 	int ignore_in_core_kept_packs;
 };
 
+static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
+			 struct recent_data *data)
+{
+	return mtime > data->timestamp;
+}
+
 static void add_recent_object(const struct object_id *oid,
 			      struct packed_git *pack,
 			      off_t offset,
@@ -78,7 +84,7 @@ static void add_recent_object(const struct object_id *oid,
 	struct object *obj;
 	enum object_type type;
 
-	if (mtime <= data->timestamp)
+	if (!obj_is_recent(oid, mtime, data))
 		return;
 
 	/*
-- 
2.41.0.2.gaaae24b3a6.dirty


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook`
  2023-06-07 22:58 ` [PATCH v5 0/2] " Taylor Blau
  2023-06-07 22:58   ` [PATCH v5 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
@ 2023-06-07 22:58   ` Taylor Blau
  2023-06-09 23:33     ` Glen Choo
  1 sibling, 1 reply; 28+ messages in thread
From: Taylor Blau @ 2023-06-07 22:58 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano, Glen Choo

This patch introduces a new multi-valued configuration option,
`gc.recentObjectsHook` as a means to mark certain objects as recent (and
thus exempt from garbage collection), regardless of their age.

When performing a garbage collection operation on a repository with
unreachable objects, Git makes its decision on what to do with those
object(s) bed on how recent the objects are or not. Generally speaking,
unreachable-but-recent objects stay in the repository, and older objects
are discarded.

However, we have no convenient way to keep certain precious, unreachable
objects around in the repository, even if they have aged out and would
be pruned. Our options today consist of:

  - Point references at the reachability tips of any objects you
    consider precious, which may be undesirable or infeasible if there
    are many such objects.

  - Track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer).

  - Extend the grace period, which may keep around other objects that
    the caller *does* want to discard.

  - Manually modify the mtimes of objects you want to keep. If those
    objects are already loose, this is easy enough to do (you can just
    enumerate and `touch -m` each one).

    But if they are packed, you will either end up modifying the mtimes
    of *all* objects in that pack, or be forced to write out a loose
    copy of that object, both of which may be undesirable. Even worse,
    if they are in a cruft pack, that requires modifying its `*.mtimes`
    file by hand, since there is no exposed plumbing for this.

  - Force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file. This works, but is burdensome for the caller, and
    having extra packs is awkward as you roll forward your cruft pack.

This patch introduces a new option to the above list via the
`gc.recentObjectsHook` configuration, which allows the caller to
specify a program (or set of programs) whose output is treated as a set
of objects to treat as recent, regardless of their true age.

The implementation is straightforward. Git enumerates recent objects via
`add_unseen_recent_objects_to_traversal()`, which enumerates loose and
packed objects, and eventually calls add_recent_object() on any objects
for which `want_recent_object()`'s conditions are met.

This patch modifies the recency condition from simply "is the mtime of
this object more recent than the cutoff?" to "[...] or, is this object
mentioned by at least one `gc.recentObjectsHook`?".

Depending on whether or not we are generating a cruft pack, this allows
the caller to do one of two things:

  - If generating a cruft pack, the caller is able to retain additional
    objects via the cruft pack, even if they would have otherwise been
    pruned due to their age.

  - If not generating a cruft pack, the caller is likewise able to
    retain additional objects as loose.

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-reachable-tips`
that does the same thing as above, adding the visited set of objects
along the traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra reachable tips" programs is not closed,
then the resulting pack won't be either. This makes it impossible in the
general case to write out reachability bitmaps for that pack, since
closure is a requirement there.

Instead, keep these unreachable objects in the cruft pack (or set of
unreachable, loose objects) instead, to ensure that we can continue to
have a pack containing just reachable objects, which is always safe to
write a bitmap over.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt          |  15 +++
 reachable.c                          |  79 ++++++++++++-
 t/t5304-prune.sh                     |  14 +++
 t/t5329-pack-objects-cruft.sh        | 171 +++++++++++++++++++++++++++
 t/t7701-repack-unpack-unreachable.sh |  31 +++++
 5 files changed, 307 insertions(+), 3 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 7f95c866e1..ca47eb2008 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -130,6 +130,21 @@ or rebase occurring.  Since these changes are not part of the current
 project most users will want to expire them sooner, which is why the
 default is more aggressive than `gc.reflogExpire`.
 
+gc.recentObjectsHook::
+	When considering whether or not to remove an object (either when
+	generating a cruft pack or storing unreachable objects as
+	loose), use the shell to execute the specified command(s).
+	Interpret their output as object IDs which Git will consider as
+	"recent", regardless of their age. By treating their mtimes as
+	"now", any objects (and their descendants) mentioned in the
+	output will be kept regardless of their true age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+Multiple hooks are supported, but all must exit successfully, else the
+operation (either generating a cruft pack or unpacking unreachable
+objects) will be halted.
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/reachable.c b/reachable.c
index 7a42da5d39..60a7336b87 100644
--- a/reachable.c
+++ b/reachable.c
@@ -16,6 +16,8 @@
 #include "object-store.h"
 #include "pack-bitmap.h"
 #include "pack-mtimes.h"
+#include "config.h"
+#include "run-command.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -67,12 +69,75 @@ struct recent_data {
 	timestamp_t timestamp;
 	report_recent_object_fn *cb;
 	int ignore_in_core_kept_packs;
+
+	struct oidset extra_recent_oids;
+	int extra_recent_oids_loaded;
 };
 
+static int run_one_gc_recent_objects_hook(struct oidset *set,
+					    const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+
+	if (start_command(&cmd))
+		return -1;
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline(&buf, out) != EOF) {
+		struct object_id oid;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			break;
+		}
+
+		oidset_insert(set, &oid);
+	}
+
+	fclose(out);
+	ret |= finish_command(&cmd);
+
+	strbuf_release(&buf);
+	return ret;
+}
+
+static void load_gc_recent_objects(struct recent_data *data)
+{
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	data->extra_recent_oids_loaded = 1;
+
+	if (git_config_get_string_multi("gc.recentobjectshook", &programs))
+		return;
+
+	for (i = 0; i < programs->nr; i++) {
+		ret = run_one_gc_recent_objects_hook(&data->extra_recent_oids,
+						       programs->items[i].string);
+		if (ret)
+			die(_("unable to enumerate additional recent objects"));
+	}
+}
+
 static int obj_is_recent(const struct object_id *oid, timestamp_t mtime,
 			 struct recent_data *data)
 {
-	return mtime > data->timestamp;
+	if (mtime > data->timestamp)
+		return 1;
+
+	if (!data->extra_recent_oids_loaded)
+		load_gc_recent_objects(data);
+	return oidset_contains(&data->extra_recent_oids, oid);
 }
 
 static void add_recent_object(const struct object_id *oid,
@@ -199,16 +264,24 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
 	data.cb = cb;
 	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
+	oidset_init(&data.extra_recent_oids, 0);
+	data.extra_recent_oids_loaded = 0;
+
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
-		return r;
+		goto done;
 
 	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
 	if (ignore_in_core_kept_packs)
 		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
 
-	return for_each_packed_object(add_recent_packed, &data, flags);
+	r = for_each_packed_object(add_recent_packed, &data, flags);
+
+done:
+	oidset_clear(&data.extra_recent_oids);
+
+	return r;
 }
 
 static int mark_object_seen(const struct object_id *oid,
diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh
index f367327441..b4df545e5a 100755
--- a/t/t5304-prune.sh
+++ b/t/t5304-prune.sh
@@ -350,4 +350,18 @@ test_expect_success 'old reachable-from-recent retained with bitmaps' '
 	test_must_fail git cat-file -e $to_drop
 '
 
+test_expect_success 'gc.recentObjectsHook' '
+	add_blob &&
+	test-tool chmtime =-86500 $BLOB_FILE &&
+
+	write_script precious-objects <<-EOF &&
+	echo $BLOB
+	EOF
+	test_config gc.recentObjectsHook ./precious-objects &&
+
+	git prune --expire=now &&
+
+	git cat-file -p $BLOB
+'
+
 test_done
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..45667d4999 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' '
 	)
 '
 
+test_expect_success 'gc.recentObjectsHook' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D discard old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config gc.recentObjectsHook ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the gc.recentObjectsHook hook.
+		git config --unset gc.recentObjectsHook &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
+		sort cruft.raw >cruft.expect &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
+	)
+'
+
+test_expect_success 'multi-valued gc.recentObjectsHook' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan cruft.a &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.a &&
+		cruft_a="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan cruft.b &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.b &&
+		cruft_b="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D cruft.a cruft.b &&
+		git reflog expire --all --expire=all &&
+
+		echo "echo $cruft_a" | write_script extra-tips.a &&
+		echo "echo $cruft_b" | write_script extra-tips.b &&
+		echo "false" | write_script extra-tips.c &&
+
+		git rev-list --objects --no-object-names $cruft_a $cruft_b \
+			>cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# ensure that each extra cruft tip is saved by its
+		# respective hook
+		git config --add gc.recentObjectsHook ./extra-tips.a &&
+		git config --add gc.recentObjectsHook ./extra-tips.b &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure that a dirty exit halts cruft pack generation
+		git config --add gc.recentObjectsHook ./extra-tips.c &&
+		test_must_fail git repack --cruft --cruft-expiration=now -d 2>err &&
+		grep "unable to enumerate additional recent objects" err &&
+
+		# and that the existing cruft pack is left alone
+		test_path_is_file "$mtimes"
+	)
+'
+
+test_expect_success 'additional cruft blobs via gc.recentObjectsHook' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config gc.recentObjectsHook ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index ebb267855f..ba428c18a8 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -113,6 +113,37 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'gc.recentObjectsHook' '
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	obj3=$(echo three | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	pack3=$(echo $obj3 | git pack-objects .git/objects/pack/pack) &&
+	git prune-packed &&
+
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	git cat-file -p $obj3 &&
+
+	git tag -a -m tag obj2-tag $obj2 &&
+	obj2_tag="$(git rev-parse obj2-tag)" &&
+
+	write_script precious-objects <<-EOF &&
+	echo $obj2_tag
+	EOF
+	git config gc.recentObjectsHook ./precious-objects &&
+
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack3.pack &&
+	git repack -A -d --unpack-unreachable=1.hour.ago &&
+
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	git cat-file -p $obj2_tag &&
+	test_must_fail git cat-file -p $obj3
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&
-- 
2.41.0.2.gaaae24b3a6.dirty

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook`
  2023-06-07 22:58   ` [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
@ 2023-06-09 23:33     ` Glen Choo
  2023-06-12 21:14       ` Junio C Hamano
  0 siblings, 1 reply; 28+ messages in thread
From: Glen Choo @ 2023-06-09 23:33 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Chris Torek, Derrick Stolee, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:

> When performing a garbage collection operation on a repository with
> unreachable objects, Git makes its decision on what to do with those
> object(s) bed on how recent the objects are or not. Generally speaking,

s/bed/based/ ?

> +gc.recentObjectsHook::
> +	When considering whether or not to remove an object (either when
> +	generating a cruft pack or storing unreachable objects as
> +	loose), use the shell to execute the specified command(s).
> +	Interpret their output as object IDs which Git will consider as
> +	"recent", regardless of their age. By treating their mtimes as
> +	"now", any objects (and their descendants) mentioned in the
> +	output will be kept regardless of their true age.

Thanks! I think this version will be clear to prospective users.

> +static int run_one_gc_recent_objects_hook(struct oidset *set,
> +					    const char *args)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +
> +	if (start_command(&cmd))
> +		return -1;
> +
> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline(&buf, out) != EOF) {
> +		struct object_id oid;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);

To be consistent with the other error message, perhaps s/extra cruft
tip/additional recent object/?

> +static void load_gc_recent_objects(struct recent_data *data)
> +{
> +	const struct string_list *programs;
> +	int ret = 0;
> +	size_t i;
> +
> +	data->extra_recent_oids_loaded = 1;
> +
> +	if (git_config_get_string_multi("gc.recentobjectshook", &programs))
> +		return;
> +
> +	for (i = 0; i < programs->nr; i++) {
> +		ret = run_one_gc_recent_objects_hook(&data->extra_recent_oids,
> +						       programs->items[i].string);
> +		if (ret)
> +			die(_("unable to enumerate additional recent objects"));

This error message to be exact.

> +test_expect_success 'gc.recentObjectsHook' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		# Create a handful of objects.
> +		#
> +		#   - one reachable commit, "base", designated for the reachable
> +		#     pack
> +		#   - one unreachable commit, "cruft.discard", which is marked
> +		#     for deletion
> +		#   - one unreachable commit, "cruft.old", which would be marked
> +		#     for deletion, but is rescued as an extra cruft tip

Perhaps we intended to drop "cruft tips" in the test comments too? e.g.
here and..

> +test_expect_success 'multi-valued gc.recentObjectsHook' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		test_commit base &&
> +		git branch -M main &&
> +
> +		git checkout --orphan cruft.a &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.a &&
> +		cruft_a="$(git rev-parse HEAD)" &&
> +
> +		git checkout --orphan cruft.b &&
> +		git rm -fr . &&
> +		test_commit --no-tag cruft.b &&
> +		cruft_b="$(git rev-parse HEAD)" &&
> +
> +		git checkout main &&
> +		git branch -D cruft.a cruft.b &&
> +		git reflog expire --all --expire=all &&
> +
> +		echo "echo $cruft_a" | write_script extra-tips.a &&
> +		echo "echo $cruft_b" | write_script extra-tips.b &&
> +		echo "false" | write_script extra-tips.c &&
> +
> +		git rev-list --objects --no-object-names $cruft_a $cruft_b \
> +			>cruft.raw &&
> +		sort cruft.raw >cruft.expect &&
> +
> +		# ensure that each extra cruft tip is saved by its

here.

Aside from those trivial points, everything looks good.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook`
  2023-06-09 23:33     ` Glen Choo
@ 2023-06-12 21:14       ` Junio C Hamano
  0 siblings, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2023-06-12 21:14 UTC (permalink / raw)
  To: Glen Choo; +Cc: Taylor Blau, git, Jeff King, Chris Torek, Derrick Stolee

Glen Choo <chooglen@google.com> writes:

> Taylor Blau <me@ttaylorr.com> writes:
>
>> When performing a garbage collection operation on a repository with
>> unreachable objects, Git makes its decision on what to do with those
>> object(s) bed on how recent the objects are or not. Generally speaking,
>
> s/bed/based/ ?

Indeed.  Also "or not" sounds a bit extraneous.

>> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
>> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
>
> To be consistent with the other error message, perhaps s/extra cruft
> tip/additional recent object/?

Sharp eyes.

> Aside from those trivial points, everything looks good.

Thanks for reviewing (and writing).

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-06-12 21:21 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-11 23:20 [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
2023-05-11 23:20 ` [PATCH v3 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
2023-05-11 23:20 ` [PATCH v3 2/2] builtin/pack-objects.c: introduce `pack.recentObjectsHook` Taylor Blau
2023-05-12  4:58   ` Jeff King
2023-05-15 20:15     ` Taylor Blau
2023-05-12 21:24   ` Jeff King
2023-05-12 21:36     ` Taylor Blau
2023-05-12 21:46       ` Jeff King
2023-05-12 21:45     ` Jeff King
2023-05-12 22:01       ` Jeff King
2023-05-12 23:21         ` Junio C Hamano
2023-05-13  0:11           ` Jeff King
2023-05-13  0:11             ` Jeff King
2023-05-15 20:49       ` Taylor Blau
2023-05-15 20:38     ` Taylor Blau
2023-05-11 23:23 ` [PATCH v3 0/2] pack-objects: introduce `pack.extraRecentObjectsHook` Taylor Blau
2023-05-11 23:39 ` Junio C Hamano
2023-05-11 23:48   ` Taylor Blau
2023-05-16  0:23 ` [PATCH v4 0/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
2023-05-16  0:24   ` [PATCH v4 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
2023-05-16  0:24   ` [PATCH v4 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
2023-05-24 23:21     ` Glen Choo
2023-06-07 22:56       ` Taylor Blau
2023-06-07 22:58 ` [PATCH v5 0/2] " Taylor Blau
2023-06-07 22:58   ` [PATCH v5 1/2] reachable.c: extract `obj_is_recent()` Taylor Blau
2023-06-07 22:58   ` [PATCH v5 2/2] gc: introduce `gc.recentObjectsHook` Taylor Blau
2023-06-09 23:33     ` Glen Choo
2023-06-12 21:14       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).