Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
	Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`
Date: Wed, 03 May 2023 16:18:44 -0700	[thread overview]
Message-ID: <xmqqv8h9m2az.fsf@gitster.g> (raw)
In-Reply-To: <27a7f16aab35b5cac391d9831aadb0f2e2146313.1683151485.git.me@ttaylorr.com> (Taylor Blau's message of "Wed, 3 May 2023 18:05:45 -0400")

Taylor Blau <me@ttaylorr.com> writes:

>   - When not pruning, all packed and loose objects which do not appear
>     in the any of the packs which wil remain (as indicated over stdin,
>     with `--cruft`) are packed separately into a cruft pack.

"wil remain" -> "will remain"

But more importantly, it is not clear to me whose "stdin" we are
talking about and what is sent "over stdin" here.  Somebody pipes
the output of "printf %s --cruft" to "pack-objects"???

> However, there is no option to be able to keep around certain objects
> that have otherwise aged out of the grace period. The only way to retain
> those objects is:
>
>   - to point a reference at them, which may be undesirable or
>     infeasible,
>   - to track them via the reflog, which may be undesirable since the
>     reflog's lifetime is limited to that of the reference it's tracking
>     (and callers may want to keep those unreachable objects around for
>     longer)
>   - to extend the grace period, which may keep around other objects that
>     the caller *does* want to discard,
>   - or, to force the caller to construct the pack of objects they want
>     to keep themselves, and then mark the pack as kept by adding a
>     ".keep" file.

If you do not want to point objects you want to keep with a ref,
then you are making a problem for yourself, as gc decisions are
designed around the notion that anything worthwhile to keep are
supposed to be reachable from some anchoring points (like the refs,
the reflogs, the index and its various extensions, etc.).

> Here is a reworked version of the pack.extraCruftTips feature to reuse
> parts of the "rescue old/unreachable, but reachable-from-recent" traversal.

That sounds like a sensible way to go, but as I said earlier, it
still is not clear, at least to me, why the pack.extraCruftTips
program can compute them cheaply enough when you cannot have these
tips pointed at with refs in some reserved namespaces of your own.

> +pack.extraCruftTips::

Is this consistent with the way we name a variable whose value
specifies a command to run?  At first, I thought this was a
multi-valued configuration variable, each of the value is an object
name.  extraCruftTipsCmd?  extraCruftTipsHook?

> +	When generating a cruft pack, use the shell to execute the
> +	specified command(s), and interpret their output as additional
> +	tips of objects to keep in the cruft pack, regardless of their

What is a "tip of an object"?  The first byte ;-)?

A "tip of history" would only imply commit objects, but presumably
you would want to specify a tree and protect all the blobs and trees
it recursively contains, so that is not a good name for it.

> +	age.
> ++
> +Output must contain exactly one hex object ID per line, and nothing
> +else. Objects which cannot be found in the repository are ignored.

Are objects that cannot be found the same as objects that are
missing?  How do we determine if a given object name refers to an
object which cannot be found?  Would the lazy fetching from promisor
remotes come into play?

> +Multiple hooks are supported, but all must exit successfully, else no
> +cruft pack will be generated.
> +

Now, we are told this variable refers to "hooks".  If that is the
name we want to call them, we'd better use the term from the
beginning.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a5b466839b..6f6e7872cd 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -44,6 +44,7 @@
>  #include "pack-mtimes.h"
>  #include "parse-options.h"
>  #include "wrapper.h"
> +#include "run-command.h"
>
>  /*
>   * Objects we are going to pack are collected in the `to_pack` structure.
> @@ -3543,11 +3544,85 @@ static void enumerate_cruft_objects(void)
>  	stop_progress(&progress_state);
>  }
>
> +static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)

Let's lose the _1 suffix unless you have
enumerate_extra_cruft_tips() that farms out bulk of its inner
workings to this one.

> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +
> +	if (start_command(&cmd))
> +		return -1;
> +
> +	out = xfdopen(cmd.out, "r");

It somehow feels funny to be reading from "out", which is usually
where we write things to.

> +	while (strbuf_getline(&buf, out) != EOF) {

Ditto.

> +		struct object_id oid;
> +		struct object *obj;
> +		int type;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}
> +
> +		type = oid_object_info(the_repository, &oid, NULL);
> +		if (type < 0)
> +			continue;
> +
> +		obj = lookup_object_by_type(the_repository, &oid, type);
> +		if (!obj)
> +			continue;

Hmph, we may want to have an interface that lets us avoid looking up
the same oid twice in the same set of tables.  Given an object
unseen so far, oid_object_info() should have done most of the work
necessary for lookup_object_by_type() to get to and start parsing
the data of the object in the good case (i.e. object exists and in a
pack---just we haven't needed it yet), but in the above sequence
there is not enough information passed between two calls to take
advantage of it.

> +		display_progress(progress_state, ++nr_seen);
> +		add_pending_object(revs, obj, "");
> +	}
> +
> +	ret = finish_command(&cmd);
> +
> +done:
> +	if (out)
> +		fclose(out);
> +	strbuf_release(&buf);
> +	child_process_clear(&cmd);
> +
> +	return ret;
> +}
> +
> +static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
> +{
> +	const struct string_list *programs;
> +	int ret = 0;
> +	size_t i;
> +
> +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
> +		return;
> +
> +	if (progress)
> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
> +	nr_seen = 0;
> +	for (i = 0; i < programs->nr; i++) {
> +		ret = enumerate_extra_cruft_tips_1(revs,
> +						   programs->items[i].string);
> +		if (ret)
> +			break;
> +	}
> +	stop_progress(&progress_state);

We iterate over the value list internal to the repo_config, but we
are only borrowing them so there is no need to do any clean-up upon
leaving.  OK.

> +	if (ret)
> +		die(_("unable to enumerate additional cruft tips"));
> +}
> +
>  static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
>  {
>  	struct packed_git *p;
>  	struct rev_info revs;
> -	int ret;
> +	int ret = 0;
>
>  	repo_init_revisions(the_repository, &revs, NULL);
>
> @@ -3560,11 +3635,15 @@ static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
>
>  	revs.ignore_missing_links = 1;
>
> -	if (progress)
> -		progress_state = start_progress(_("Enumerating cruft objects"), 0);
> -	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
> -						     set_cruft_mtime, 1);
> -	stop_progress(&progress_state);
> +	if (cruft_expiration) {
> +		if (progress)
> +			progress_state = start_progress(_("Enumerating cruft objects"), 0);
> +		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
> +							     set_cruft_mtime, 1);
> +		stop_progress(&progress_state);
> +	}
> +
> +	add_extra_cruft_tips_to_traversal(&revs);

Hmph, is this hunk doing two unrelated things at the same time?
What was the reason why we didn't have to be conditional on
cruft_expiration in the original code, and why does it start
mattering when we teach the code to call out to hooks?

> @@ -3639,8 +3718,19 @@ static void read_cruft_objects(void)
>
>  	if (cruft_expiration)
>  		enumerate_and_traverse_cruft_objects(&fresh_packs);
> -	else
> +	else {
> +		/*
> +		 * call both enumerate_cruft_objects() and
> +		 * enumerate_and_traverse_cruft_objects(), since:
> +		 *
> +		 *   - the former is required to implement non-pruning GCs
> +		 *   - the latter is a noop for "cruft_expiration == 0", but
> +		 *     picks up any additional tips mentioned via
> +		 *     `pack.extraCruftTips`.
> +		 */
>  		enumerate_cruft_objects();
> +		enumerate_and_traverse_cruft_objects(&fresh_packs);
> +	}

  reply	other threads:[~2023-05-03 23:18 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-20 17:27 [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips` Taylor Blau
2023-04-20 18:12 ` Junio C Hamano
2023-04-20 19:30   ` Taylor Blau
2023-04-20 19:52     ` Junio C Hamano
2023-04-20 20:48       ` Taylor Blau
2023-04-21  0:10 ` Chris Torek
2023-04-21  2:14   ` Taylor Blau
2023-04-25 19:42 ` Derrick Stolee
2023-04-25 21:25   ` Taylor Blau
2023-04-26 10:52     ` Derrick Stolee
2023-05-03  0:06       ` Taylor Blau
2023-05-03  0:09 ` [PATCH v2] " Taylor Blau
2023-05-03 14:01   ` Derrick Stolee
2023-05-03 19:59   ` Jeff King
2023-05-03 21:22     ` Taylor Blau
2023-05-05 21:23       ` Jeff King
2023-05-06  0:06         ` Taylor Blau
2023-05-06  0:14           ` Taylor Blau
2023-05-03 21:28     ` Taylor Blau
2023-05-05 21:26       ` Jeff King
2023-05-05 22:13         ` Jeff King
2023-05-06  0:13           ` Taylor Blau
2023-05-06  0:20             ` Taylor Blau
2023-05-06  2:12             ` Jeff King
2023-05-03 22:05 ` [PATCH v3] " Taylor Blau
2023-05-03 23:18   ` Junio C Hamano [this message]
2023-05-03 23:42     ` Junio C Hamano
2023-05-03 23:48       ` Taylor Blau
2023-05-03 23:50       ` Taylor Blau
2023-05-05 21:39     ` Jeff King
2023-05-05 22:19   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqv8h9m2az.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).