Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH] repack: only repack .packs that exist
@ 2023-06-20 19:03 Derrick Stolee via GitGitGadget
  2023-06-21  9:01 ` Taylor Blau
  2023-06-27  7:54 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-06-20 19:03 UTC (permalink / raw)
  To: git; +Cc: gitster, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In 73320e49add (builtin/repack.c: only collect fully-formed packs,
2023-06-07), we switched the check for which packs to collect by
starting at the .idx files and looking for matching .pack files. This
avoids trying to repack pack-files that have not had their pack-indexes
installed yet.

However, it does cause maintenance to halt if we find the (problematic,
but not insurmountable) case of a .idx file without a corresponding
.pack file. In an environment where packfile maintenance is a critical
function, such a hard stop is costly and requires human intervention to
resolve (by deleting the .idx file).

This was not the case before. We successfully repacked through this
scenario until the recent change to scan for .idx files.

Further, if we are actually in a case where objects are missing, we
detect this at a different point during the reachability walk.

In other cases, Git prepares its list of packfiles by scanning .idx
files and then only adds it to the packfile list if the corresponding
.pack file exists. It even does so without a warning! (See
add_packed_git() in packfile.c for details.)

This case is much less likely to occur than the failures seen before
73320e49add. Packfiles are "installed" by writing the .pack file before
the .idx and that process can be interrupted. Packfiles _should_ be
deleted by deleting the .idx first, followed by the .pack file, but
unlink_pack_path() does not do this: it deletes the .pack _first_,
allowing a window where this process could be interrupted. We leave the
consideration of changing this order as a separate concern. Knowing that
this condition is possible from interrupted Git processes and not other
tools lends some weight that Git should be more flexible around this
scenario.

Add a check to see if the .pack file exists before adding it to the list
for repacking. This will stop a number of maintenance failures seen in
production but fixed by deleting the .idx files.

This brings us closer to the case before 73320e49add in that 'git
repack' will not fail when there is an orphaned .idx file, at least, not
due to the way we scan for packfiles. In the case that the .pack file
was erroneously deleted without copies of its objects in other installed
packfiles, then 'git repack' will fail due to the reachable object walk.

This does resolve the case where automated repacks will no longer be
halted on this case. The tests in t7700 show both these successful
scenarios and the case of failing if the .pack was truly required.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
    [RFC] repack: only repack .packs that exist
    
    This is based on tb/collect-pack-filenames-fix.
    
    This is an RFC since there is some internal disagreement as to whether
    this is a safe scenario to "ignore" during a repack.
    
    I'm of the opinion that blocking on the case where there are no missing
    objects is unnecessary noise, and matches our behavior before the change
    to scan for '.idx' files.
    
    I'm submitting this to the list to gather more opinions about the safety
    and/or necessity of this change before moving forward with a full
    review.
    
    This is related to the deletion of .packs before .idx files that I
    submitted as [1]
    
    [1]
    https://lore.kernel.org/git/pull.1547.git.1687287675248.gitgitgadget@gmail.com
    
    Thanks, -Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1546%2Fderrickstolee%2Frepack-ignore-idx-no-pack-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1546/derrickstolee/repack-ignore-idx-no-pack-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1546

 builtin/repack.c  |  3 +++
 t/t7700-repack.sh | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 1e21a21ea82..b1695c9b2eb 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -123,6 +123,9 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		strbuf_add(&buf, e->d_name, len);
 		strbuf_addstr(&buf, ".pack");
 
+		if (!file_exists(mkpath("%s/%s", packdir, buf.buf)))
+			continue;
+
 		for (i = 0; i < extra_keep->nr; i++)
 			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
 				break;
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 08b5ba0c150..e780efe5e0c 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -239,6 +239,10 @@ test_expect_success 'repack --keep-pack' '
 			mv "$from" "$to" || return 1
 		done &&
 
+		# A .idx file without a .pack should not stop us from
+		# repacking what we can.
+		touch .git/objects/pack/pack-does-not-exist.idx &&
+
 		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
 
 		ls .git/objects/pack/*.pack >newer-counts &&
@@ -247,6 +251,36 @@ test_expect_success 'repack --keep-pack' '
 	)
 '
 
+test_expect_success 'repacking fails when missing .pack actually means missing objects' '
+	test_create_repo idx-without-pack &&
+	(
+		cd idx-without-pack &&
+
+		# Avoid producing difference packs to delta/base choices
+		git config pack.window 0 &&
+		P1=$(commit_and_pack 1) &&
+		P2=$(commit_and_pack 2) &&
+		P3=$(commit_and_pack 3) &&
+		P4=$(commit_and_pack 4) &&
+		ls .git/objects/pack/*.pack >old-counts &&
+		test_line_count = 4 old-counts &&
+
+		# Remove one .pack file
+		rm .git/objects/pack/$P2 &&
+
+		ls .git/objects/pack >before-pack-dir &&
+
+		test_must_fail git fsck &&
+		test_must_fail git repack --cruft -d 2>err &&
+		grep "bad object" err &&
+
+		# Before failing, the repack did not modify the
+		# pack directory.
+		ls .git/objects/pack >after-pack-dir &&
+		test_cmp before-pack-dir after-pack-dir
+	)
+'
+
 test_expect_success 'bitmaps are created by default in bare repos' '
 	git clone --bare .git bare.git &&
 	rm -f bare.git/objects/pack/*.bitmap &&

base-commit: 73320e49add4b56aba9bf5236a216498fa8ccc22
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] repack: only repack .packs that exist
  2023-06-20 19:03 [PATCH] repack: only repack .packs that exist Derrick Stolee via GitGitGadget
@ 2023-06-21  9:01 ` Taylor Blau
  2023-06-27  7:54 ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2023-06-21  9:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, Derrick Stolee, Jeff King

On Tue, Jun 20, 2023 at 07:03:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
>
> In 73320e49add (builtin/repack.c: only collect fully-formed packs,
> 2023-06-07), we switched the check for which packs to collect by
> starting at the .idx files and looking for matching .pack files. This
> avoids trying to repack pack-files that have not had their pack-indexes
> installed yet.

Right; the motivation behind this change is based on the principle that
we consider a packfile (and any corresponding metadata it might have,
like .keep, .bitmap, .rev, etc.) to be accessible exactly when there is
an .idx file in place.

So if pack-P.idx exists, we expect to be able to read the contents of
pack-P.pack (and any other pack-P.* files that happen to be laying
around). The idea with 73320e49add is to make `collect_pack_filenames()`
consistent with the same pack existence check that we use everywhere
else.

> However, it does cause maintenance to halt if we find the (problematic,
> but not insurmountable) case of a .idx file without a corresponding
> .pack file. In an environment where packfile maintenance is a critical
> function, such a hard stop is costly and requires human intervention to
> resolve (by deleting the .idx file).

Do all cases fall into the "problematic, but not insurmountable"
category? I'm not entirely sure. One instance that is definitely OK is
having a partially unstaged pack whose objects were repacked elsewhere,
but the 'repack' process doing so died between unlink()-ing the ".pack"
file and the ".idx". This matches the patch that you and I discussed
yesterday to switch the order and restore git-repack.sh's original
behavior, which is correct.

But what about someone accidentally deleting a ".pack" file and leaving
around its index? Or copying files from one repository to another and
only partially copying the contents of $GIT_DIR/objects/pack?

What makes me leery about this kind of change is that there are
potentially many more cases like the above. And even if the interesting
cases are somehow limited to the above, continuing on in the face of a
malformed pack group doesn't seem like the right thing for repack to be
doing.

> This was not the case before. We successfully repacked through this
> scenario until the recent change to scan for .idx files.

Right, my feeling is that the new behavior post-73320e49add is arguably
more correct.

> In other cases, Git prepares its list of packfiles by scanning .idx
> files and then only adds it to the packfile list if the corresponding
> .pack file exists. It even does so without a warning! (See
> add_packed_git() in packfile.c for details.)

This is by design: we only try to open the pack itself when we call
`open_packed_git()`. All packs start out initialized with just their
".idx" as you note.

This feels similar to what you are trying to do here, but I do think
that it's different. While we're performing object reads, there may be
concurrent writers adding (e.g. from pushes) or removing packs (e.g.
from repacking) in the repository. So there we have to be resilient
to thinking that we might have a copy of some object in a given pack,
but still being able to handle the case where that pack disappears and
the object is actually found in some other pack.

Repacking feels different to me. There we are modifying the packs
themselves, so having a incorrectly staged pack group feels like a
condition that we should stop on, because we can no longer completely
trust the data that we are operating on.

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 08b5ba0c150..e780efe5e0c 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -239,6 +239,10 @@ test_expect_success 'repack --keep-pack' '
>  			mv "$from" "$to" || return 1
>  		done &&
>
> +		# A .idx file without a .pack should not stop us from
> +		# repacking what we can.
> +		touch .git/objects/pack/pack-does-not-exist.idx &&
> +
>  		git repack --cruft -d --keep-pack $P1 --keep-pack $P4 &&
>
>  		ls .git/objects/pack/*.pack >newer-counts &&

If we do end up pursuing this patch, it may be worthwhile to ensure that
the dangling .idx file remains in place after the repack is complete.
You test for this implicitly below, but doing so here explicitly feels
worthwhile to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] repack: only repack .packs that exist
  2023-06-20 19:03 [PATCH] repack: only repack .packs that exist Derrick Stolee via GitGitGadget
  2023-06-21  9:01 ` Taylor Blau
@ 2023-06-27  7:54 ` Jeff King
  2023-06-29  9:24   ` Taylor Blau
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2023-06-27  7:54 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, me, Derrick Stolee

On Tue, Jun 20, 2023 at 07:03:02PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> 
> In 73320e49add (builtin/repack.c: only collect fully-formed packs,
> 2023-06-07), we switched the check for which packs to collect by
> starting at the .idx files and looking for matching .pack files. This
> avoids trying to repack pack-files that have not had their pack-indexes
> installed yet.
> 
> However, it does cause maintenance to halt if we find the (problematic,
> but not insurmountable) case of a .idx file without a corresponding
> .pack file. In an environment where packfile maintenance is a critical
> function, such a hard stop is costly and requires human intervention to
> resolve (by deleting the .idx file).
> 
> This was not the case before. We successfully repacked through this
> scenario until the recent change to scan for .idx files.
> 
> Further, if we are actually in a case where objects are missing, we
> detect this at a different point during the reachability walk.

I'd worry less about missing objects here, and more that we could be in
a race in which git-repack sees some partial state and decides to take
some action that causes harm. For example, in a "git repack -ad", could
we later mistakenly delete a .idx file because we thought it had no
matching pack? The repack itself would be happy, but it might be causing
headaches for later processes.

I _think_ the answer is "no", because any file we skip via
collect_pack_filenames() won't make it into the list of either the
existing nonkept or kept packs, and we use "nonkept" list to do the
deletion.

So it seems like missing a pack here (whether racily or not) could at
most result in cruft being left in place, or a bad --geometric decision
being made, etc. Which makes the stakes reasonably low, I think.

> In other cases, Git prepares its list of packfiles by scanning .idx
> files and then only adds it to the packfile list if the corresponding
> .pack file exists. It even does so without a warning! (See
> add_packed_git() in packfile.c for details.)

Interesting. I'd have expected a warning. ;)

I also kind of wonder if this repack code should simply be loading and
iterating the packed_git list, but that is a much bigger change.

> This case is much less likely to occur than the failures seen before
> 73320e49add. Packfiles are "installed" by writing the .pack file before
> the .idx and that process can be interrupted. Packfiles _should_ be
> deleted by deleting the .idx first, followed by the .pack file, but
> unlink_pack_path() does not do this: it deletes the .pack _first_,
> allowing a window where this process could be interrupted. We leave the
> consideration of changing this order as a separate concern. Knowing that
> this condition is possible from interrupted Git processes and not other
> tools lends some weight that Git should be more flexible around this
> scenario.

Hmm. So it really sounds to me like unlink_pack_path() is doing the
wrong thing. And once fixed, we would (assuming no other similar spots)
cease to see any races with this code. At which point would we still
want the looseness provided by this patch?

I.e., is the concern concurrent races, or is it leftover cruft (possibly
from a killed process, crash, etc)?

If the latter, then we might still want to quietly skip them. But it
makes me wonder how such .idx files should get cleaned up in the longer
term.

I'm kind of guessing that what you've seen is actually leftover cruft,
if only because I wouldn't expect concurrent unlinking to be very common
(it implies two repacks running at the same time, which is generally a
bad idea). Whereas new packs being written would be common (via
index-pack, etc from pushes).

> Add a check to see if the .pack file exists before adding it to the list
> for repacking. This will stop a number of maintenance failures seen in
> production but fixed by deleting the .idx files.

OK. The fact that this brings things in line with add_packed_git() is
compelling to me, and I think the stakes are pretty low for this causing
us to do something confusing.

It does make me feel like we should be doing something similar to
git-prune's prune_cruft(): any non-.pack file in objects/pack with an
old mtime and no matching .pack file should be a candidate for deletion.
I don't _think_ we have anything like that now, but it would clean these
in the longer term. I'm OK with that being a separate topic, though.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] repack: only repack .packs that exist
  2023-06-27  7:54 ` Jeff King
@ 2023-06-29  9:24   ` Taylor Blau
  2023-07-02 13:11     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Taylor Blau @ 2023-06-29  9:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee via GitGitGadget, git, gitster, Derrick Stolee

On Tue, Jun 27, 2023 at 03:54:27AM -0400, Jeff King wrote:
> > In other cases, Git prepares its list of packfiles by scanning .idx
> > files and then only adds it to the packfile list if the corresponding
> > .pack file exists. It even does so without a warning! (See
> > add_packed_git() in packfile.c for details.)
>
> Interesting. I'd have expected a warning. ;)

Reading this all over, I was incorrect in my original suggestion that
not checking for the existence of the matching ".pack" file was the
expected thing to do.

We don't try to open the ".pack" file itself in add_packed_git(), that
happens later in open_packed_git() if we end up actually needing to read
objects from the pack, or want to retain a handle on it for later if
we're worried that the ".pack" file itself might get deleted.

But that add_packed_git() silently ignores a pack which has its ".idx"
file and not its ".pack" file is behavior that I wasn't aware of, and
had somehow missed when I reread the function last week.

Stolee: I apologize for having missed this important detail. I think in
that sense matching the behavior of add_packed_git() here is the right
thing to do, and that this patch makes sense to me.

> I also kind of wonder if this repack code should simply be loading and
> iterating the packed_git list, but that is a much bigger change.

I have wanted to do this for a while ;-). The minimal patch is less
invasive than I had thought:

--- 8< ---
diff --git a/builtin/repack.c b/builtin/repack.c
index 1e21a21ea8..e854c832fd 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -104,46 +104,33 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 				   struct string_list *fname_kept_list,
 				   const struct string_list *extra_keep)
 {
-	DIR *dir;
-	struct dirent *e;
-	char *fname;
+	struct packed_git *p;
 	struct strbuf buf = STRBUF_INIT;

-	if (!(dir = opendir(packdir)))
-		return;
-
-	while ((e = readdir(dir)) != NULL) {
-		size_t len;
+	for (p = get_all_packs(the_repository); p; p = p->next) {
 		int i;
-
-		if (!strip_suffix(e->d_name, ".idx", &len))
-			continue;
-
-		strbuf_reset(&buf);
-		strbuf_add(&buf, e->d_name, len);
-		strbuf_addstr(&buf, ".pack");
+		const char *base = basename(p->pack_name);

 		for (i = 0; i < extra_keep->nr; i++)
-			if (!fspathcmp(buf.buf, extra_keep->items[i].string))
+			if (!fspathcmp(base, extra_keep->items[i].string))
 				break;

-		fname = xmemdupz(e->d_name, len);
+		strbuf_reset(&buf);
+		strbuf_addstr(&buf, base);
+		strbuf_strip_suffix(&buf, ".pack");

-		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
-		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
-			string_list_append_nodup(fname_kept_list, fname);
-		} else {
+		if ((extra_keep->nr > 0 && i < extra_keep->nr) || p->pack_keep)
+			string_list_append(fname_kept_list, buf.buf);
+		else {
 			struct string_list_item *item;
-			item = string_list_append_nodup(fname_nonkept_list,
-							fname);
-			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+			item = string_list_append(fname_nonkept_list, buf.buf);
+			if (p->is_cruft)
 				item->util = (void*)(uintptr_t)CRUFT_PACK;
 		}
 	}
-	closedir(dir);
-	strbuf_release(&buf);

 	string_list_sort(fname_kept_list);
+	strbuf_release(&buf);
 }

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
--- >8 ---

I think you could probably go further than this, since having to store
the suffix-less pack names in the fname_kept and fname_nonkept lists is
kind of weird.

It would be nice if we could store pointers to the actual packed_git
structs themselves in place of those lists instead, but I'm not
immediately sure how feasible it would be to do since we re-prepare the
object store between enumerating and then removing these packs.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] repack: only repack .packs that exist
  2023-06-29  9:24   ` Taylor Blau
@ 2023-07-02 13:11     ` Jeff King
  2023-07-10 19:39       ` Taylor Blau
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2023-07-02 13:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee via GitGitGadget, git, gitster, Derrick Stolee

On Thu, Jun 29, 2023 at 05:24:40AM -0400, Taylor Blau wrote:

> > I also kind of wonder if this repack code should simply be loading and
> > iterating the packed_git list, but that is a much bigger change.
> 
> I have wanted to do this for a while ;-). The minimal patch is less
> invasive than I had thought:

Yeah, I agree it's not too bad. If we want to go that route, though, I
think we should do it on top of Stolee's patch, though (which makes it a
pure cleanup once the behaviors are aligned).

I'm also not sure if you'd need to do anything tricky with alternate
object dirs (it looks like the existing code ignores them entirely, so I
guess we'd want to skip entries without pack_local set).

> [...]
> I think you could probably go further than this, since having to store
> the suffix-less pack names in the fname_kept and fname_nonkept lists is
> kind of weird.
> 
> It would be nice if we could store pointers to the actual packed_git
> structs themselves in place of those lists instead, but I'm not
> immediately sure how feasible it would be to do since we re-prepare the
> object store between enumerating and then removing these packs.

I think that would work, because we do not ever drop packed_git entries
from the list (even if the files were deleted between prepare/reprepare).
But it also creates a subtle memory ownership dependency/assumption
between the two bits of code. It seems clearer to me to just copy the
names out to our own lists here (i.e., the patch you showed).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] repack: only repack .packs that exist
  2023-07-02 13:11     ` Jeff King
@ 2023-07-10 19:39       ` Taylor Blau
  0 siblings, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2023-07-10 19:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee via GitGitGadget, git, gitster, Derrick Stolee

On Sun, Jul 02, 2023 at 09:11:17AM -0400, Jeff King wrote:
> On Thu, Jun 29, 2023 at 05:24:40AM -0400, Taylor Blau wrote:
>
> > > I also kind of wonder if this repack code should simply be loading and
> > > iterating the packed_git list, but that is a much bigger change.
> >
> > I have wanted to do this for a while ;-). The minimal patch is less
> > invasive than I had thought:
>
> Yeah, I agree it's not too bad. If we want to go that route, though, I
> think we should do it on top of Stolee's patch, though (which makes it a
> pure cleanup once the behaviors are aligned).
>
> I'm also not sure if you'd need to do anything tricky with alternate
> object dirs (it looks like the existing code ignores them entirely, so I
> guess we'd want to skip entries without pack_local set).

Yep, we definitely want a `if (!p->is_local) continue` in there to be
consistent with the existing behavior.

> > [...]
> > I think you could probably go further than this, since having to store
> > the suffix-less pack names in the fname_kept and fname_nonkept lists is
> > kind of weird.
> >
> > It would be nice if we could store pointers to the actual packed_git
> > structs themselves in place of those lists instead, but I'm not
> > immediately sure how feasible it would be to do since we re-prepare the
> > object store between enumerating and then removing these packs.
>
> I think that would work, because we do not ever drop packed_git entries
> from the list (even if the files were deleted between prepare/reprepare).
> But it also creates a subtle memory ownership dependency/assumption
> between the two bits of code. It seems clearer to me to just copy the
> names out to our own lists here (i.e., the patch you showed).

Yeah, I agree. I thought that it might clean things up further, and to
an extent it does:

    https://github.com/ttaylorr/git/compare/tb/collect-packs-readdir

, but I think that the memory ownership issue is sufficiently subtle
that the clean-up isn't really worth it.

I put the above patch together with Stolee's, and sent the result back
to the list here:

    https://lore.kernel.org/git/cover.1689017830.git.me@ttaylorr.com/T/#t

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-07-10 19:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-20 19:03 [PATCH] repack: only repack .packs that exist Derrick Stolee via GitGitGadget
2023-06-21  9:01 ` Taylor Blau
2023-06-27  7:54 ` Jeff King
2023-06-29  9:24   ` Taylor Blau
2023-07-02 13:11     ` Jeff King
2023-07-10 19:39       ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).