Git Mailing List Archive mirror
* [PATCH 0/9] Repack objects into separate packfiles based on a filter
@ 2023-06-14 19:25 Christian Couder
  2023-06-14 19:25 ` [PATCH 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
                   ` (10 more replies)
  0 siblings, 11 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` option removed the
filtered out objects altogether, which was considered very dangerous
even though we implemented various safety checks in some versions of
the latter series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

This could be useful for the following purposes:

  - As a way for servers to save storage costs by for example moving
    large blobs, or blobs in inactive repos, to separate storage
    (while still making them accessible using for example the
    alternates mechanism).

  - As a way to use partial clone on a Git server to offload large
    blobs to, for example, an http server, while using multiple
    promisor remotes (to be able to access everything) on the client
    side. (In this case the packfile that contains the filtered out
    objects can be manually removed after checking that all the objects
    it contains are available through the promisor remote.)

  - As a way for clients to reclaim some space when they cloned with a
    filter to save disk space but then fetched a lot of unwanted
    objects (for example when checking out old branches) and now want
    to remove these unwanted objects. (In this case they can first
    move the packfile that contains filtered out objects to a separate
    directory or storage, then check that everything works well, and
    then manually remove the packfile after some time.)
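
For the use cases that keep the filtered out objects reachable from
separate storage, relocating a pack by hand could be sketched as follows.
This is only an illustration, not something the series implements: the
function name `move_pack_to_alternate` and every path are hypothetical,
and it assumes the pack of filtered out objects has already been written
and that the repository is not being repacked concurrently.

```shell
# Hypothetical helper, not part of this series.
# Usage: move_pack_to_alternate <git-dir> <pack-name> <alt-objects-dir>
# <pack-name> is the file name without extension, e.g. "pack-<hash>".
# <alt-objects-dir> should be an absolute path to an "objects" directory.
move_pack_to_alternate () {
	git_dir=$1 pack=$2 alt=$3

	mkdir -p "$alt/pack" "$git_dir/objects/info" || return 1

	# Move the .pack, the .idx and any other companion file
	# (.rev, .bitmap, ...) sharing the same base name.
	for f in "$git_dir/objects/pack/$pack".*
	do
		test -e "$f" || continue
		mv "$f" "$alt/pack/" || return 1
	done

	# Register the alternate object directory unless it is already listed.
	alternates=$git_dir/objects/info/alternates
	grep -qxF "$alt" "$alternates" 2>/dev/null ||
		printf '%s\n' "$alt" >>"$alternates"
}
```

Git looks for packs in the `pack` subdirectory of each object directory
listed in `objects/info/alternates`, which is why the files are moved to
`<alt-objects-dir>/pack/`.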

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

# Commit overview

* 1/9 pack-objects: allow `--filter` without `--stdout`

  This patch is the same as the first patch in the previous series. To
  be able to later repack with a filter we need `git pack-objects` to
  write packfiles when it's filtering instead of just writing the pack
  without the filtered out objects to stdout.

* 2/9 pack-objects: add `--print-filtered` to print omitted objects

  We need a way to know the objects that are filtered out of the
  packfile generated by `git pack-objects --filter=<filter-spec>`. The
  simplest way is to teach pack-objects to print their oids to stdout.

* 3/9 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object.
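
  Outside of the test suite, roughly the same information can be
  obtained with existing plumbing. The sketch below is an approximation
  of the helper, not the helper itself: `find_pack_sketch` is a
  hypothetical name, and unlike the helper it only looks at packs in the
  repository's own `objects/pack` directory, not at alternates.

```shell
# Print the path of every local pack whose index lists the given object.
# Usage: find_pack_sketch <object>   (run inside a repository)
find_pack_sketch () {
	# Resolve the argument to a full hex object ID.
	oid=$(git rev-parse --verify "$1^{object}") || return 1
	pack_dir=$(git rev-parse --git-path objects/pack) || return 1

	for idx in "$pack_dir"/pack-*.idx
	do
		test -f "$idx" || continue
		# show-index prints one line per object; the object ID is
		# the second field.
		if git show-index <"$idx" |
			awk -v oid="$oid" '$2 == oid { found = 1 } END { exit !found }'
		then
			printf '%s\n' "${idx%.idx}.pack"
		fi
	done
}
```

  Reading every index in full is much slower than the lookup done by
  the test helper, so this is only suitable for small repositories or
  quick manual checks.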

* - 4/9 repack: refactor piping an oid to a command
  - 5/9 repack: refactor finishing pack-objects command

  These are small refactorings so that `git repack --filter=...` will
  be able to reuse useful existing functions.

* 6/9 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with both the `--filter` and the
  `--print-filtered` options. From this process it reads the oids of
  the filtered out objects and passes them to a separate `git
  pack-objects` process which will pack these objects into a separate
  packfile.
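
  The parsing step can be illustrated in shell. Patch 2/9 documents that
  the stdout of `git pack-objects --print-filtered` contains the hashes
  naming the written pack(s), then a line with only six `-` characters,
  then the omitted object IDs. The sketch below splits such output; the
  function name `split_print_filtered` and its two output files are
  hypothetical, and the real implementation does this in C inside
  `builtin/repack.c`.

```shell
# Split the stdout of a `pack-objects --print-filtered` run (read from
# stdin) into two files: written pack hashes and omitted object IDs.
# Usage: split_print_filtered <packs-file> <filtered-file>
split_print_filtered () {
	awk -v packs="$1" -v filtered="$2" '
		# Create (or truncate) both output files up front.
		BEGIN { printf "" >packs; printf "" >filtered }
		# The first "------" line switches to the filtered section.
		!seen && $0 == "------" { seen = 1; next }
		# Anything else must be a full SHA-1 (40) or SHA-256 (64) hex ID.
		!/^[0-9a-f]+$/ || (length($0) != 40 && length($0) != 64) {
			print "unexpected line: " $0 >"/dev/stderr"
			exit 1
		}
		seen { print >filtered; next }
		{ print >packs }
	'
}
```

  Since `git pack-objects` reads the list of objects to pack from its
  standard input, the second file is in the form that a separate
  `git pack-objects` process expects.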

* 7/9 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate packfile.

* 8/9 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's useful to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles.

* 9/9 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`.


Christian Couder (9):
  pack-objects: allow `--filter` without `--stdout`
  pack-objects: add `--print-filtered` to print omitted objects
  t/helper: add 'find-pack' test-tool
  repack: refactor piping an oid to a command
  repack: refactor finishing pack-objects command
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  11 ++
 Documentation/git-pack-objects.txt     |  14 ++-
 Documentation/git-repack.txt           |  11 ++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |  55 ++++++--
 builtin/repack.c                       | 166 ++++++++++++++++++-------
 t/helper/test-find-pack.c              |  35 ++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t5317-pack-objects-filter-objects.sh |  27 ++++
 t/t6500-gc.sh                          |  23 ++++
 t/t7700-repack.sh                      |  43 +++++++
 13 files changed, 345 insertions(+), 53 deletions(-)
 create mode 100644 t/helper/test-find-pack.c

-- 
2.41.0.37.gae45d9845e



* [PATCH 1/9] pack-objects: allow `--filter` without `--stdout`
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-21 10:49   ` Taylor Blau
  2023-06-14 19:25 ` [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects Christian Couder
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt | 4 ++--
 builtin/pack-objects.c             | 8 ++------
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932c..583270a85f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -298,8 +298,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9cfc8801f9..af007868c1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4388,12 +4388,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
-- 
2.41.0.37.gae45d9845e



* [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
  2023-06-14 19:25 ` [PATCH 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-15 22:50   ` Junio C Hamano
  2023-06-14 19:25 ` [PATCH 3/9] t/helper: add 'find-pack' test-tool Christian Couder
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

When using the `--filter=<filter-spec>` option, `git pack-objects` will
omit some objects from the resulting packfile(s) it produces. It could
be useful to know about these omitted objects though.

For example, we might want to write these objects into a separate
packfile by piping them into another `git pack-objects` process.
Or we might want to check if these objects are available from a
promisor remote.

This patch implements a simple way to learn about these objects: their
oids are printed, one per line, on stdout when the new
`--print-filtered` flag is passed.

As `--print-filtered` doesn't make sense without `--filter`, using the
former without the latter is disallowed.

Using `--stdout` is likely to make the `--print-filtered` output
difficult to find or parse, so we also disallow using these two options
together.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 10 ++++++
 builtin/pack-objects.c                 | 47 ++++++++++++++++++++++++--
 t/t5317-pack-objects-filter-objects.sh | 27 +++++++++++++++
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 583270a85f..6469080029 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -305,6 +305,16 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 --no-filter::
 	Turns off any previous `--filter=` argument.
 
+--print-filtered::
+	Requires `--filter=`. Prints on stdout, one per line, the
+	object IDs of the objects that are filtered out from the
+	resulting packfile by the filter. This is incompatible with
+	`--stdout`. As <SHA-1> hashes are already written to stdout
+	based on the resulting pack contents (see the `base-name`
+	argument), a line containing only six `-` characters is
+	written after those <SHA-1> hashes, before the filtered object
+	IDs.
+
 --missing=<missing-action>::
 	A debug option to help with future "partial clone" development.
 	This option specifies how missing objects are handled.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index af007868c1..c8e2b6b859 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -266,6 +266,12 @@ static struct oidmap configured_exclusions;
 
 static struct oidset excluded_by_config;
 
+/*
+ * Objects omitted by filter
+ */
+static int print_filtered_out;
+static struct oidset *omitted_by_filter;
+
 /*
  * stats
  */
@@ -4065,11 +4071,18 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 		die(_("revision walk setup failed"));
 	mark_edges_uninteresting(revs, show_edge, sparse);
 
+	if (print_filtered_out) {
+		omitted_by_filter = xmalloc(sizeof(*omitted_by_filter));
+		oidset_init(omitted_by_filter, 0);
+	}
+
 	if (!fn_show_object)
 		fn_show_object = show_object;
-	traverse_commit_list(revs,
-			     show_commit, fn_show_object,
-			     NULL);
+	traverse_commit_list_filtered(revs,
+				      show_commit,
+				      fn_show_object,
+				      NULL,
+				      omitted_by_filter);
 
 	if (unpack_unreachable_expiration) {
 		revs->ignore_missing_links = 1;
@@ -4165,6 +4178,23 @@ static int option_parse_cruft_expiration(const struct option *opt,
 	return 0;
 }
 
+static void print_omitted_by_filter(void)
+{
+	struct oidset_iter iter;
+	const struct object_id *oid;
+
+	fprintf_ln(stdout, "%s", "------");
+	fprintf_ln(stderr, "%s", _("Printing objects omitted by filter"));
+
+	oidset_iter_init(omitted_by_filter, &iter);
+
+	while ((oid = oidset_iter_next(&iter)))
+		fprintf_ln(stdout, "%s", oid_to_hex(oid));
+
+	oidset_clear(omitted_by_filter);
+	FREE_AND_NULL(omitted_by_filter);
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -4278,6 +4308,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
 				N_("protocol"),
 				N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
+		OPT_BOOL(0, "print-filtered", &print_filtered_out,
+			 N_("print filtered out objects to stdout")),
 		OPT_END(),
 	};
 
@@ -4394,6 +4426,12 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (print_filtered_out && !filter_options.choice)
+		die(_("cannot use --print-filtered without --filter"));
+
+	if (print_filtered_out && pack_to_stdout)
+		die(_("cannot use --print-filtered with --stdout"));
+
 	if (cruft) {
 		if (use_internal_rev_list)
 			die(_("cannot use internal rev list with --cruft"));
@@ -4509,6 +4547,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			   written, written_delta, reused, reused_delta,
 			   reuse_packfile_objects);
 
+	if (omitted_by_filter)
+		print_omitted_by_filter();
+
 cleanup:
 	list_objects_filter_release(&filter_options);
 	strvec_clear(&rp);
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..ec3a03d90a 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -438,6 +438,33 @@ test_expect_success 'verify sparse:oid=oid-ish' '
 	test_cmp expected observed
 '
 
+# Test pack-objects with --print-filtered option
+
+test_expect_success 'pack-objects fails w/ both --print-filtered and --stdout' '
+	test_must_fail git -C r1 pack-objects --revs --stdout \
+		--filter=blob:none --print-filtered >filter.out <<-EOF
+	HEAD
+	EOF
+'
+
+test_expect_success 'pack-objects w/ --print-filtered and a pack name' '
+	git -C r1 pack-objects --revs --filter=blob:none \
+		--print-filtered filtered-pack >filter.out <<-EOF &&
+	HEAD
+	EOF
+
+	# Check that the second line contains "------"
+	head -n 2 filter.out | tail -n 1 >actual &&
+	echo "------" >expected &&
+	test_cmp expected actual &&
+
+	# Remove the first two lines and check there are all the blobs
+	tail -n +3 filter.out | sort >actual &&
+	git -C r1 cat-file --batch-check --batch-all-objects | grep blob |
+		sed -e "s/ blob.*//" | sort >expected &&
+	test_cmp expected actual
+'
+
 # Delete some loose objects and use pack-objects, but WITHOUT any filtering.
 # This models previously omitted objects that we did not receive.
 
-- 
2.41.0.37.gae45d9845e



* [PATCH 3/9] t/helper: add 'find-pack' test-tool
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
  2023-06-14 19:25 ` [PATCH 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-06-14 19:25 ` [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-15 23:32   ` Junio C Hamano
  2023-06-14 19:25 ` [PATCH 4/9] repack: refactor piping an oid to a command Christian Couder
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects into
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 35 +++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 4 files changed, 38 insertions(+)
 create mode 100644 t/helper/test-find-pack.c

diff --git a/Makefile b/Makefile
index e440728c24..c1cd735b31 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..1928fe7329
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,35 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ */
+
+static const char *find_pack_usage = "\n"
+"  test-tool find-pack <object>";
+
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(find_pack_usage);
+
+	if (repo_get_oid(the_repository, argv[1], &oid))
+		die("cannot parse %s as an object name", argv[1]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (find_pack_entry_one(oid.hash, p))
+			printf("%s\n", p->pack_name);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb..41da40c296 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c..411dbf2db4 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
-- 
2.41.0.37.gae45d9845e



* [PATCH 4/9] repack: refactor piping an oid to a command
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (2 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 3/9] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-15 23:46   ` Junio C Hamano
  2023-06-14 19:25 ` [PATCH 5/9] repack: refactor finishing pack-objects command Christian Couder
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

Create a new write_oid_hex_cmd() function to send an oid to the standard
input of a running command. This new function will be used in a
following commit.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 0541c3ce15..e591c295cf 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -182,6 +182,17 @@ static void prepare_pack_objects(struct child_process *cmd,
 	cmd->out = -1;
 }
 
+static void write_oid_hex_cmd(const char *oid_hex,
+			      struct child_process *cmd,
+			      const char *err_msg)
+{
+	if (cmd->in == -1 && start_command(cmd))
+		die("%s", err_msg);
+
+	xwrite(cmd->in, oid_hex, the_hash_algo->hexsz);
+	xwrite(cmd->in, "\n", 1);
+}
+
 /*
  * Write oid to the given struct child_process's stdin, starting it first if
  * necessary.
@@ -192,13 +203,8 @@ static int write_oid(const struct object_id *oid,
 {
 	struct child_process *cmd = data;
 
-	if (cmd->in == -1) {
-		if (start_command(cmd))
-			die(_("could not start pack-objects to repack promisor objects"));
-	}
-
-	xwrite(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz);
-	xwrite(cmd->in, "\n", 1);
+	write_oid_hex_cmd(oid_to_hex(oid), cmd,
+			  _("could not start pack-objects to repack promisor objects"));
 	return 0;
 }
 
-- 
2.41.0.37.gae45d9845e



* [PATCH 5/9] repack: refactor finishing pack-objects command
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (3 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 4/9] repack: refactor piping an oid to a command Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-16  0:13   ` Junio C Hamano
  2023-06-21 11:05   ` Taylor Blau
  2023-06-14 19:25 ` [PATCH 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting it into a string_list, as well as
calling finish_command().

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 78 ++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 39 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index e591c295cf..f1adacf1d0 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -703,6 +703,42 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   const char *destination)
+{
+	int local = 1;
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	if (destination) {
+		const char *scratch;
+		local = skip_prefix(destination, packdir, &scratch);
+	}
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -712,12 +748,9 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct string_list *existing_kept_packs)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
-	const char *scratch;
-	int local = skip_prefix(destination, packdir, &scratch);
 
 	prepare_pack_objects(&cmd, args, destination);
 
@@ -758,27 +791,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, destination);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -789,10 +802,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1023,18 +1034,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, NULL);
 	if (ret)
 		goto cleanup;
 
-- 
2.41.0.37.gae45d9845e



* [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (4 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 5/9] repack: refactor finishing pack-objects command Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-16  0:43   ` Junio C Hamano
  2023-06-21 11:17   ` Taylor Blau
  2023-06-14 19:25 ` [PATCH 7/9] gc: add `gc.repackFilter` config option Christian Couder
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

After cloning with --filter=<filter-spec>, for example to avoid
getting unneeded large files on a user machine, it's possible
that some of these large files still get fetched over time for some
reason (like checking out old branches).

In this case the repo size could grow too much for no good reason and a
way to filter out some objects would be useful to remove the unneeded
large files.

Deleting objects right away could corrupt a repo though, so it might be
better to put those objects into a separate packfile instead of
deleting them. The separate pack could then be removed after checking
that all the objects in it are still available on a promisor remote it
can access.

Also, splitting a packfile into 2 packs depending on a filter could be
useful in other use cases. For example, some large blobs might take a lot
of precious space on fast storage while they are rarely accessed, and
it could make sense to move them to separate, cheaper though slower,
storage.

This commit implements a new `--filter=<filter-spec>` option in
`git repack` that moves filtered out objects into a separate pack.

This is done by reading filtered out objects from `git pack-objects`'s
output and piping them into a separate `git pack-objects` process that
will put them into a separate packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  5 +++
 builtin/repack.c             | 75 ++++++++++++++++++++++++++++++++++--
 t/t7700-repack.sh            | 16 ++++++++
 3 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..aa29c7e648 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,11 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index f1adacf1d0..b13d7196de 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -53,6 +53,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -167,6 +168,10 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter) {
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
+		strvec_pushf(&cmd->args, "--print-filtered");
+	}
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -703,13 +708,21 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static void pack_filtered(const char *oid_hex, struct child_process *cmd)
+{
+	write_oid_hex_cmd(oid_hex, cmd,
+			  _("could not start pack-objects to pack filtered objects"));
+}
+
 static int finish_pack_objects_cmd(struct child_process *cmd,
 				   struct string_list *names,
-				   const char *destination)
+				   const char *destination,
+				   struct child_process *pack_filtered_cmd)
 {
 	int local = 1;
 	FILE *out;
 	struct strbuf line = STRBUF_INIT;
+	int filtered_start = 0;
 
 	if (destination) {
 		const char *scratch;
@@ -720,9 +733,20 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		struct string_list_item *item;
 
+		if (!filtered_start && pack_filtered_cmd && !strcmp(line.buf, "------")) {
+			filtered_start = 1;
+			continue;
+		}
+
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only "
 			      "from pack-objects."));
+
+		if (pack_filtered_cmd && filtered_start) {
+			pack_filtered(line.buf, pack_filtered_cmd);
+			continue;
+		}
+
 		/*
 		 * Avoid putting packs written outside of the repository in the
 		 * list of names.
@@ -791,9 +815,44 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	return finish_pack_objects_cmd(&cmd, names, destination);
+	return finish_pack_objects_cmd(&cmd, names, destination, NULL);
 }
 
+/*
+ * Prepare the command that will pack objects that have been filtered
+ * out from the original pack, so that they will end up in a separate
+ * pack.
+ */
+static void prepare_pack_filtered_cmd(struct child_process *cmd,
+				      const struct pack_objects_args *args,
+				      const char *destination)
+{
+	/* We need to copy args to modify it */
+	struct pack_objects_args new_args = *args;
+
+	/* No need to filter again */
+	new_args.filter = NULL;
+
+	prepare_pack_objects(cmd, &new_args, destination);
+	cmd->in = -1;
+}
+
+static void finish_pack_filtered_cmd(struct child_process *cmd,
+				     struct string_list *names)
+{
+	if (cmd->in == -1) {
+		/* No packed objects; cmd was never started */
+		child_process_clear(cmd);
+		return;
+	}
+
+	close(cmd->in);
+
+	if (finish_pack_objects_cmd(cmd, names, NULL, NULL))
+		die(_("could not finish pack-objects to pack filtered objects"));
+}
+
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -817,6 +876,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	struct child_process pack_filtered_cmd = CHILD_PROCESS_INIT;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -858,6 +918,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -1011,6 +1073,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (po_args.filter)
+		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
+
 	if (geometry)
 		cmd.in = -1;
 	else
@@ -1034,7 +1099,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	ret = finish_pack_objects_cmd(&cmd, &names, NULL);
+	ret = finish_pack_objects_cmd(&cmd, &names, NULL,
+				      po_args.filter ? &pack_filtered_cmd : NULL);
 	if (ret)
 		goto cleanup;
 
@@ -1102,6 +1168,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter)
+		finish_pack_filtered_cmd(&pack_filtered_cmd, &names);
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index faa739eeb9..9e7654090f 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -270,6 +270,22 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.41.0.37.gae45d9845e



* [PATCH 7/9] gc: add `gc.repackFilter` config option
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (5 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-14 19:25 ` [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a separate pack.

Users might want to perform such a cleanup regularly, at the same time as
they perform other repacks and cleanups, that is, as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack and a non-empty <filter-spec> is
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 12 ++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 7f95c866e1..055c4e0db6 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -130,6 +130,11 @@ or rebase occurring.  Since these changes are not part of the current
 project most users will want to expire them sooner, which is why the
 default is more aggressive than `gc.reflogExpire`.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index f3942188a6..1c57913214 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -61,6 +61,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -170,6 +171,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -355,6 +358,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..5b89faf505 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,18 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.41.0.37.gae45d9845e



* [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (6 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 7/9] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-16  2:21   ` Junio C Hamano
  2023-06-21 11:49   ` Taylor Blau
  2023-06-14 19:25 ` [PATCH 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a separate pack.

It would be nice if this separate pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage. Even in a different directory, this pack can
remain accessible if, for example, the Git alternates mechanism is used
to point to it.

If users want to remove a pack that contains filtered out objects after
checking that they are all already on a promisor remote, creating the
pack in a different directory makes it easier to do so.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  6 ++++++
 builtin/repack.c             | 17 ++++++++++++-----
 t/t7700-repack.sh            | 27 +++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index aa29c7e648..070dd22610 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -148,6 +148,12 @@ depth is 4095.
 	resulting packfile and put them into a separate packfile. See
 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. This can be used for putting the pack on a
+	separate object directory that is accessed through the Git
+	alternates mechanism. Only useful with `--filter`.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index b13d7196de..8c71e8fd51 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -838,7 +838,8 @@ static void prepare_pack_filtered_cmd(struct child_process *cmd,
 }
 
 static void finish_pack_filtered_cmd(struct child_process *cmd,
-				     struct string_list *names)
+				     struct string_list *names,
+				     const char *destination)
 {
 	if (cmd->in == -1) {
 		/* No packed objects; cmd was never started */
@@ -848,7 +849,7 @@ static void finish_pack_filtered_cmd(struct child_process *cmd,
 
 	close(cmd->in);
 
-	if (finish_pack_objects_cmd(cmd, names, NULL, NULL))
+	if (finish_pack_objects_cmd(cmd, names, destination, NULL))
 		die(_("could not finish pack-objects to pack filtered objects"));
 }
 
@@ -877,6 +878,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
 	struct child_process pack_filtered_cmd = CHILD_PROCESS_INIT;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -930,6 +932,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1073,8 +1077,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
-	if (po_args.filter)
-		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
+	if (po_args.filter) {
+		if (!filter_to)
+			filter_to = packtmp;
+		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
+	}
 
 	if (geometry)
 		cmd.in = -1;
@@ -1169,7 +1176,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter)
-		finish_pack_filtered_cmd(&pack_filtered_cmd, &names);
+		finish_pack_filtered_cmd(&pack_filtered_cmd, &names, filter_to);
 
 	string_list_sort(&names);
 
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 9e7654090f..898f8a01b4 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -286,6 +286,33 @@ test_expect_success 'repacking with a filter works' '
 	test "$blob_pack2" = "$blob_pack"
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -z "$blob_pack" &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) &&
+	test -n "$blob_pack" &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.41.0.37.gae45d9845e



* [PATCH 9/9] gc: add `gc.repackFilterTo` config option
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (7 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-06-14 19:25 ` Christian Couder
  2023-06-16  2:54   ` Junio C Hamano
  2023-06-14 21:36 ` [PATCH 0/9] Repack objects into separate packfiles based on a filter Junio C Hamano
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
  10 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-14 19:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack and a non-empty <dir> is
configured through this option, the repack process is passed a
corresponding `--filter-to=<dir>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  6 ++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 055c4e0db6..699ad887b3 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -135,6 +135,12 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects.  See the
+	`--filter-to=<dir>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 1c57913214..87f5fc6946 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -172,6 +173,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -361,6 +363,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5b89faf505..37056a824b 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -214,6 +213,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo store filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.41.0.37.gae45d9845e



* Re: [PATCH 0/9] Repack objects into separate packfiles based on a filter
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (8 preceding siblings ...)
  2023-06-14 19:25 ` [PATCH 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-06-14 21:36 ` Junio C Hamano
  2023-06-16  3:08   ` Junio C Hamano
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
  10 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-06-14 21:36 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt

Christian Couder <christian.couder@gmail.com> writes:

> In some discussions, it was mentioned that such a feature, or a
> similar feature in `git gc`, or in a new standalone command (perhaps
> called `git prune-filtered`), should put the filtered out objects into
> a new packfile instead of deleting them.
>
> Recently there were internal discussions at GitLab about either moving
> blobs from inactive repos onto cheaper storage, or moving large blobs
> onto cheaper storage. This lead us to rethink at repacking using a
> filter, but moving the filtered out objects into a separate packfile
> instead of deleting them.
>
> So here is a new patch series doing that while implementing the
> `--filter=<filter-spec>` option in `git repack`.

Very interesting idea, indeed, and would be very useful.
Thanks.



* Re: [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects
  2023-06-14 19:25 ` [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects Christian Couder
@ 2023-06-15 22:50   ` Junio C Hamano
  2023-06-21 10:52     ` Taylor Blau
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-06-15 22:50 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> When using the `--filter=<filter-spec>` option, `git pack-objects` will
> omit some objects from the resulting packfile(s) it produces. It could
> be useful to know about these omitted objects though.
>
> For example, we might want to write these objects into a separate
> packfile by piping them into another `git pack-object` process.
> Or we might want to check if these objects are available from a
> promisor remote.
>
> Anyway, this patch implements a simple way to let us know about these
> objects by simply printing their oid, one per line, on stdout when the
> new `--print-filtered` flag is passed.

Makes sense.  It is a bit sad that we have to accumulate everything
until the end, at which time we have to dump the accumulated list in bulk,
but that is a current limitation of list-objects-filter API and not
within the scope of this change.  We may in the longer term want to
see if we can make the collection of filtered-out objects streamable
by replacing the .omits object array with a callback function, or do
something along that line.
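
To make the callback idea above concrete, here is a minimal sketch.
Everything in it is hypothetical: `omit_cb`, `walk_and_report_omitted()`
and the two example callbacks are illustration-only names and are not
part of Git's list-objects-filter API. The point is only that a
traversal which calls back for each omitted object lets the consumer
print or count as it goes, instead of waiting for an accumulated array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type; NOT part of Git's list-objects-filter API. */
typedef void (*omit_cb)(const char *oid_hex, void *data);

/*
 * Hypothetical stand-in for a filtered traversal: omitted[i] is non-zero
 * when the filter left oids[i] out. Each omitted object is reported as
 * soon as it is seen, so nothing has to be accumulated.
 */
void walk_and_report_omitted(const char *const *oids, const int *omitted,
			     size_t nr, omit_cb cb, void *data)
{
	size_t i;

	assert(cb != NULL);
	for (i = 0; i < nr; i++)
		if (omitted[i])
			cb(oids[i], data);
}

/* Example consumer: print each omitted oid, one per line, to a FILE *. */
void print_omitted(const char *oid_hex, void *data)
{
	fprintf((FILE *)data, "%s\n", oid_hex);
}

/* Example consumer: only count the omitted objects. */
void count_omitted(const char *oid_hex, void *data)
{
	(void)oid_hex;
	(*(size_t *)data)++;
}
```

A caller would pass `print_omitted` with `stdout` to get the streaming
equivalent of `--print-filtered`, or `count_omitted` with a pointer to a
`size_t` counter.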


* Re: [PATCH 3/9] t/helper: add 'find-pack' test-tool
  2023-06-14 19:25 ` [PATCH 3/9] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-06-15 23:32   ` Junio C Hamano
  2023-06-21 10:40     ` Christian Couder
  2023-06-21 10:54     ` Taylor Blau
  0 siblings, 2 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-15 23:32 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> In a following commit, we will make it possible to separate objects in
> different packfiles depending on a filter.
>
> To make sure that the right objects are in the right packs, let's add a
> new test-tool that can display which packfile(s) a given object is in.

This tool would be serviceable if we only are interested in checking
just a few objects, but if we were to check many objects, I have to
wonder if it would be more efficient to use show-index to dump the
list of objects per pack, which should be sorted by object name, so
it should be trivial to run "comm" with the list of objects you want
to check.

Or if you only are checking about a dozen or so, taking one or more
arguments from the command line and looping over them may also be
OK.  The output format of course may have to be changed, if we were
to go that route, though.

It really depends on the granularity at which this test helper wants
to work, I think.

Anyway...

> diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
> index abe8a785eb..41da40c296 100644
> --- a/t/helper/test-tool.c
> +++ b/t/helper/test-tool.c
> @@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
>  	{ "env-helper", cmd__env_helper },
>  	{ "example-decorate", cmd__example_decorate },
>  	{ "fast-rebase", cmd__fast_rebase },
> +	{ "find-pack", cmd__find_pack },
>  	{ "fsmonitor-client", cmd__fsmonitor_client },
>  	{ "genrandom", cmd__genrandom },
>  	{ "genzeros", cmd__genzeros },
> diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
> index ea2672436c..411dbf2db4 100644
> --- a/t/helper/test-tool.h
> +++ b/t/helper/test-tool.h
> @@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
>  int cmd__env_helper(int argc, const char **argv);
>  int cmd__example_decorate(int argc, const char **argv);
>  int cmd__fast_rebase(int argc, const char **argv);
> +int cmd__find_pack(int argc, const char **argv);
>  int cmd__fsmonitor_client(int argc, const char **argv);
>  int cmd__genrandom(int argc, const char **argv);
>  int cmd__genzeros(int argc, const char **argv);


* Re: [PATCH 4/9] repack: refactor piping an oid to a command
  2023-06-14 19:25 ` [PATCH 4/9] repack: refactor piping an oid to a command Christian Couder
@ 2023-06-15 23:46   ` Junio C Hamano
  2023-06-21 10:55     ` Taylor Blau
  2023-06-21 10:56     ` Christian Couder
  0 siblings, 2 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-15 23:46 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> Create a new write_oid_hex_cmd() function to send an oid to the standard
> input of a running command. This new function will be used in a
> following commit.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  builtin/repack.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 0541c3ce15..e591c295cf 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -182,6 +182,17 @@ static void prepare_pack_objects(struct child_process *cmd,
>  	cmd->out = -1;
>  }
>  
> +static void write_oid_hex_cmd(const char *oid_hex,
> +			      struct child_process *cmd,
> +			      const char *err_msg)
> +{
> +	if (cmd->in == -1 && start_command(cmd))
> +		die("%s", err_msg);

I am not sure why we would want to conflate the "if we haven't
started the command, auto-start it upon our first attempt to write"
in this low-level "I am designed to do one thing, which is to feed
the object name to the process, and do it well" function.

The caller in the original shares the same issue, so we could say
that this patch is not creating a new problem, but it somehow
feels like it is making the existing problem even worse.

And I think the error handling here shows why the API feels wrong.
When auto-start fails, we have a message, but when write fails,
there is no custom message---it makes it look as if write_oid_hex_cmd() is
primarily about starting, which is so important relative to its
other functionalities and deserves a custom error message, but that
is not the message you want to be conveying.

> +	xwrite(cmd->in, oid_hex, the_hash_algo->hexsz);
> +	xwrite(cmd->in, "\n", 1);

I would have expected that the "refactor" at least would reduce the
number of system calls by combining these two writes into one using
an on-stack local variable char buf[GIT_MAX_HEXSZ+1] or something.
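
A minimal sketch of the single-write variant suggested above, written
against plain POSIX rather than Git's internals. The names
`SKETCH_MAX_HEXSZ`, `format_oid_line()` and `write_oid_line()` are
hypothetical stand-ins (Git itself would use GIT_MAX_HEXSZ and its own
write helpers); the sketch only shows the oid and the newline going out
through one buffer instead of two separate writes.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for Git's GIT_MAX_HEXSZ (64 hex digits for SHA-256). */
#define SKETCH_MAX_HEXSZ 64

/*
 * Copy the hex object ID plus a trailing newline into buf, which must
 * hold at least SKETCH_MAX_HEXSZ + 1 bytes. Returns the line length.
 */
size_t format_oid_line(char *buf, const char *oid_hex, size_t hexsz)
{
	assert(hexsz <= SKETCH_MAX_HEXSZ);
	memcpy(buf, oid_hex, hexsz);
	buf[hexsz] = '\n';
	return hexsz + 1;
}

/*
 * Send one "<oid>\n" line to fd. In the common case this is a single
 * write(2) call; the loop only handles short writes and EINTR.
 * Returns 0 on success, -1 on error.
 */
int write_oid_line(int fd, const char *oid_hex, size_t hexsz)
{
	char buf[SKETCH_MAX_HEXSZ + 1];
	size_t len = format_oid_line(buf, oid_hex, hexsz);
	size_t off = 0;

	while (off < len) {
		ssize_t n = write(fd, buf + off, len - off);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		off += (size_t)n;
	}
	return 0;
}
```

Keeping the formatting in its own small function also makes it easy to
check the buffer contents without involving a child process.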


* Re: [PATCH 5/9] repack: refactor finishing pack-objects command
  2023-06-14 19:25 ` [PATCH 5/9] repack: refactor finishing pack-objects command Christian Couder
@ 2023-06-16  0:13   ` Junio C Hamano
  2023-06-21 11:06     ` Taylor Blau
  2023-06-21 11:05   ` Taylor Blau
  1 sibling, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-06-16  0:13 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt

Christian Couder <christian.couder@gmail.com> writes:

> +static int finish_pack_objects_cmd(struct child_process *cmd,
> +				   struct string_list *names,
> +				   const char *destination)
> +{
> +	int local = 1;
> +	FILE *out;
> +	struct strbuf line = STRBUF_INIT;
> +
> +	if (destination) {
> +		const char *scratch;
> +		local = skip_prefix(destination, packdir, &scratch);
> +	}
> +
> +	out = xfdopen(cmd->out, "r");
> +	while (strbuf_getline_lf(&line, out) != EOF) {
> +		struct string_list_item *item;
> +
> +		if (line.len != the_hash_algo->hexsz)
> +			die(_("repack: Expecting full hex object ID lines only "
> +			      "from pack-objects."));
> +		/*
> +		 * Avoid putting packs written outside of the repository in the
> +		 * list of names.
> +		 */
> +		if (local) {
> +			item = string_list_append(names, line.buf);
> +			item->util = populate_pack_exts(line.buf);
> +		}
> +	}
> +	fclose(out);
> +
> +	strbuf_release(&line);
> +
> +	return finish_command(cmd);
> +}

Computing "is it local?" based on the value of "destination" feels
like it belongs to the caller (one of the callers that do need the
computation), not to this function, especially given that the full
value of "destination" is not even used in any other way in this
function.  And the "is_local?" bit can instead be passed into this
helper function as a parameter.
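
As a rough illustration of that split, the sketch below moves the
prefix check into a caller-side function and hands the helper only a
flag. Both `destination_is_local()` and `record_pack_name()` are
hypothetical names, and the plain array stands in for the string_list
used in repack.c; only the NULL-means-local and starts-with-packdir
rules are taken from the quoted code.

```c
#include <assert.h>
#include <string.h>

/*
 * Caller-side check (hypothetical name): a NULL destination means the
 * default location inside the repository; otherwise the destination is
 * local only when it starts with packdir.
 */
int destination_is_local(const char *destination, const char *packdir)
{
	assert(packdir != NULL);
	if (!destination)
		return 1;
	return strncmp(destination, packdir, strlen(packdir)) == 0;
}

/*
 * Helper side (hypothetical name): it never sees the destination, only
 * the precomputed flag. A pack name is recorded only for local packs.
 * Returns the updated number of recorded names.
 */
size_t record_pack_name(const char **names, size_t count,
			const char *name, int is_local)
{
	if (is_local)
		names[count++] = name;
	return count;
}
```

With this shape, a caller that always writes into the repository can
simply pass 1 and skip the prefix check altogether.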

I wondered what "beautify" was about---the original looks OK to me
already, and while I do not mind seeing a full sentence spelled in a
more grammatically correct way like in the postimage, I do not think
the change was worth wasting reviewers' time wondering if there are
other improvements by pointing it out in the proposed log message.

Thanks.


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-14 19:25 ` [PATCH 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-06-16  0:43   ` Junio C Hamano
  2023-06-21 11:20     ` Taylor Blau
  2023-06-21 14:40     ` Christian Couder
  2023-06-21 11:17   ` Taylor Blau
  1 sibling, 2 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-16  0:43 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> After cloning with --filter=<filter-spec>, for example to avoid
> getting unneeded large files on a user machine, it's possible
> that some of these large files still get fetched for some reasons
> (like checking out old branches) over time.
>
> In this case the repo size could grow too much for no good reason and a
> way to filter out some objects would be useful to remove the unneeded
> large files.

Makes sense.

If we repack without these objects, when the repository has a
promisor remote, we should be able to rely on that remote to supply
them on demand, once we need them again, no?

> Deleting objects right away could corrupt a repo though,...

Hmph, could you elaborate why it is the case?  Isn't it the whole
point to have promisor remote and use a lazy clone with the --filter
option, so that objects that _ought_ to exist from connectivity's
point of view _are_ allowed to be missing because the promisor
promises to make them available on-demand?

	Side note: I think I know the answer. While trying to remove
	UNNEEDED large files, doing so may discard NEEDED large
	files when done carelessly (e.g. the file may have been
	created locally and hasn't been pushed back). But (1) if
	that is the problem, perhaps we should be more careful in
	the first place? (2) if it inherently is impossible to tell
	which ones are unneeded reliably, the reason why it is
	impossible, and the reason why "try sifting into two bins,
	one that we _think_ are unneeded and another for the rest,
	and verify what we _thought_ are unneeded are all available
	from the promisor remote" is the best we can do, must be
	described, I think.

> ... so it might be
> better to put those objects into a separate packfile instead of
> deleting them. The separate pack could then be removed after checking
> that all the objects in it are still available on a promisor remote it
> can access.

Surely, sifting the objects into two bins (i.e. those that we
wouldn't have received if we cloned from the promisor remote just
now, which are prunable, and those that we cannot lose because the
promisor remote would not have them, e.g. we created them and have
not pushed them to the remote yet) without removing anything would
be safe, but if the result of such sifting must be verified, doesn't
it indicate that the sifting step was buggy or misdesigned?  It does
not sound like a very good justification to save them in a separate
packfile.  It does smell somewhat similar to the cruft packs but not
really (the choice over there is between exploding to loose and
keeping in a pack, and never involves loss of objects).

> Also splitting a packfile into 2 packs depending on a filter could be
> useful in other usecases. For example some large blobs might take a lot
> of precious space on fast storage while they are rarely accessed, and
> it could make sense to move them in a separate cheaper, though slower,
> storage.

This one, outside the context of partial clone client, does make
tons of sense.

I guess what I suspect is that this option, while it would be very
useful for the "in other usecases" scenario above, may not become
all that useful in the "our lazy clone got bloated and we want to
trim objects we know we can retrieve from the promisor remote again
if necessary" scenario, until the repack machinery learns to use an
extra piece of information (namely "these are objects that we can
fetch from the promisor remote") at the same time.

> This commit implements a new `--filter=<filter-spec>` option in
> `git repack` that moves filtered out objects into a separate pack.
>
> This is done by reading filtered out objects from `git pack-objects`'s
> output and piping them into a separate `git pack-objects` process that
> will put them into a separate packfile.

So, for example, you may say "no blobs" in the filter, and while
packing the local repository with the filter, resulting in a pack
that excludes all blobs, we will learn what blob objects we did not
pack into that packfile.  We can pack them into a separate one, and
most of the blobs are what we could retrieve again from the promisor
remote, but some of the blobs are what we locally created ourselves
and haven't pushed back to the promisor remote yet.  Now what?  My
earlier suspicion that this mechanism may not be all that useful for
the "slim bloated lazy clone" case comes from the fact that I cannot
think of a good answer to this "Now what?" question---my naive solution would
involve enumerating the objects in that "separate packfile" that is
a mixture of precious ones and expendable ones, and then learning
which ones are precious, and creating a new pack that is a subset of
that "separate packfile" with only the precious ones.  But if I do
so, I do not think we need this new mechanism that seems to go only
the half-way.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-14 19:25 ` [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-06-16  2:21   ` Junio C Hamano
  2023-06-21 11:49   ` Taylor Blau
  1 sibling, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-16  2:21 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> A previous commit has implemented `git repack --filter=<filter-spec>` to
> allow users to filter out some objects from the main pack and move them
> into a new different pack.
>
> It would be nice if this new different pack could be created in a
> different directory than the regular pack. This would make it possible
> to move large blobs into a pack on a different kind of storage, for
> example cheaper storage. Even in a different directory this pack can be
> accessible if, for example, the Git alternates mechanism is used to
> point to it.

Makes sense, I guess, for "in other usecases" scenario.  I am not
sure how this would be useful for the originally stated goal of
unbloating a bloated repository with promisor remote(s), though. 

> If users want to remove a pack that contains filtered out objects after
> checking that they are all already on a promisor remote, creating the
> pack in a different directory makes it easier to do so.

Care to elaborate?  I do not see how a separate directory would make
it easier.  After separating the potential cruft into a packfile,
you'd walk its .idx and see if there are any objects that are not
available (yet) at the promisor remotes to check if it is safe to
remove.  That can be done regardless of the location of the packfile
that is suspected to be now removable.

> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-repack.txt |  6 ++++++
>  builtin/repack.c             | 17 ++++++++++++-----
>  t/t7700-repack.sh            | 27 +++++++++++++++++++++++++++
>  3 files changed, 45 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index aa29c7e648..070dd22610 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -148,6 +148,12 @@ depth is 4095.
>  	resulting packfile and put them into a separate packfile. See
>  	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
>  
> +--filter-to=<dir>::
> +	Write the pack containing filtered out objects to the
> +	directory `<dir>`. This can be used for putting the pack on a
> +	separate object directory that is accessed through the Git
> +	alternates mechanism. Only useful with `--filter`.
> +
>  -b::
>  --write-bitmap-index::
>  	Write a reachability bitmap index as part of the repack. This
> diff --git a/builtin/repack.c b/builtin/repack.c
> index b13d7196de..8c71e8fd51 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -838,7 +838,8 @@ static void prepare_pack_filtered_cmd(struct child_process *cmd,
>  }
>  
>  static void finish_pack_filtered_cmd(struct child_process *cmd,
> -				     struct string_list *names)
> +				     struct string_list *names,
> +				     const char *destination)
>  {
>  	if (cmd->in == -1) {
>  		/* No packed objects; cmd was never started */
> @@ -848,7 +849,7 @@ static void finish_pack_filtered_cmd(struct child_process *cmd,
>  
>  	close(cmd->in);
>  
> -	if (finish_pack_objects_cmd(cmd, names, NULL, NULL))
> +	if (finish_pack_objects_cmd(cmd, names, destination, NULL))
>  		die(_("could not finish pack-objects to pack filtered objects"));
>  }
>  
> @@ -877,6 +878,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	const char *cruft_expiration = NULL;
>  	const char *expire_to = NULL;
>  	struct child_process pack_filtered_cmd = CHILD_PROCESS_INIT;
> +	const char *filter_to = NULL;
>  
>  	struct option builtin_repack_options[] = {
>  		OPT_BIT('a', NULL, &pack_everything,
> @@ -930,6 +932,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  			   N_("write a multi-pack index of the resulting packs")),
>  		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
>  			   N_("pack prefix to store a pack containing pruned objects")),
> +		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
> +			   N_("pack prefix to store a pack containing filtered out objects")),
>  		OPT_END()
>  	};
>  
> @@ -1073,8 +1077,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		strvec_push(&cmd.args, "--incremental");
>  	}
>  
> -	if (po_args.filter)
> -		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
> +	if (po_args.filter) {
> +		if (!filter_to)
> +			filter_to = packtmp;
> +		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
> +	}
>  
>  	if (geometry)
>  		cmd.in = -1;
> @@ -1169,7 +1176,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	}
>  
>  	if (po_args.filter)
> -		finish_pack_filtered_cmd(&pack_filtered_cmd, &names);
> +		finish_pack_filtered_cmd(&pack_filtered_cmd, &names, filter_to);
>  
>  	string_list_sort(&names);
>  
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 9e7654090f..898f8a01b4 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -286,6 +286,33 @@ test_expect_success 'repacking with a filter works' '
>  	test "$blob_pack2" = "$blob_pack"
>  '
>  
> +test_expect_success '--filter-to stores filtered out objects' '
> +	git -C bare.git repack -a -d &&
> +	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
> +
> +	git init --bare filtered.git &&
> +	git -C bare.git -c repack.writebitmaps=false repack -a -d \
> +		--filter=blob:none \
> +		--filter-to=../filtered.git/objects/pack/pack &&
> +	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
> +	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
> +
> +	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
> +	test -n "$commit_pack" &&
> +	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
> +	test -z "$blob_pack" &&
> +	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
> +	test -n "$blob_hash" &&
> +	blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) &&
> +	test -n "$blob_pack" &&
> +
> +	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
> +	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
> +	test -n "$blob_pack" &&
> +	blob_content=$(git -C bare.git show $blob_hash) &&
> +	test "$blob_content" = "content1"
> +'
> +
>  objdir=.git/objects
>  midx=$objdir/pack/multi-pack-index

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 9/9] gc: add `gc.repackFilterTo` config option
  2023-06-14 19:25 ` [PATCH 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-06-16  2:54   ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-16  2:54 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> A previous commit implemented the `gc.repackFilter` config option
> to specify a filter that should be used by `git gc` when
> performing repacks.
>
> Another previous commit has implemented
> `git repack --filter-to=<dir>` to specify the location of the
> packfile containing filtered out objects when using a filter.
>
> Let's implement the `gc.repackFilterTo` config option to specify
> that location in the config when `gc.repackFilter` is used.
>
> Now when `git gc` will perform a repack with a <dir> configured
> through this option and not empty, the repack process will be
> passed a corresponding `--filter-to=<dir>` argument.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---

That's an obvious follow-up on the previous step.

Thanks.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 0/9] Repack objects into separate packfiles based on a filter
  2023-06-14 21:36 ` [PATCH 0/9] Repack objects into separate packfiles based on a filter Junio C Hamano
@ 2023-06-16  3:08   ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-16  3:08 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt

Junio C Hamano <gitster@pobox.com> writes:

> Christian Couder <christian.couder@gmail.com> writes:
>
>> In some discussions, it was mentioned that such a feature, or a
>> similar feature in `git gc`, or in a new standalone command (perhaps
>> called `git prune-filtered`), should put the filtered out objects into
>> a new packfile instead of deleting them.
>>
>> Recently there were internal discussions at GitLab about either moving
>> blobs from inactive repos onto cheaper storage, or moving large blobs
>> onto cheaper storage. This lead us to rethink at repacking using a
>> filter, but moving the filtered out objects into a separate packfile
>> instead of deleting them.
>>
>> So here is a new patch series doing that while implementing the
>> `--filter=<filter-spec>` option in `git repack`.
>
> Very interesting idea, indeed, and would be very useful.
> Thanks.

Overall, I have a split feeling on the series.

One side of my brain thinks that the series does a very good job to
address the needs of those who want to partition their objects into
two classes, and the problem I saw in the series was mostly the way
it was sold (in other words, if it did not mention unbloating lazily
cloned repositories at all, I would have said "Yes!  It is an
excellent series.", and if it said "this mechanism is not meant to
be used to unbloat a lazily cloned repository, because the mechanism
does not distinguish objects that are only locally available and
objects that are retrievable from the promisor remotes, among those
that match the filter", it would have been even better)

To the other side of my brain, it smells as if the series wanted to
address the unbloating issue, but ended up with an unsatisfactory
solution, and used "partitioning objects in a full repository on the
server side" as an excuse for the resulting mechanism to still
exist, even though it is not usable for the original purpose.

Ideally, it would be great to have a mechanism that can be used for
both.  The "partitioning" can be treated as a degenerate case where
the repository does not have its upstream promisor (hence, any
object that matches the filtering criteria can be excluded from the
primary pack because there are no "not available (yet) in our
promisor" objects), while the "unbloat" case can know who its
promisors are and ask the promisors what objects, among those that
match the filtering criteria, are still available from them to
exclude only those objects from the primary pack.

In the second ideal world, we may not be ready to tackle the
unbloating issue, but "partitioning" alone may still be a useful
feature.  In that case, perhaps the series can be salvaged by
updating how the feature is sold, with some comments indicating the
future direction to extend the mechanism later.

Thanks.





^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 3/9] t/helper: add 'find-pack' test-tool
  2023-06-15 23:32   ` Junio C Hamano
@ 2023-06-21 10:40     ` Christian Couder
  2023-06-21 10:54     ` Taylor Blau
  1 sibling, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-21 10:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Fri, Jun 16, 2023 at 1:32 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > In a following commit, we will make it possible to separate objects in
> > different packfiles depending on a filter.
> >
> > To make sure that the right objects are in the right packs, let's add a
> > new test-tool that can display which packfile(s) a given object is in.
>
> This tool would be serviceable if we only are interested in checking
> just a few objects, but if we were to check many objects, I have to
> wonder if it would be more efficient to use show-index to dump the
> list of objects per pack, which should be sorted by object name, so
> it should be trivial to run "comm" with the list of objects you want
> to check.

I agree that this new tool is for checking just a few objects.

> Or if you only are checking about a dozen or so, taking one or more
> arguments from the command line and looping over them may also be
> OK.  The output format of course may have to be changed, if we were
> to go that route, though.
>
> It really depends on the granularity at which this test helper wants
> to work at, I think.

Yeah, in the previous commit implementing --print-filtered, we check
that all the objects that should be printed to stdout are indeed
printed. So later when git repack --filter=... is implemented by using
git pack-objects --print-filtered and piping the printed objects to a
regular git pack-objects command, I don't think it's necessary to test
a lot of objects, as hopefully both pack-objects commands are supposed
to work well at that point.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 1/9] pack-objects: allow `--filter` without `--stdout`
  2023-06-14 19:25 ` [PATCH 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-06-21 10:49   ` Taylor Blau
  2023-07-05  6:16     ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 10:49 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 14, 2023 at 09:25:33PM +0200, Christian Couder wrote:
> 9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
> taught `git pack-objects` to use `--filter`, but required the use of
> `--stdout` since a partial clone mechanism was not yet in place to
> handle missing objects. Since then, changes like 9e27beaa23
> (promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
> and others added support to dynamically fetch objects that were missing.
>
> Even without a promisor remote, filtering out objects can also be useful
> if we can put the filtered out objects in a separate pack, and in this
> case it also makes sense for pack-objects to write the packfile directly
> to an actual file rather than on stdout.
>
> Remove the `--stdout` requirement when using `--filter`, so that in a
> follow-up commit, repack can pass `--filter` to pack-objects to omit
> certain objects from the resulting packfile.

Makes sense.

Is there any situation in which using --stdout with --filter would be a
potential foot-gun? I am not as familiar with the partial clone
mechanism as others CC'd, so I have no idea one way or the other.

If it is unsafe in certain situations (or, at the very least, could
produce surprising behavior), it may be worthwhile to only allow
`--filter=<filter> --stdout` with some kind of
`--filter-to-stdout-is-ok` flag to indicate that the caller knows what
they are doing.

Presumably 'git repack --filter' would pass such a flag later on in the
series.

> Signed-off-by: John Cai <johncai86@gmail.com>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-pack-objects.txt | 4 ++--
>  builtin/pack-objects.c             | 8 ++------
>  2 files changed, 4 insertions(+), 8 deletions(-)

Should there be a trivial test here? I'm thinking something on the order
of writing a filtered pack to stdout and redirecting it to a file,
moving it into place, and then indexing the pack to make sure that we
got the expected set of objects.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects
  2023-06-15 22:50   ` Junio C Hamano
@ 2023-06-21 10:52     ` Taylor Blau
  2023-06-21 11:11       ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 10:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Thu, Jun 15, 2023 at 03:50:17PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
>
> > When using the `--filter=<filter-spec>` option, `git pack-objects` will
> > omit some objects from the resulting packfile(s) it produces. It could
> > be useful to know about these omitted objects though.
> >
> > For example, we might want to write these objects into a separate
> > packfile by piping them into another `git pack-objects` process.
> > Or we might want to check if these objects are available from a
> > promisor remote.
> >
> > Anyway, this patch implements a simple way to let us know about these
> > objects by simply printing their oid, one per line, on stdout when the
> > new `--print-filtered` flag is passed.
>
> Makes sense.  It is a bit sad that we have to accumulate everything
> until the end at which time we have to dump the accumulated in bulk,
> but that is a current limitation of list-objects-filter API and not
> within the scope of this change.  We may in the longer term want to
> see if we can make the collection of filtered-out objects streamable
> by replacing the .omits object array with a callback function, or do
> something along that line.

Hmm. I think it is possible to use something like `git pack-objects`'s
`--stdin-packs` mode to accomplish this without needing to keep track of
the set of discarded objects (i.e. those which don't match the filter).

IIUC, the set of objects which don't match the filter is the same as the
set of all objects in packs beforehand, differenced with the set of
objects that shows up in the pack containing objects which *do* match
the filter.

If you mark all of the "before" packs with `-` in the input to
`--stdin-packs`, and then pass along the pack containing the filtered
set without `-` (to indicate that the resulting pack should not contain
any objects which appear in that pack), I think you would end up with
the set of non-matching objects.
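
To make the set arithmetic concrete (this only sketches the difference
itself, not how `--stdin-packs` would compute it), assume both object
lists are already available as sorted arrays of hex strings.  The
function name `oid_difference()` and its signature are made up for the
illustration.

```c
#include <assert.h>
#include <string.h>

/*
 * Store in `out` every entry of `all` that does not appear in `kept`.
 * Both inputs must be sorted in strcmp() order.  `out` needs room for
 * all_nr entries.  Returns the number of entries stored.
 */
static size_t oid_difference(const char **all, size_t all_nr,
			     const char **kept, size_t kept_nr,
			     const char **out)
{
	size_t i = 0, j = 0, n = 0;

	assert(all && out);

	while (i < all_nr) {
		/* Once `kept` is exhausted, the rest of `all` is unmatched. */
		int cmp = j < kept_nr ? strcmp(all[i], kept[j]) : -1;

		if (cmp < 0) {
			out[n++] = all[i++]; /* only in the "before" set */
		} else if (cmp == 0) {
			i++;                 /* in both: it matched the filter */
			j++;
		} else {
			j++;                 /* only in `kept`; skip it */
		}
	}
	return n;
}
```

Here `all` plays the role of the objects in the packs beforehand and
`kept` the objects in the pack that matched the filter, so what lands
in `out` is the non-matching set.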

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 3/9] t/helper: add 'find-pack' test-tool
  2023-06-15 23:32   ` Junio C Hamano
  2023-06-21 10:40     ` Christian Couder
@ 2023-06-21 10:54     ` Taylor Blau
  1 sibling, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 10:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Thu, Jun 15, 2023 at 04:32:35PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
>
> > In a following commit, we will make it possible to separate objects in
> > different packfiles depending on a filter.
> >
> > To make sure that the right objects are in the right packs, let's add a
> > new test-tool that can display which packfile(s) a given object is in.
>
> This tool would be serviceable if we only are interested in checking
> just a few objects, but if we were to check many objects, I have to
> wonder if it would be more efficient to use show-index to dump the
> list of objects per pack, which should be sorted by object name, so
> it should be trivial to run "comm" with the list of objects you want
> to check.

I was going to say the exact same thing. Even if we were checking many
objects, can't we dump the output of show-index to a file, and then grep
it repeatedly? Presumably these tests are working on repositories with
tens of objects, so I doubt it matters much either way.

If we do end up taking this approach to use the test-helper instead, the
implementation seems reasonable.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 4/9] repack: refactor piping an oid to a command
  2023-06-15 23:46   ` Junio C Hamano
@ 2023-06-21 10:55     ` Taylor Blau
  2023-06-21 10:56     ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 10:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Thu, Jun 15, 2023 at 04:46:43PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
>
> > Create a new write_oid_hex_cmd() function to send an oid to the standard
> > input of a running command. This new function will be used in a
> > following commit.
> >
> > Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> > ---
> >  builtin/repack.c | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 0541c3ce15..e591c295cf 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -182,6 +182,17 @@ static void prepare_pack_objects(struct child_process *cmd,
> >  	cmd->out = -1;
> >  }
> >
> > +static void write_oid_hex_cmd(const char *oid_hex,
> > +			      struct child_process *cmd,
> > +			      const char *err_msg)
> > +{
> > +	if (cmd->in == -1 && start_command(cmd))
> > +		die("%s", err_msg);
>
> I am not sure why we would want to conflate the "if we haven't
> started the command, auto-start it upon our first attempt to write"
> in these low-level "I am designed to do one thing, which is to feed
> the object name to the process, and do it well" function.

I agree, the implementation of `write_oid_hex_cmd()` seems too magical
to me.

Perhaps there was some awkwardness with using the pre-image w.r.t some
later change? Let's see...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 4/9] repack: refactor piping an oid to a command
  2023-06-15 23:46   ` Junio C Hamano
  2023-06-21 10:55     ` Taylor Blau
@ 2023-06-21 10:56     ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-21 10:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Fri, Jun 16, 2023 at 1:46 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > Create a new write_oid_hex_cmd() function to send an oid to the standard
> > input of a running command. This new function will be used in a
> > following commit.
> >
> > Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> > ---
> >  builtin/repack.c | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 0541c3ce15..e591c295cf 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -182,6 +182,17 @@ static void prepare_pack_objects(struct child_process *cmd,
> >       cmd->out = -1;
> >  }
> >
> > +static void write_oid_hex_cmd(const char *oid_hex,
> > +                           struct child_process *cmd,
> > +                           const char *err_msg)
> > +{
> > +     if (cmd->in == -1 && start_command(cmd))
> > +             die("%s", err_msg);
>
> I am not sure why we would want to conflate the "if we haven't
> started the command, auto-start it upon our first attempt to write"
> in these low-level "I am designed to do one thing, which is to feed
> the object name to the process, and do it well" function.
>
> The caller in the original shares the same issue, so we could say
> that this patch is not creating a new problem, but this somehow
> feels it is making the existing problem even worse.

Ok. I think I will rework this patch from version 2 of this series to
remove that code. It will perhaps look like there is a bit of
duplicated code, but I don't think it will be too bad.

> And I think the error handling here shows why the API feels wrong.
> When auto-start fails, we have a message, but when write fails,
> there is no custom message---it makes as if write_oid_hex_cmd() is
> primarily about starting, which is so important relative to its
> other functionalities and deserves a custom error message, but that
> is not the message you want to be conveying.

Right.

> > +     xwrite(cmd->in, oid_hex, the_hash_algo->hexsz);
> > +     xwrite(cmd->in, "\n", 1);
>
> I would have expected that the "refactor" at least would reduce the
> number of system calls by combining these two writes into one using
> an on-stack local variable char buf[GIT_MAX_HEXSZ+1] or something.

Ok, I will change it to reduce the number of system calls.
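
For illustration only, combining the two writes could look something
like the sketch below.  It is standalone C rather than code from
builtin/repack.c: `write_oid_line()` and `MAX_HEXSZ` are made-up names
(the latter standing in for Git's GIT_MAX_HEXSZ), and plain write(2) is
used where Git would go through its own wrapper, so a short write is
simply reported as an error here.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define MAX_HEXSZ 64 /* stand-in for GIT_MAX_HEXSZ (SHA-256 hex length) */

/*
 * Send "<oid_hex>\n" to fd with a single write(2) call by assembling
 * the line in an on-stack buffer first.
 * Returns 0 on success, -1 on error or short write.
 */
static int write_oid_line(int fd, const char *oid_hex, size_t hexsz)
{
	char buf[MAX_HEXSZ + 1];
	ssize_t n;

	assert(hexsz <= MAX_HEXSZ);

	memcpy(buf, oid_hex, hexsz);
	buf[hexsz] = '\n';

	n = write(fd, buf, hexsz + 1);
	return n == (ssize_t)(hexsz + 1) ? 0 : -1;
}
```

Starting the command is deliberately left out of this function, in line
with the comment above about not mixing auto-start into the writer.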

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 5/9] repack: refactor finishing pack-objects command
  2023-06-14 19:25 ` [PATCH 5/9] repack: refactor finishing pack-objects command Christian Couder
  2023-06-16  0:13   ` Junio C Hamano
@ 2023-06-21 11:05   ` Taylor Blau
  1 sibling, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 11:05 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

> diff --git a/builtin/repack.c b/builtin/repack.c
> index e591c295cf..f1adacf1d0 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -703,6 +703,42 @@ static void remove_redundant_bitmaps(struct string_list *include,
>  	strbuf_release(&path);
>  }
>
> +static int finish_pack_objects_cmd(struct child_process *cmd,
> +				   struct string_list *names,
> +				   const char *destination)
> +{
> +	int local = 1;
> +	FILE *out;
> +	struct strbuf line = STRBUF_INIT;
> +
> +	if (destination) {
> +		const char *scratch;

Maybe stick the declaration above (and consider making it static), so
that this can become

    if (destination)
      local = skip_prefix(destination, packdir, &scratch);

without the braces. Although it might be nice to either put this behind
a "has_prefix()" convenience function (which itself owns the scratch
buffer and hides that detail from the caller), or to make skip_prefix
skip trying to assign into the buffer if given NULL.
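
A minimal sketch of the first option, assuming a simplified stand-in
for skip_prefix() so that it compiles on its own.  Both
`skip_prefix_sketch()` and `has_prefix()` are hypothetical names used
only for this illustration (Git's existing starts_with() may already
cover the same need):

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified stand-in for skip_prefix(): if `str` begins with
 * `prefix`, point *out just past the prefix and return 1; otherwise
 * leave *out alone and return 0.
 */
static int skip_prefix_sketch(const char *str, const char *prefix,
			      const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Hypothetical convenience wrapper: owns the scratch pointer so the
 * caller only sees a yes/no answer.
 */
static int has_prefix(const char *str, const char *prefix)
{
	const char *scratch;

	assert(str && prefix);
	return skip_prefix_sketch(str, prefix, &scratch);
}
```

The caller would then reduce to a single `local = has_prefix(...)`
assignment with no scratch variable in sight.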

> +		local = skip_prefix(destination, packdir, &scratch);
> +	}
> +
> +	out = xfdopen(cmd->out, "r");
> +	while (strbuf_getline_lf(&line, out) != EOF) {
> +		struct string_list_item *item;
> +
> +		if (line.len != the_hash_algo->hexsz)
> +			die(_("repack: Expecting full hex object ID lines only "
> +			      "from pack-objects."));
> +		/*
> +		 * Avoid putting packs written outside of the repository in the
> +		 * list of names.
> +		 */
> +		if (local) {

Consider moving the declaration of item into this block, since it's not
used elsewhere throughout the body of the loop.

Alternatively, if you want to leave it up there, it might be easier to
adjust the if condition to be more in line with the comment, like:

    if (!local)
      continue;

> +	ret = finish_pack_objects_cmd(&cmd, &names, NULL);

OK, we don't even bother calling skip_prefix if given a NULL
destination. I wonder if it might make sense to force the caller to
compute "is this local?" ahead of time. This one would always pass "1"
trivially, and the cruft case would depend on whether or not we are
handling the `--expire-to` option.

Thanks,
Taylor


* Re: [PATCH 5/9] repack: refactor finishing pack-objects command
  2023-06-16  0:13   ` Junio C Hamano
@ 2023-06-21 11:06     ` Taylor Blau
  2023-06-21 11:19       ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 11:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Thu, Jun 15, 2023 at 05:13:14PM -0700, Junio C Hamano wrote:
> Computing "is it local?" based on the value of "destination" feels
> it belongs to the caller (one of the callers that do need the
> computation), not to this function, especially given that the full
> value of "destination" is not even used in any other way in this
> function.  And the "is_local?" bit can instead be passed into this
> helper function as a parameter.

Hah. I had the same suggestion down-thread, but hadn't read your reply
yet. We could either make a couple of changes to skip_prefix(), or
foist the responsibility onto the caller (I tend to prefer the
latter).

Thanks,
Taylor


* Re: [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects
  2023-06-21 10:52     ` Taylor Blau
@ 2023-06-21 11:11       ` Christian Couder
  2023-06-21 11:54         ` Taylor Blau
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-21 11:11 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 12:52 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Thu, Jun 15, 2023 at 03:50:17PM -0700, Junio C Hamano wrote:

> > Makes sense.  It is a bit sad that we have to accumulate everything
> > until the end at which time we have to dump the accumulated in bulk,
> > but that is a current limitation of list-objects-filter API and not
> > within the scope of this change.  We may in the longer term want to
> > see if we can make the collection of filtered-out objects streamable
> > by replacing the .omits object array with a callback function, or do
> > something along that line.
>
> Hmm. I think it is possible to use something like `git pack-objects`'s
> `--stdin-packs` mode to accomplish this without needing to keep track of
> the set of discarded objects (i.e. those which don't match the filter).
>
> IIUC, the set of objects which don't match the filter is the same as the
> set of all objects in packs beforehand, differenced with the set of
> objects that shows up in the pack containing objects which *do* match
> the filter.
>
> If you mark all of the "before" packs with `-` in the input to
> `--stdin-packs`, and then pass along the pack containing the filtered
> set without `-` (to indicate that the resulting pack should not contain
> any objects which appear in that pack), I think you would end up with
> the set of non-matching objects.

I agree that it can be done like this, but I am not sure it's very
efficient to do it like this. When we create the pack with filtered
out objects, we know the set of objects we filtered out, so it doesn't
seem efficient to make `git pack-objects --stdin-packs` read more
packfiles or their indexes than necessary and compute that set of
objects again.

Now I haven't checked if there is a real performance difference for
large packfiles, and perhaps `git pack-objects --stdin-packs` is very
efficient. But I hope that going the way I implemented it and perhaps
using some optimization ideas that Junio suggested above, will make it
easier to improve performance in the future.


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-14 19:25 ` [PATCH 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
  2023-06-16  0:43   ` Junio C Hamano
@ 2023-06-21 11:17   ` Taylor Blau
  2023-07-05  7:18     ` Christian Couder
  1 sibling, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 11:17 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 14, 2023 at 09:25:38PM +0200, Christian Couder wrote:
> ---
>  Documentation/git-repack.txt |  5 +++
>  builtin/repack.c             | 75 ++++++++++++++++++++++++++++++++++--
>  t/t7700-repack.sh            | 16 ++++++++
>  3 files changed, 93 insertions(+), 3 deletions(-)

Having read through the implementation in the repack builtin, I am
almost certain that my suggestion earlier in the thread to implement
this in terms of 'git pack-objects --filter' and 'git pack-objects
--stdin-packs' would work.

> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index 4017157949..aa29c7e648 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -143,6 +143,11 @@ depth is 4095.
>  	a larger and slower repository; see the discussion in
>  	`pack.packSizeLimit`.
>
> +--filter=<filter-spec>::
> +	Remove objects matching the filter specification from the
> +	resulting packfile and put them into a separate packfile. See
> +	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
> +

This documentation leaves me with a handful of questions about how it
interacts with other options. Here are some:

  - What happens when you pass it with "-d"? Does it delete objects that
    didn't match the filter? Leave them alone? If the latter, should
    this combination be declared invalid instead of silently ignoring
    the user's request to delete redundant packs?

  - What happens with --max-pack-size? Does the filtered pack get split
    into multiple packs (as I think we would expect from such a
    combination)?

  - What about with `--cruft`? Does it split the cruft pack into two
    based on whether or not the unreachable object(s) matched or didn't
    match the filter?

  - What happens when passed with "--geometric"? I don't think there is
    a sensible interpretation (at least, I can't think of what it would
    mean to do "--filter=<spec> --geometric=<factor>" off the top of my
    head).

  - What about with "--write-bitmap-index"? Do we write one bitmap
    index? Two? If the latter, do we combine the packs into a MIDX
    before writing the bitmap? Should we?

I think it may be worth spelling out answers to some of these questions
in the documentation, and codifying those answers in the form of tests.

This makes me wonder whether or not this option should belong in repack
at all, or whether there should be some new special-purpose builtin that
is designed to split existing pack(s) based on whether or not they meet
some filter criteria.

Thanks,
Taylor


* Re: [PATCH 5/9] repack: refactor finishing pack-objects command
  2023-06-21 11:06     ` Taylor Blau
@ 2023-06-21 11:19       ` Christian Couder
  0 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-06-21 11:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Wed, Jun 21, 2023 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Thu, Jun 15, 2023 at 05:13:14PM -0700, Junio C Hamano wrote:
> > Computing "is it local?" based on the value of "destination" feels
> > it belongs to the caller (one of the callers that do need the
> > computation), not to this function, especially given that the full
> > value of "destination" is not even used in any other way in this
> > function.  And the "is_local?" bit can instead be passed into this
> > helper function as a parameter.
>
> Hah. I had the same suggestion down-thread, but hadn't read your reply
> yet. We could either make a couple of changes to skip_prefix(), or
> foist the responsibility onto the caller (I tend to prefer the
> latter).

Ok, I will make callers compute the "is_local?" bit and pass it to the
helper function as a parameter.
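
A rough sketch of that split, assuming simplified stand-ins for the real code: destination_is_local() and record_pack_name() are hypothetical names, and a plain array replaces the string_list used in builtin/repack.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Caller-side check: a NULL destination means the pack is written to
 * the repository's own pack directory, so it counts as local.
 * Otherwise it is local only if the path starts with "packdir".
 */
int destination_is_local(const char *destination, const char *packdir)
{
	if (!destination)
		return 1;
	return strncmp(destination, packdir, strlen(packdir)) == 0;
}

/*
 * Simplified stand-in for the helper: it only receives the
 * precomputed "local" bit and never looks at the destination itself.
 * Returns the new number of recorded names.
 */
size_t record_pack_name(const char **names, size_t nr, const char *name,
			int local)
{
	if (!local)
		return nr; /* pack written outside the repository: skip */
	names[nr] = name;
	return nr + 1;
}
```

The helper then no longer needs to know about "destination" or "packdir" at all, which is the point made in the review comments above.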


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-16  0:43   ` Junio C Hamano
@ 2023-06-21 11:20     ` Taylor Blau
  2023-06-21 15:04       ` Christian Couder
  2023-06-21 14:40     ` Christian Couder
  1 sibling, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 11:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Thu, Jun 15, 2023 at 05:43:27PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
>
> > After cloning with --filter=<filter-spec>, for example to avoid
> > getting unneeded large files on a user machine, it's possible
> > that some of these large files still get fetched for some reasons
> > (like checking out old branches) over time.
> >
> > In this case the repo size could grow too much for no good reason and a
> > way to filter out some objects would be useful to remove the unneeded
> > large files.
>
> Makes sense.
>
> If we repack without these objects, when the repository has a
> promisor remote, we should be able to rely on that remote to supply
> them on demand, once we need them again, no?

I think in theory, yes, but this patch series (at least up to this
point) does not seem to implement that functionality by marking the
relevant remote(s) as promisors, if they weren't already.

> [...] It does smell somewhat similar to the cruft packs but not
> really (the choice over there is between exploding to loose and
> keeping in a pack, and never involves loss of objects).

Indeed. `pack-objects`'s `--stdin-packs` and `--cruft` work similarly,
and I believe that we could use `--stdin-packs` here instead of having
to store the list of objects which don't meet the filter's spec. IOW, I
think that this similarity is no coincidence...

Thanks,
Taylor


* Re: [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-14 19:25 ` [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
  2023-06-16  2:21   ` Junio C Hamano
@ 2023-06-21 11:49   ` Taylor Blau
  2023-06-21 12:08     ` Christian Couder
  2023-07-05  6:19     ` Christian Couder
  1 sibling, 2 replies; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 11:49 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 14, 2023 at 09:25:40PM +0200, Christian Couder wrote:
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index aa29c7e648..070dd22610 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -148,6 +148,12 @@ depth is 4095.
>  	resulting packfile and put them into a separate packfile. See
>  	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
>
> +--filter-to=<dir>::
> +	Write the pack containing filtered out objects to the
> +	directory `<dir>`. This can be used for putting the pack on a
> +	separate object directory that is accessed through the Git
> +	alternates mechanism. Only useful with `--filter`.

Here you say "only useful with --filter", but...

> @@ -1073,8 +1077,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		strvec_push(&cmd.args, "--incremental");
>  	}
>
> -	if (po_args.filter)
> -		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
> +	if (po_args.filter) {
> +		if (!filter_to)
> +			filter_to = packtmp;
> +		prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
> +	}

Would you want an "} else if (filter_to)" here to die and show the usage
message, since --filter-to needs --filter? Or maybe it should imply
--filter-to.
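
As a sketch of the first alternative (rejecting --filter-to without --filter), assuming a standalone helper rather than the actual option handling in cmd_repack(); the function name and the error text are hypothetical:

```c
#include <stddef.h>

/*
 * Picks the directory for the pack of filtered-out objects.
 * "filter" and "filter_to" stand for the values of --filter and
 * --filter-to (NULL when not given); "packtmp" is the default.
 * On an invalid combination, *err is set to a message and NULL is
 * returned. NULL with *err == NULL means no filtering was requested.
 */
const char *filter_destination(const char *filter, const char *filter_to,
			       const char *packtmp, const char **err)
{
	*err = NULL;
	if (!filter) {
		if (filter_to)
			*err = "--filter-to requires --filter";
		return NULL;
	}
	return filter_to ? filter_to : packtmp;
}
```

In the real command the error branch would presumably call die() or show the usage message instead of returning a string.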

Thanks,
Taylor


* Re: [PATCH 2/9] pack-objects: add `--print-filtered` to print omitted objects
  2023-06-21 11:11       ` Christian Couder
@ 2023-06-21 11:54         ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 11:54 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 01:11:38PM +0200, Christian Couder wrote:
> On Wed, Jun 21, 2023 at 12:52 PM Taylor Blau <me@ttaylorr.com> wrote:
> >
> > On Thu, Jun 15, 2023 at 03:50:17PM -0700, Junio C Hamano wrote:
>
> > > Makes sense.  It is a bit sad that we have to accumulate everything
> > > until the end at which time we have to dump the accumulated in bulk,
> > > but that is a current limitation of list-objects-filter API and not
> > > within the scope of this change.  We may in the longer term want to
> > > see if we can make the collection of filtered-out objects streamable
> > > by replacing the .omits object array with a callback function, or do
> > > something along that line.
> >
> > Hmm. I think it is possible to use something like `git pack-objects`'s
> > `--stdin-packs` mode to accomplish this without needing to keep track of
> > the set of discarded objects (i.e. those which don't match the filter).
> >
> > IIUC, the set of objects which don't match the filter is the same as the
> > set of all objects in packs beforehand, differenced with the set of
> > objects that shows up in the pack containing objects which *do* match
> > the filter.
> >
> > If you mark all of the "before" packs with `-` in the input to
> > `--stdin-packs`, and then pass along the pack containing the filtered
> > set without `-` (to indicate that the resulting pack should not contain
> > any objects which appear in that pack), I think you would end up with
> > the set of non-matching objects.
>
> I agree that it can be done like this, but I am not sure it's very
> efficient to do it like this. When we create the pack with filtered
> out objects, we know the set of objects we filtered out, so it doesn't
> seem efficient to make `git pack-objects --stdin-packs` read more
> packfiles or their indexes than necessary and compute that set of
> objects again.

Discovering that an object appears in one of the packs whose objects
we're excluding from the resulting pack is extremely cheap, since we
have a cache of just those packs (see the call to
`has_object_kept_pack()` in `want_found_object()`).

I'd think that in small cases the performance is probably worse, though
the working set would be small enough for the differences to not matter
that much in absolute terms.

In cases with many objects, storing all of their OIDs in memory would be
a pain.
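
To make the trade-off concrete, here is a rough sketch of the set-difference idea: each candidate ID is looked up in a sorted table of IDs from the excluded pack (a stand-in for a pack index lookup) and handed to a callback right away, so no list of filtered-out objects has to be accumulated. The function names, the fixed 40-hex-digit ID width and the flat buffers are assumptions for illustration; this is not how has_object_kept_pack() is implemented.

```c
#include <stdlib.h>
#include <string.h>

#define HEXSZ 40 /* assumed SHA-1 hex length */

static int cmp_hex(const void *a, const void *b)
{
	return memcmp(a, b, HEXSZ);
}

/*
 * "excluded" points at nr IDs of HEXSZ bytes each, sorted ascending.
 * Returns 1 if "oid" is among them.
 */
int in_excluded_packs(const char *oid, const char *excluded, size_t nr)
{
	return bsearch(oid, excluded, nr, HEXSZ, cmp_hex) != NULL;
}

/*
 * Walks the candidates once and calls "emit" (if non-NULL) for each
 * ID that is not in the excluded table. Returns how many were emitted.
 */
size_t emit_difference(const char *candidates, size_t nr_cand,
		       const char *excluded, size_t nr_excl,
		       void (*emit)(const char *oid, void *data), void *data)
{
	size_t i, emitted = 0;

	for (i = 0; i < nr_cand; i++) {
		const char *oid = candidates + i * HEXSZ;

		if (in_excluded_packs(oid, excluded, nr_excl))
			continue;
		if (emit)
			emit(oid, data);
		emitted++;
	}
	return emitted;
}
```

Here the candidates would be all objects in the packs that existed beforehand, and the excluded table would hold the objects of the pack that matched the filter.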

Thanks,
Taylor


* Re: [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-21 11:49   ` Taylor Blau
@ 2023-06-21 12:08     ` Christian Couder
  2023-06-21 12:25       ` Taylor Blau
  2023-07-05  6:19     ` Christian Couder
  1 sibling, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-21 12:08 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 1:49 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jun 14, 2023 at 09:25:40PM +0200, Christian Couder wrote:

> > +--filter-to=<dir>::
> > +     Write the pack containing filtered out objects to the
> > +     directory `<dir>`. This can be used for putting the pack on a
> > +     separate object directory that is accessed through the Git
> > +     alternates mechanism. Only useful with `--filter`.
>
> Here you say "only useful with --filter", but...
>
> > @@ -1073,8 +1077,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >               strvec_push(&cmd.args, "--incremental");
> >       }
> >
> > -     if (po_args.filter)
> > -             prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
> > +     if (po_args.filter) {
> > +             if (!filter_to)
> > +                     filter_to = packtmp;
> > +             prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
> > +     }
>
> Would you want an "} else if (filter_to)" here to die and show the usage
> message, since --filter-to needs --filter? Or maybe it should imply
> --filter-to.

In the doc for --expire-to=<dir> there is "Only useful with `--cruft
-d`" and I don't think there is a check to see if --cruft and -d have
been passed when --expire-to is passed. So I am not sure if it's
better to be consistent with --expire-to or not.


* Re: [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-21 12:08     ` Christian Couder
@ 2023-06-21 12:25       ` Taylor Blau
  2023-06-21 16:44         ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-06-21 12:25 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 02:08:38PM +0200, Christian Couder wrote:
> > > @@ -1073,8 +1077,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> > >               strvec_push(&cmd.args, "--incremental");
> > >       }
> > >
> > > -     if (po_args.filter)
> > > -             prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
> > > +     if (po_args.filter) {
> > > +             if (!filter_to)
> > > +                     filter_to = packtmp;
> > > +             prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
> > > +     }
> >
> > Would you want an "} else if (filter_to)" here to die and show the usage
> > message, since --filter-to needs --filter? Or maybe it should imply
> > --filter-to.
>
> In the doc for --expire-to=<dir> there is "Only useful with `--cruft
> -d`" and I don't think there is a check to see if --cruft and -d have
> been passed when --expire-to is passed. So I am not sure if it's
> better to be consistent with --expire-to or not.

TBH, I don't think that my decision at the time to silently accept
--expire-to without --cruft was the right one. It should at least
require --cruft, or imply it. It doesn't make a ton of sense to use
without -d, but doing so is OK, so I wouldn't consider that a failing
condition.

In other words, I would be fine with something like:

--- 8< ---
diff --git a/builtin/repack.c b/builtin/repack.c
index 0541c3ce15..1890f283ee 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -866,6 +866,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");

+	/* --expire-to implies cruft */
+	if (expire_to)
+		pack_everything |= PACK_CRUFT;
+
 	if (pack_everything & PACK_CRUFT) {
 		pack_everything |= ALL_INTO_ONE;

--- >8 ---

But that sounds like a good candidate for some #leftoverbits.

In the meantime, I would be absolutely fine with deviating from the
existing behavior of --expire-to w.r.t --cruft.

Thanks,
Taylor


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-16  0:43   ` Junio C Hamano
  2023-06-21 11:20     ` Taylor Blau
@ 2023-06-21 14:40     ` Christian Couder
  2023-06-21 16:53       ` Junio C Hamano
  1 sibling, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-21 14:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Fri, Jun 16, 2023 at 2:43 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > After cloning with --filter=<filter-spec>, for example to avoid
> > getting unneeded large files on a user machine, it's possible
> > that some of these large files still get fetched for some reasons
> > (like checking out old branches) over time.
> >
> > In this case the repo size could grow too much for no good reason and a
> > way to filter out some objects would be useful to remove the unneeded
> > large files.
>
> Makes sense.
>
> If we repack without these objects, when the repository has a
> promisor remote, we should be able to rely on that remote to supply
> them on demand, once we need them again, no?

Yeah, sure.

> > Deleting objects right away could corrupt a repo though,...
>
> Hmph, could you elaborate why it is the case?  Isn't it the whole
> point to have promisor remote and use a lazy clone with the --filter
> option, so that objects that _ought_ to exist from connectivity's
> point of view _are_ allowed to be missing because the promisor
> promises to make them available on-demand?
>
>         Side note: I think I know the answer. While trying to remove
>         UNNEEDED large files, doing so may discard NEEDED large
>         files when done carelessly (e.g. the file may have been
>         created locally and haven't been pushed back).

Yeah, right.

>         But (1) if
>         that is the problem, perhaps we should be more careful in
>         the first place?

Yeah, but earlier, when we implemented a `repack --filter=...` that
removed objects, telling users in the docs to be very careful and
implementing safeguards didn't seem safe enough to reviewers. They
said that the feature would anyway give users too easy a way to shoot
themselves in the foot.

>        (2) if it inherently is impossible to tell
>         which ones are unneeded reliably, the reason why it is
>         impossible, and the reason why "try sifting into two bins,
>         one that we _think_ are unneeded and another for the rest,
>         and verify what we _thought_ are unneeded are all available
>         from the promisor remote" is the best we can do, must be
>         described, I think.

You mean described in the `repack --filter=` doc? Yeah, I can describe
this use case in the doc, but see below.

> > ... so it might be
> > better to put those objects into a separate packfile instead of
> > deleting them. The separate pack could then be removed after checking
> > that all the objects in it are still available on a promisor remote it
> > can access.
>
> Surely, sifting the objects into two bins (i.e. those that we
> wouldn't have received if we cloned from the promisor remote just
> now, which are prunable, and those that we cannot lose because the
> promisor remote would not have them, e.g. we created them and have
> not pushed them to the remote yet) without removing anything would
> be safe, but if the result of such sifting must be verified, doesn't
> it indicate that the sifting step was buggy or misdesigned?

It might indicate that we prefer to be safe, do things in different
steps and not provide an easy way for users to shoot their own foot.
For example it seems pretty safe to do things like this:

  1) put all the objects we think should be on the promisor remote in
a separate packfile
  2) start checking that each object in that packfile is available on
the promisor remote
  3) if an object in that packfile isn't on the promisor remote, try
to send it there
  4) if we couldn't send the object, error out
  5) if we haven't errored out after checking all the objects in the
packfile, it means all these objects are now available from the
promisor remote and we can safely delete the packfile

The above steps can be done while new objects are created on the repo,
or fetched, or pushed into the repo. And, at least for now, it would
be done by a custom script, so users writing and installing it should
know what they are doing and would hopefully not complain that we
provided an easy way for them to shoot their foot.
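
The checking part of such a script, steps 2) to 5) above, could be sketched roughly as follows. Everything here is hypothetical: the callback types, the function names and the in-memory `fake_remote` only stand in for whatever mechanism would really query or update a promisor remote.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callbacks; each returns non-zero on success. */
typedef int (*remote_has_fn)(const char *oid, void *data);
typedef int (*remote_send_fn)(const char *oid, void *data);

/*
 * Returns 1 if every object of the separate pack is (now) available
 * from the remote, so the pack may be deleted. Returns 0 as soon as
 * one missing object could not be sent.
 */
int pack_is_safe_to_delete(const char *const *oids, size_t nr,
			   remote_has_fn has_fn, remote_send_fn send_fn,
			   void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (has_fn(oids[i], data))
			continue;		/* step 2: already there */
		if (!send_fn(oids[i], data))	/* step 3: try to send it */
			return 0;		/* step 4: error out */
	}
	return 1;				/* step 5: safe to delete */
}

/* Tiny in-memory "remote", for illustration only. */
struct fake_remote {
	const char *oids[8];
	size_t nr;
	int accept_pushes;
};

int fake_has(const char *oid, void *data)
{
	struct fake_remote *r = data;
	size_t i;

	for (i = 0; i < r->nr; i++)
		if (!strcmp(r->oids[i], oid))
			return 1;
	return 0;
}

int fake_send(const char *oid, void *data)
{
	struct fake_remote *r = data;

	if (!r->accept_pushes || r->nr == 8)
		return 0;
	r->oids[r->nr++] = oid;
	return 1;
}
```

Note that the sketch stops at the first object it cannot send, so the separate packfile is kept whenever anything is in doubt.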

If we don't even document the above steps in the --filter=... doc,
users are even less likely to attempt this, and so less likely to end
up with a script that gets it wrong. So even if I could document it in
version 2, I am not sure I should.

>  It does
> not sound like a very good justification to save them in a separate
> packfile.  It does smell somewhat similar to the cruft packs but not
> really (the choice over there is between exploding to loose and
> keeping in a pack, and never involves loss of objects).

If we are still worried about possible loss of objects, I am Ok with
not talking at all about use cases involving possible loss of objects.

> > Also splitting a packfile into 2 packs depending on a filter could be
> > useful in other usecases. For example some large blobs might take a lot
> > of precious space on fast storage while they are rarely accessed, and
> > it could make sense to move them in a separate cheaper, though slower,
> > storage.
>
> This one, outside the context of partial clone client, does make
> tons of sense.

Ok, so perhaps it is enough to justify this feature and patch series.
And I can just avoid talking about other use cases at all?

> I guess what I suspect is that this option, while it would be very
> useful for the "in other usecases" scenario above, may not become
> all that useful in the "our lazy clone got bloated and we want to
> trim objects we know we can retrieve from the promisor remote again
> if necessary" scenario, until the repack machinery learns to use an
> extra piece of information (namely "these are objects that we can
> fetch from the promisor remote") at the same time.

Yeah, perhaps we should wait for a command or a repack option or some
helper scripts to be able to perform steps 2) to 4) or 2) to 5) above
before talking about use cases involving a promisor remote.

On the other hand, it's possible to imagine other steps than the steps
2) to 4) described above. For example, if we want to repack on a
server where new large blobs can hardly be created and where there is
a receive hook that automatically sends all the large blobs to a
promisor remote as soon as they are received, we might not need steps
3) and 4) to send objects to the promisor remote. Just checking that
they are on the promisor remote might be enough.

Also even if we think we should have features covering all the 5
steps, should we cover all the ways blobs could be sent to the
promisor remote as part of step 3)? Some people or server platforms
might want to use git for that purpose, but others might prefer for
example FTP or plain HTTP(S) so that a transfer can be restarted if it
fails.

So should we really wait until we have all possible such use cases
covered by some features or scripts, or not? When does it become Ok to
talk about this? And then how much is it Ok to talk about this?

> > This commit implements a new `--filter=<filter-spec>` option in
> > `git repack` that moves filtered out objects into a separate pack.
> >
> > This is done by reading filtered out objects from `git pack-objects`'s
> > output and piping them into a separate `git pack-objects` process that
> > will put them into a separate packfile.
>
> So, for example, you may say "no blobs" in the filter, and while
> packing the local repository with the filter, resulting in a pack
> that exclude all blobs, we will learn what blob objects we did not
> pack into that packfile.  We can pack them into a separate one, and
> most of the blobs are what we could retrieve again from the promisor
> remote, but some of the blobs are what we locally created ourselves
> and haven't pushed back to the promisor remote yet.  Now what?  My
> earlier suspicion that this mechanism may not be all that useful for
> the "slim bloated lazy clone" comes from that I cannot think of a
> good answer to this "Now what?" question---my naive solution would
> involve enumerating the objects in that "separate packfile" that is
> a mixture of precious ones and expendable ones, and then learning
> which ones are precious, and creating a new pack that is a subset of
> that "separate packfile" with only the precious ones.  But if I do
> so, I do not think we need this new mechanism that seems to go only
> the half-way.

I hope the above 2) to 5) steps and related explanations are a good
answer to the "Now what?" question.

Thanks,
Christian.


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-21 11:20     ` Taylor Blau
@ 2023-06-21 15:04       ` Christian Couder
  2023-06-22 11:05         ` Taylor Blau
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-21 15:04 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 1:20 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Thu, Jun 15, 2023 at 05:43:27PM -0700, Junio C Hamano wrote:
> > Christian Couder <christian.couder@gmail.com> writes:
> >
> > > After cloning with --filter=<filter-spec>, for example to avoid
> > > getting unneeded large files on a user machine, it's possible
> > > that some of these large files still get fetched for some reasons
> > > (like checking out old branches) over time.
> > >
> > > In this case the repo size could grow too much for no good reason and a
> > > way to filter out some objects would be useful to remove the unneeded
> > > large files.
> >
> > Makes sense.
> >
> > If we repack without these objects, when the repository has a
> > promisor remote, we should be able to rely on that remote to supply
> > them on demand, once we need them again, no?
>
> I think in theory, yes, but this patch series (at least up to this
> point) does not seem to implement that functionality by marking the
> relevant remote(s) as promisors, if they weren't already.

Yeah, it's not part of this patch series to implement all the features
that could be useful in the case of promisor remotes. This patch
series only hopes to implement a `repack --filter=...` option that can
help in a number of different use cases. I'm open to opinions about
whether or not the doc and commit messages should talk, and how much,
about use cases related to promisor remotes.

> > [...] It does smell somewhat similar to the cruft packs but not
> > really (the choice over there is between exploding to loose and
> > keeping in a pack, and never involves loss of objects).
>
> Indeed. `pack-objects`'s `--stdin-packs` and `--cruft` work similarly,
> and I believe that we could use `--stdin-packs` here instead of having
> to store the list of objects which don't meet the filter's spec. IOW, I
> think that this similarity is no coincidence...

Yeah, I agree that we could use `--stdin-packs` to implement `repack
--filter=...`. I am just not sure it's the best path forward
performance-wise in the long run, so others' opinions about that are
welcome.

Also, as Junio said, this patch series is not responsible for the fact
that traverse_commit_list_filtered() stores oids into an oidset
instead of using a callback function. Fixing this would likely avoid
accumulating oids in memory. And creating a packfile by sending oids
into pack-objects is something that is already done by
repack_promisor_objects(). So even if `--filter=...` is not reusing
`--stdin-packs`, it is still reusing a lot of existing mechanisms.

Thanks,
Christian.


* Re: [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-21 12:25       ` Taylor Blau
@ 2023-06-21 16:44         ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-21 16:44 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Taylor Blau <me@ttaylorr.com> writes:

> In other words, I would be fine with something like:
>
> --- 8< ---
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 0541c3ce15..1890f283ee 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -866,6 +866,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>  		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
>
> +	/* --expire-to implies cruft */
> +	if (expire_to)
> +		pack_everything |= PACK_CRUFT;
> +
>  	if (pack_everything & PACK_CRUFT) {
>  		pack_everything |= ALL_INTO_ONE;
>
> --- >8 ---
>
> But that sounds like a good candidate for some #leftoverbits.

It does.  Thanks.



* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-21 14:40     ` Christian Couder
@ 2023-06-21 16:53       ` Junio C Hamano
  2023-06-22  8:39         ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-06-21 16:53 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> It might indicate that we prefer to be safe, do things in different
> steps and not provide an easy way for users to shoot their own foot.
> For example it seems pretty safe to do things like this:
>
>   1) put all the objects we think should be on the promisor remote in
> a separate packfile
>   2) start checking that each object in that packfile is available on
> the promisor remote
>   3) if an object in that packfile isn't on the promisor remote, try
> to send it there
>   4) if we couldn't send the object, error out
>   5) if we haven't errored out after checking all the objects in the
> packfile, it means all these objects are now available from the
> promisor remote and we can safely delete the packfile

I may be missing something, but to me, the above sounds more like a
tail wagging the dog.

Instead of saying "while repacking, we'll create the new pack with
the objects that we suspect that we cannot re-fetch from the
promisor (allowing false positives for safety), and store the rest
in a backup pack (that can immediately be discarded)", the above
says "while repacking, we'll create the new pack with objects that
match the filter, and store the rest to another pack".  But because
the object selection criteria used in the latter is not something
with practical/useful meaning, in other words, it does not exactly
match what we want, we fill the gaps between what we want (i.e. sift
the objects into "refetchable" and "other" bins) and what we
happened to have implemented (i.e. sift the objects into "match
filter" and "other" bins) by sending the objects that we _should_
have included in the new pack (i.e. "not refetchable") to the
promisor to make them refetchable.

I do not know what to think about that.  I do not think there is
even a way to guarantee that the push done for 3) will always be
taken and still leave the resulting promisor usable (e.g.  we can
make them connected by coming up with a random new ref to point at
these "we are sending these only because we failed to include them
in the set of objects we should consider local" objects, but then
how would we avoid bloating the refs at the promisor remote side,
which now has become a "dumping ground", rather than holding the
objects needed for histories that project participants care about?).

As an argument to salvage this series as (one of the possible
ingredients to) a solution to "slim down a bloated lazy clone"
problem, it sounds a bit weak.

Thanks.


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-21 16:53       ` Junio C Hamano
@ 2023-06-22  8:39         ` Christian Couder
  2023-06-22 18:32           ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-06-22  8:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 6:53 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > It might indicate that we prefer to be safe, do things in different
> > steps and not provide an easy way for users to shoot their own foot.
> > For example it seems pretty safe to do things like this:
> >
> >   1) put all the objects we think should be on the promisor remote in
> > a separate packfile
> >   2) start checking that each object in that packfile is available on
> > the promisor remote
> >   3) if an object in that packfile isn't on the promisor remote, try
> > to send it there
> >   4) if we couldn't send the object, error out
> >   5) if we haven't errored out after checking all the objects in the
> > packfile, it means all these objects are now available from the
> > promisor remote and we can safely delete the packfile
>
> I may be missing something, but to me, the above sounds more like a
> tail wagging the dog.
>
> Instead of saying "while repacking, we'll create the new pack with
> the objects that we suspect that we cannot re-fetch from the
> promisor (allowing false positives for safety), and store the rest
> in a backup pack (that can immediately be discarded)",

This might be a good idea, but what if users prefer to send to a
promisor remote the objects that should be on that promisor remote as
soon as possible, instead of keeping them on the local machine where
they take up possibly valuable space for no good reason?

My point is that there are a lot of different strategies that people
operating with a promisor remote could adopt, so it's better to
iteratively give them building blocks that can help them instead of
trying to find and implement right away the best solution for a
special use case or for every use case.

> the above
> says "while repacking, we'll create the new pack with objects that
> match the filter, and store the rest to another pack".  But because
> the object selection criteria used in the latter is not something
> with practical/useful meaning, in other words, it does not exactly
> match what we want,

"What we want" depends on the strategy chosen to manage objects on
promisor remotes and I am not sure that the strategy you mention is
always better than the example strategy I talked about. For some users
it might be better, for others it might not.

> we fill the gaps between what we want (i.e. sift
> the objects into "refetchable" and "other" bins) and what we
> happened to have implemented (i.e. sift the objects into "match
> filter" and "other" bins) by sending the objects that we _should_
> have included in the new pack (i.e. "not refetchable") to the
> promisor to make them refetchable.
>
> I do not know what to think about that.  I do not think there is
> even a way to guarantee that the push done for 3) will always be
> taken and still leave the resulting promisor usable (e.g.  we can
> make them connected by coming up with a random new ref to point at
> these "we are sending these only because we failed to include them
> in the set of objects we should consider local" objects, but then
> how would we avoid bloating the refs at the promisor remote side,
> which now has become a "dumping ground", rather than holding the
> objects needed for histories that project participants care about?).

There are some configurations where users never want to delete any git
object. In those cases it doesn't matter if the promisor remote is a
"dumping ground". Users might just want a promisor remote to keep all
the large files that have ever been pushed into the repo, to save more
precious space on the machine hosting the regular repo.

There are configurations where users can have guarantees that the push
done for 3) will work with a very high probability so that the example
strategy I talked about can work reliably enough.

The example strategy I talked about is just one example where having
repack --filter work like it does in this patch series can be useful
and safe. I don't claim that it is always the best strategy, or that
some users wouldn't prefer another strategy that suits them better. If
that's the case, perhaps they can implement a different option
that just checks that filtered out objects are indeed available from a
promisor remote and then just omit these objects from the resulting
pack. In fact I would have nothing against such an option, and I might
even implement it myself one day (no promise though).
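
To illustrate the kind of check I mean, here is a rough sketch. It
assumes that the "promisor remote" is reachable as a local repository
path, which is a simplification made only for this example, and
`pack_objects_available` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if every object listed in the pack
# index file $1 also exists in the repository at $2 (a local path that
# stands in for the promisor remote in this sketch).
pack_objects_available () {
	git -C "$2" show-index <"$1" |
	(
		# show-index prints "<offset> <oid> (<crc>)" for each object
		while read -r offset oid rest
		do
			git -C "$2" cat-file -e "$oid" || exit 1
		done
	)
}

# Example call (placeholder paths):
#   pack_objects_available filtered/pack-1234.idx /srv/promisor.git
```

Only when this succeeds would it be reasonable to consider removing the
packfile. Checking a real remote over the network would need something
else than `cat-file -e`.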

Right now they have nearly no helpful command (except perhaps using
pack-objects directly), so even if this is not the best possible help
in all use cases, I am just saying that this can be useful in _some_
cases.

> As an argument to salvage this series as (one of the possible
> ingredients to) a solution to "slim down a bloated lazy clone"
> problem, it sounds a bit weak.

I don't quite agree, but anyway I will use this argument less in
version 2 of this patch series, and I will talk more about the
argument that `repack --filter=...` allows users to put packfiles
containing some kinds of objects (like large blobs or all the blobs)
on cheaper disks (when not using a promisor remote).

Thanks.


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-21 15:04       ` Christian Couder
@ 2023-06-22 11:05         ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-06-22 11:05 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 05:04:48PM +0200, Christian Couder wrote:
> On Wed, Jun 21, 2023 at 1:20 PM Taylor Blau <me@ttaylorr.com> wrote:
> >
> > On Thu, Jun 15, 2023 at 05:43:27PM -0700, Junio C Hamano wrote:
> > > Christian Couder <christian.couder@gmail.com> writes:
> > >
> > > > After cloning with --filter=<filter-spec>, for example to avoid
> > > > getting unneeded large files on a user machine, it's possible
> > > > that some of these large files still get fetched for some reasons
> > > > (like checking out old branches) over time.
> > > >
> > > > In this case the repo size could grow too much for no good reason and a
> > > > way to filter out some objects would be useful to remove the unneeded
> > > > large files.
> > >
> > > Makes sense.
> > >
> > > If we repack without these objects, when the repository has a
> > > promisor remote, we should be able to rely on that remote to supply
> > > them on demand, once we need them again, no?
> >
> > I think in theory, yes, but this patch series (at least up to this
> > point) does not seem to implement that functionality by marking the
> > relevant remote(s) as promisors, if they weren't already.
>
> Yeah, it's not part of this patch series to implement all the features
> that could be useful in the case of promisor remotes. This patch
> series only hopes to implement a `repack --filter=...` option that can
> help in a number of different use cases. I'm open to opinions about
> whether or not the doc and commit messages should talk, and how much,
> about use cases related to promisor remotes.
>
> > > [...] It does smell somewhat similar to the cruft packs but not
> > > really (the choice over there is between exploding to loose and
> > > keeping in a pack, and never involves loss of objects).
> >
> > Indeed. `pack-objects`'s `--stdin-packs` and `--cruft` work similarly,
> > and I believe that we could use `--stdin-packs` here instead of having
> > to store the list of objects which don't meet the filter's spec. IOW, I
> > think that this similarity is no coincidence...
>
> Yeah, I agree that we could use `--stdin-packs` to implement `repack
> --filter=...`. I am just not sure it's the best path forward
> performance wise in the long run. So others' opinions are welcome
> about that.

I think it would almost certainly have comparable performance in most
cases, and significantly better performance in large repositories. IIUC,
the current system has to remember the OID of every object which did not
pass the filter, and then construct a pack containing just those
objects.

It would be nice from a memory-savings perspective to not have to
remember these OIDs. But it also just seems error prone to me to do so:
what if we lose an OID, or reorder the list?

I dunno. I feel pretty strongly that this should be implemented in
terms of:

  - Write a filtered pack.
  - Construct the list of existing packs (marked with '-') and the
    filtered pack.
  - Pass that as input to `git pack-objects --stdin-packs`
  - If '-d' given, delete any existing pack(s).

Thanks,
Taylor


* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-22  8:39         ` Christian Couder
@ 2023-06-22 18:32           ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-06-22 18:32 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

>> I may be missing something, but to me, the above sounds more like a
>> tail wagging the dog.
> ...
> This might be a good idea, but what if users prefer to send to a
> promisor remote the objects that should be on that promisor remote as
> soon as possible, instead of keeping them on the local machine where
> they take up possibly valuable space for no good reason?
> ...
> There are some configurations where users never want to delete any git
> object. In those cases it doesn't matter if the promisor remote is a
> "dumping ground".

When one says "everything" in these sentences, I doubt one
necessarily means "everything".  A topic one works on will have
iterations that are never pushed out, a topic one started may not
even get to the state that is pushable to the central server.  But
the objects that need to support such a topic (and its historical
versions in its reflog) would need to be retained until they are
expired.

Certainly, by pushing even such objects, you can say "here is a pack
with filter=blob:none, and because I send every blob I create
locally to the promisor every time, I can always refetch what is not
in it by definition".

But is that a good use of everybody's resources?  The key phrase in
what I said was "... and still leave the resulting promisor usable".

The promisor remote is in the unfortunate and unenviable position
that it cannot garbage collect anything because there may be
somebody who is still depending on such an object nobody planned to
use, but there is no mechanism to let it find out which ones are in
active use (or if you added some recently that I am forgetting, it
would change the equation---please remind me if that is the case).

So I would imagine that it would be fairly high in the priority list
of server operators and project leads to make sure their promisor
remotes do not become a true "dumping ground".

For "trim a bloated lazy clone" problem, I suspect that you would
need to know what is currently re-fetchable from the promisor and
drop those objects from your local repository, and the computation
of what is currently re-fetchable would certainly involve the filter
specification you had with the promisor.  The remote-tracking
branches you have for the promisor would serve as the other source
of input to perform the computation.
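
As a rough illustration of the inputs to such a computation (a sketch,
not something this series implements): `git rev-list` can already
report which objects reachable from the remote-tracking branches a
given filter would omit. The helper name and its arguments below are
made up for the example.

```shell
# Hypothetical helper: list the objects reachable from the
# remote-tracking branches of remote $3 in repository $1 that the
# filter $2 would omit.
list_omitted_by_filter () {
	git -C "$1" rev-list --objects --filter="$2" --filter-print-omitted \
		--remotes="$3" |
	# omitted objects are printed with a leading '~'
	sed -n 's/^~//p'
}

# Example call (placeholders): list_omitted_by_filter . blob:none origin
```

This only yields candidates. It does not prove that the promisor can
still serve them.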

For "partition local and complete object store" problem, using
filter specification to sift the objects into two bins (those that
match and the rest), as the code changes in the series implement,
may be a useful mechanism.  I briefly had to wonder if partitioning
into two (and not arbitrary number N) bins is sufficient, but did
not think of a scenario where we would benefit from 3 bins more than
having 2 bins offhand.

Thanks.


* [PATCH v2 0/8] Repack objects into separate packfiles based on a filter
  2023-06-14 19:25 [PATCH 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                   ` (9 preceding siblings ...)
  2023-06-14 21:36 ` [PATCH 0/9] Repack objects into separate packfiles based on a filter Junio C Hamano
@ 2023-07-05  6:08 ` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
                     ` (8 more replies)
  10 siblings, 9 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` removed the
filtered out objects altogether which was considered very dangerous
even though we implemented different safety checks in some of the
latter series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     objects can be manually removed after checking that all the objects
     it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)
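
For use case 1), the alternates mechanism mentioned above boils down to
listing the other object directory in `objects/info/alternates`. A
minimal sketch, where `add_alternate` is a hypothetical helper and the
paths in the example call are placeholders:

```shell
# Hypothetical helper: make the objects stored in the object directory
# $2 (an absolute path) visible to the repository at $1.
add_alternate () {
	gitdir=$(git -C "$1" rev-parse --absolute-git-dir) &&
	mkdir -p "$gitdir/objects/info" &&
	printf '%s\n' "$2" >>"$gitdir/objects/info/alternates"
}

# Example call (placeholder paths):
#   add_alternate /srv/repo.git /mnt/cheap/repo.git/objects
```

As long as the alternate stays reachable, objects in packfiles stored
there remain accessible from the main repository.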

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also, since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 1

Thanks to Junio and Taylor who reviewed version 1! The changes are the
following:

- I think that in the long run it might have been better for
  performance reasons to implement the `--filter=...` in `git repack`
  the way it was done in version 1.

  (It was done by first implementing `git pack-objects
  --print-filtered` to print objects omitted by a `--filter=...`
  option, then having `git repack --filter=...` launch such a command,
  read filtered out objects from it, and pipe them into a separate
  `git pack-objects` process that would put them into a separate
  packfile.)

  Anyway our list-objects-filter API is currently not well suited for
  that as it doesn't have a way to output the oids of the filtered out
  objects while the filtering is happening. So version 1 had to get
  them from an oidset afterwards, which could use a lot of memory.

  As I don't want to have to work on improving the list-objects-filter
  API right now, I think it's better for now to just do as Taylor
  suggested, and as the cruft code is already doing, which is to
  use the `--stdin-packs` option of `git pack-objects` and to pass it
  packfile names, some prefixed with '^', to let it compute what
  should be in the packfile(s) containing the filtered out objects and
  then to create that(/those) packfile(s).

  So patch 5/8 in version 2 which implements the
  `--filter=<filter-spec>` option in `git repack` is very different
  from patch 6/9 which implemented it in version 1.

- By doing it this way, this version gets rid of patch 2/9 in version
  1 which added the `--print-filtered` option to pack-objects.

- We also get rid of patch 4/9 in version 1 which refactored
  piping an oid to a command.

- On the other hand, we can refactor the code which finds the pack
  prefix into a new small function as this will be needed by the new
  code implementing `--filter=<filter-spec>`, so patch 4/8 in version
  2 is new.

- As suggested by Taylor, a small test has been added to verify that
  `--filter=blob:none` can work without `--stdout` in patch 1/8
  (previously 1/9).

- In patch 3/8 (previously 5/9) which refactors the code to finish a
  pack-objects command into a new function, the `is_local` bit is not
  computed inside the new function anymore as suggested by both Junio
  and Taylor.

- The commit messages of patch 5/8 (previously 6/9) and patch 7/8
  (previously 8/9) that implement the `--filter=<filter-spec>` and
  `--filter-to=<dir>` options respectively have been changed to not
  talk about use cases related to promisor remotes as explained
  towards the end of the "Use cases for the new feature" section
  above.

- Also in patch 5/8 (previously 6/9) that implements the
  `--filter=<filter-spec>` option, the documentation for this option
  has been improved, suggesting to use it on bare repos along with -a
  and -d to get the best possible filtering.

- And in patch 7/8 (previously 8/9) that implements the
  `--filter-to=<dir>` option, a new test has been added to check that
  `--filter=<filter-spec>` and `--filter-to=<dir>` work well with
  the `--max-pack-size=...` option as suggested by Taylor.

- Also in patch 7/8 (previously 8/9) we now check that
  `--filter=<filter-spec>` has been passed and error out if not, as it
  doesn't make sense to use `--filter-to=<dir>` without
  `--filter=<filter-spec>`. This was suggested by Taylor.

- To avoid small merge conflicts in the `git gc` doc, this series has
  been rebased onto 9748a68200 (The sixth batch, 2023-06-29).
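
The `--stdin-packs` approach described in the first item above can be
sketched as follows. This is only an illustration of the input format,
assuming that the pack written with the filter is the one to exclude
with '^' so that the result holds just the filtered out objects. The
helper name and the pack names in the example call are hypothetical.

```shell
# Hypothetical helper: print the input for `git pack-objects
# --stdin-packs`. $1 is the name of the pack written with the filter;
# the remaining arguments are the names of the preexisting packs.
filtered_out_packs_input () {
	# exclude everything already in the filtered pack
	printf '^%s\n' "$1"
	shift
	# include the objects from all the preexisting packs
	for pack
	do
		printf '%s\n' "$pack"
	done
}

# Example call (placeholder names):
#   filtered_out_packs_input pack-new.pack pack-a.pack pack-b.pack
```

The output would then be fed to the standard input of a `git
pack-objects --stdin-packs` process, so no list of oids has to be kept
in memory.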

# Commit overview

* 1/8 pack-objects: allow `--filter` without `--stdout`

  This patch is the same as the first patch in the previous series and
  in v1. To be able to later repack with a filter we need `git
  pack-objects` to write packfiles when it's filtering instead of just
  writing the pack without the filtered out objects to stdout.

* 2/8 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object. No change in this patch
  compared to v1.

* 3/8 repack: refactor finishing pack-objects command

  This is a small refactoring creating a new useful function, so that
  `git repack --filter=...` will be able to reuse it. The change
  compared to v1 is that the `is_local` bit is not computed inside the
  new function anymore.

* 4/8 repack: refactor finding pack prefix

  This is a new patch with a small refactoring creating a small
  function that will be reused in the next patch.

* 5/8 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. As explained
  above it works differently than in v1. It now uses one `git
  pack-objects` process with the `--filter` option. And then another
  `git pack-objects` process with the `--stdin-packs` option. Also the
  documentation of the new option has been improved compared to v1.

* 6/8 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate packfile. No
  changes compared to v1.

* 7/8 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's interesting to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles. Since
  version 1 we now check that `--filter=<filter-spec>` has been passed
  as using `--filter-to` without it doesn't make sense. Also a new
  test has been added to check that these options work well with
  `--max-pack-size=...`.

* 8/8 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`. No change
  since v1.
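
For the two config options added in 6/8 and 8/8, a setup could look
like the following. It's a sketch: `configure_gc_filter` is a
hypothetical helper, and the filter spec and destination in the example
call are placeholders.

```shell
# Hypothetical helper: make `git gc` in repository $1 repack with
# filter $2 and write the pack with the filtered out objects to $3.
configure_gc_filter () {
	git -C "$1" config gc.repackFilter "$2" &&
	git -C "$1" config gc.repackFilterTo "$3"
}

# Example call (placeholders):
#   configure_gc_filter /srv/repo.git blob:none /mnt/cheap/packs
```

The destination has to stay accessible to the repo, for example through
the alternates mechanism, otherwise the filtered out objects would be
missing.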

# Range-diff

 1:  f4e1cc24d2 !  1:  0bd1ad3071 pack-objects: allow `--filter` without `--stdout`
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
      
        if (stdin_packs && use_internal_rev_list)
                die(_("cannot use internal rev list with --stdin-packs"));
    +
    + ## t/t5317-pack-objects-filter-objects.sh ##
    +@@ t/t5317-pack-objects-filter-objects.sh: test_expect_success 'verify blob:none packfile has no blobs' '
    +   ! grep blob verify_result
    + '
    + 
    ++test_expect_success 'verify blob:none packfile without --stdout' '
    ++  git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
    ++  HEAD
    ++  EOF
    ++  git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
    ++  ! grep blob verify_result
    ++'
    ++
    + test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
    +   git -C r1 verify-pack -v ../all.pack >verify_result &&
    +   grep -E "commit|tree" verify_result |
 2:  8cf3db088e <  -:  ---------- pack-objects: add `--print-filtered` to print omitted objects
 3:  2f3b16281c =  2:  e49cd723c7 t/helper: add 'find-pack' test-tool
 4:  0021a5e3bb <  -:  ---------- repack: refactor piping an oid to a command
 5:  dce5087cc3 !  3:  3f87772ea6 repack: refactor finishing pack-objects command
    @@ builtin/repack.c: static void remove_redundant_bitmaps(struct string_list *inclu
      
     +static int finish_pack_objects_cmd(struct child_process *cmd,
     +                             struct string_list *names,
    -+                             const char *destination)
    ++                             int local)
     +{
    -+  int local = 1;
     +  FILE *out;
     +  struct strbuf line = STRBUF_INIT;
     +
    -+  if (destination) {
    -+          const char *scratch;
    -+          local = skip_prefix(destination, packdir, &scratch);
    -+  }
    -+
     +  out = xfdopen(cmd->out, "r");
     +  while (strbuf_getline_lf(&line, out) != EOF) {
     +          struct string_list_item *item;
    @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *ar
     -  FILE *in, *out;
     +  FILE *in;
        int ret;
    --  const char *scratch;
    --  int local = skip_prefix(destination, packdir, &scratch);
    - 
    -   prepare_pack_objects(&cmd, args, destination);
    - 
    +   const char *scratch;
    +   int local = skip_prefix(destination, packdir, &scratch);
     @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *args,
                fprintf(in, "%s.pack\n", item->string);
        fclose(in);
    @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *ar
     -  strbuf_release(&line);
     -
     -  return finish_command(&cmd);
    -+  return finish_pack_objects_cmd(&cmd, names, destination);
    ++  return finish_pack_objects_cmd(&cmd, names, local);
      }
      
      int cmd_repack(int argc, const char **argv, const char *prefix)
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     -  strbuf_release(&line);
     -  fclose(out);
     -  ret = finish_command(&cmd);
    -+  ret = finish_pack_objects_cmd(&cmd, &names, NULL);
    ++  ret = finish_pack_objects_cmd(&cmd, &names, 1);
        if (ret)
                goto cleanup;
      
 6:  fedde52ca1 <  -:  ---------- repack: add `--filter=<filter-spec>` option
 -:  ---------- >  4:  9997efaf33 repack: refactor finding pack prefix
 -:  ---------- >  5:  da27ecb91b repack: add `--filter=<filter-spec>` option
 7:  6ebd274334 !  6:  49e4a184b4 gc: add `gc.repackFilter` config option
    @@ Commit message
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
      ## Documentation/config/gc.txt ##
    -@@ Documentation/config/gc.txt: or rebase occurring.  Since these changes are not part of the current
    - project most users will want to expire them sooner, which is why the
    - default is more aggressive than `gc.reflogExpire`.
    +@@ Documentation/config/gc.txt: Multiple hooks are supported, but all must exit successfully, else the
    + operation (either generating a cruft pack or unpacking unreachable
    + objects) will be halted.
      
     +gc.repackFilter::
     +  When repacking, use the specified filter to move certain
 8:  5d68501b1f !  7:  243c93aad3 repack: implement `--filter-to` for storing filtered out objects
    @@ Commit message
         accessible if, for example, the Git alternates mechanism is used to
         point to it.
     
    -    If users want to remove a pack that contains filtered out objects after
    -    checking that they are all already on a promisor remote, creating the
    -    pack in a different directory makes it easier to do so.
    +    While at it, as an example to show that `--filter` and `--filter-to`
    +    work well with other options, let's also add a test to check that these
    +    options work well with `--max-pack-size`.
     
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    +    repack: add test with --max-pack-size
    +
      ## Documentation/git-repack.txt ##
     @@ Documentation/git-repack.txt: depth is 4095.
    -   resulting packfile and put them into a separate packfile. See
    -   linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
    +   this option.  See linkgit:git-rev-list[1] for valid
    +   `<filter-spec>` forms.
      
     +--filter-to=<dir>::
     +  Write the pack containing filtered out objects to the
    @@ Documentation/git-repack.txt: depth is 4095.
        Write a reachability bitmap index as part of the repack. This
     
      ## builtin/repack.c ##
    -@@ builtin/repack.c: static void prepare_pack_filtered_cmd(struct child_process *cmd,
    - }
    - 
    - static void finish_pack_filtered_cmd(struct child_process *cmd,
    --                               struct string_list *names)
    -+                               struct string_list *names,
    -+                               const char *destination)
    - {
    -   if (cmd->in == -1) {
    -           /* No packed objects; cmd was never started */
    -@@ builtin/repack.c: static void finish_pack_filtered_cmd(struct child_process *cmd,
    - 
    -   close(cmd->in);
    - 
    --  if (finish_pack_objects_cmd(cmd, names, NULL, NULL))
    -+  if (finish_pack_objects_cmd(cmd, names, destination, NULL))
    -           die(_("could not finish pack-objects to pack filtered objects"));
    - }
    - 
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    +   int write_midx = 0;
        const char *cruft_expiration = NULL;
        const char *expire_to = NULL;
    -   struct child_process pack_filtered_cmd = CHILD_PROCESS_INIT;
     +  const char *filter_to = NULL;
      
        struct option builtin_repack_options[] = {
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
                strvec_push(&cmd.args, "--incremental");
        }
      
    --  if (po_args.filter)
    --          prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
    -+  if (po_args.filter) {
    -+          if (!filter_to)
    -+                  filter_to = packtmp;
    -+          prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
    -+  }
    - 
    ++  if (filter_to && !po_args.filter)
    ++          die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
    ++
        if (geometry)
                cmd.in = -1;
    +   else
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
        }
      
    -   if (po_args.filter)
    --          finish_pack_filtered_cmd(&pack_filtered_cmd, &names);
    -+          finish_pack_filtered_cmd(&pack_filtered_cmd, &names, filter_to);
    - 
    -   string_list_sort(&names);
    - 
    +   if (po_args.filter) {
    ++          if (!filter_to)
    ++                  filter_to = packtmp;
    ++
    +           ret = write_filtered_pack(&po_args,
    +-                                    packtmp,
    ++                                    filter_to,
    +                                     find_pack_prefix(),
    +                                     &names,
    +                                     &existing_nonkept_packs,
     
      ## t/t7700-repack.sh ##
     @@ t/t7700-repack.sh: test_expect_success 'repacking with a filter works' '
    @@ t/t7700-repack.sh: test_expect_success 'repacking with a filter works' '
     +  blob_content=$(git -C bare.git show $blob_hash) &&
     +  test "$blob_content" = "content1"
     +'
    ++
    ++test_expect_success '--filter works with --max-pack-size' '
    ++  rm -rf filtered.git &&
    ++  git init --bare filtered.git &&
    ++  git init max-pack-size &&
    ++  (
    ++          cd max-pack-size &&
    ++          test_commit base &&
    ++          # two blobs which exceed the maximum pack size
    ++          test-tool genrandom foo 1048576 >foo &&
    ++          git hash-object -w foo &&
    ++          test-tool genrandom bar 1048576 >bar &&
    ++          git hash-object -w bar &&
    ++          git add foo bar &&
    ++          git commit -m "adding foo and bar"
    ++  ) &&
    ++  git clone --no-local --bare max-pack-size max-pack-size.git &&
    ++  (
    ++          cd max-pack-size.git &&
    ++          git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
    ++                  --max-pack-size=1M \
    ++                  --filter-to=../filtered.git/objects/pack/pack &&
    ++          echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
    ++
    ++          # Check that the 3 blobs are in different packfiles in filtered.git
    ++          test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
    ++          test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
    ++          foo_pack=$(test-tool find-pack HEAD:foo) &&
    ++          bar_pack=$(test-tool find-pack HEAD:bar) &&
    ++          base_pack=$(test-tool find-pack HEAD:base.t) &&
    ++          test "$foo_pack" != "$bar_pack" &&
    ++          test "$foo_pack" != "$base_pack" &&
    ++          test "$bar_pack" != "$base_pack" &&
    ++          for pack in "$foo_pack" "$bar_pack" "$base_pack"
    ++          do
    ++                  case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
    ++          done
    ++  )
    ++'
     +
      objdir=.git/objects
      midx=$objdir/pack/multi-pack-index
 9:  ae45d9845e =  8:  8cb3faa74c gc: add `gc.repackFilterTo` config option


Christian Couder (8):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  11 ++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  15 +++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 162 ++++++++++++++++++-------
 t/helper/test-find-pack.c              |  35 ++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t5317-pack-objects-filter-objects.sh |   8 ++
 t/t6500-gc.sh                          |  23 ++++
 t/t7700-repack.sh                      |  82 +++++++++++++
 13 files changed, 311 insertions(+), 50 deletions(-)
 create mode 100644 t/helper/test-find-pack.c

-- 
2.41.0.244.g8cb3faa74c



* [PATCH v2 1/8] pack-objects: allow `--filter` without `--stdout`
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 2/8] t/helper: add 'find-pack' test-tool Christian Couder
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932c..583270a85f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -298,8 +298,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3c4db66478..614721684a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4388,12 +4388,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.41.0.244.g8cb3faa74c



* [PATCH v2 2/8] t/helper: add 'find-pack' test-tool
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
  2023-07-05  6:08   ` [PATCH v2 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 3/8] repack: refactor finishing pack-objects command Christian Couder
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects into
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 35 +++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 4 files changed, 38 insertions(+)
 create mode 100644 t/helper/test-find-pack.c

diff --git a/Makefile b/Makefile
index fb541dedc9..14ee0c45d4 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..1928fe7329
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,35 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ */
+
+static const char *find_pack_usage = "\n"
+"  test-tool find-pack <object>";
+
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(find_pack_usage);
+
+	if (repo_get_oid(the_repository, argv[1], &oid))
+		die("cannot parse %s as an object name", argv[1]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (find_pack_entry_one(oid.hash, p))
+			printf("%s\n", p->pack_name);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb..41da40c296 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c..411dbf2db4 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
-- 
2.41.0.244.g8cb3faa74c



* [PATCH v2 3/8] repack: refactor finishing pack-objects command
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
  2023-07-05  6:08   ` [PATCH v2 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 2/8] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 4/8] repack: refactor finding pack prefix Christian Couder
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting it into a string_list, as well as
calling finish_command().
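
For illustration only, the reading loop being factored out can be
sketched in standalone C roughly as below. The name
collect_pack_names_sketch(), the fixed-size array and the HEXSZ_SKETCH
constant are hypothetical simplifications: the actual code uses a
strbuf, a string_list and the_hash_algo->hexsz, and calls die() on a
malformed line instead of returning -1.

```c
#include <stdio.h>
#include <string.h>

#define HEXSZ_SKETCH 40 /* assumed: hex length of a SHA-1 object ID */

/*
 * Read pack names, one per line, from 'out'. Every line must be a
 * full hex object ID. Names are stored in 'names' only when 'local'
 * is set, mirroring how packs written outside of the repository are
 * kept out of the list. Returns the number of lines read, or -1 on a
 * malformed line.
 */
static int collect_pack_names_sketch(FILE *out, int local,
				     char names[][HEXSZ_SKETCH + 1],
				     int max_names, int *nr_names)
{
	char line[256];
	int nr_lines = 0;

	*nr_names = 0;
	while (fgets(line, sizeof(line), out)) {
		size_t len = strcspn(line, "\n");

		line[len] = '\0'; /* strip the trailing newline, if any */
		if (len != HEXSZ_SKETCH)
			return -1;
		nr_lines++;
		if (local && *nr_names < max_names) {
			memcpy(names[*nr_names], line, len + 1);
			(*nr_names)++;
		}
	}
	return nr_lines;
}
```

With 'local' set to 0 the lines are still consumed and validated, but
nothing is recorded, which is what write_cruft_pack() relies on when
its destination is outside of the repository.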

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index a96e1c2638..916ba7c6d0 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -704,6 +704,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -713,9 +743,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct string_list *existing_kept_packs)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -759,27 +788,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -790,10 +799,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1024,18 +1031,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.41.0.244.g8cb3faa74c



* [PATCH v2 4/8] repack: refactor finding pack prefix
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
                     ` (2 preceding siblings ...)
  2023-07-05  6:08   ` [PATCH v2 3/8] repack: refactor finishing pack-objects command Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.
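
As a rough standalone sketch of what this helper computes, the logic
can be written as below. Both skip_prefix_sketch() and
pack_prefix_sketch() are hypothetical names used only for this
illustration; the first is a simplified stand-in for Git's
skip_prefix(), and the second returns NULL where the real function
calls die().

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for skip_prefix(): if 'str' starts with
 * 'prefix', store a pointer to the remainder in '*out' and return 1.
 * Otherwise leave '*out' untouched and return 0.
 */
static int skip_prefix_sketch(const char *str, const char *prefix,
			      const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Return the part of 'packtmp' that follows 'packdir', without the
 * leading slash, or NULL if 'packtmp' does not begin with 'packdir'.
 */
static const char *pack_prefix_sketch(const char *packtmp,
				      const char *packdir)
{
	const char *pack_prefix;

	if (!skip_prefix_sketch(packtmp, packdir, &pack_prefix))
		return NULL;
	if (*pack_prefix == '/')
		pack_prefix++;
	return pack_prefix;
}
```

For example, with the made-up values "objects/pack/.tmp-123-pack" for
packtmp and "objects/pack" for packdir, the sketch yields
".tmp-123-pack".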

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 916ba7c6d0..4e5afee8d8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -791,6 +791,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(void)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1039,12 +1050,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix();
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.41.0.244.g8cb3faa74c



* [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
                     ` (3 preceding siblings ...)
  2023-07-05  6:08   ` [PATCH v2 4/8] repack: refactor finding pack prefix Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05 17:53     ` Junio C Hamano
  2023-07-05 18:12     ` Junio C Hamano
  2023-07-05  6:08   ` [PATCH v2 6/8] gc: add `gc.repackFilter` config option Christian Couder
                     ` (3 subsequent siblings)
  8 siblings, 2 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some large blobs take a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them onto separate storage that is cheaper, though
slower.

In other use cases it might make sense to put all the blobs into
separate storage.

This is done by running two `git pack-objects` commands. The first one
is run with `--filter=<filter-spec>`, using the specified filter. It
packs objects while omitting the objects specified by the filter.
Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass all the previously existing packs to its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass to its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
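
To make the shape of that stdin input concrete, here is a minimal
standalone sketch. The function name write_stdin_packs_sketch() and
its plain-array parameters are assumptions made for this illustration;
the patch itself iterates over string_list structures inside
write_filtered_pack().

```c
#include <stdio.h>

/*
 * Write the input for a `git pack-objects --stdin-packs` run: packs
 * whose objects should be included are listed as-is, while packs
 * whose objects should be excluded are prefixed with '^'.
 */
static void write_stdin_packs_sketch(FILE *in,
				     const char *pack_prefix,
				     const char **new_names, size_t nr_new,
				     const char **existing, size_t nr_existing,
				     const char **kept, size_t nr_kept)
{
	size_t i;

	/* Packs just written by the `--filter` run: exclude them. */
	for (i = 0; i < nr_new; i++)
		fprintf(in, "^%s-%s.pack\n", pack_prefix, new_names[i]);
	/* Previously existing non-kept packs: include them. */
	for (i = 0; i < nr_existing; i++)
		fprintf(in, "%s.pack\n", existing[i]);
	/* Kept packs: exclude them. */
	for (i = 0; i < nr_kept; i++)
		fprintf(in, "^%s.pack\n", kept[i]);
}
```

The order of the lines follows the loops in write_filtered_pack(). The
pack names used when calling such a function would come from the first
`git pack-objects` run and from the existing pack directory.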

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  9 +++++
 builtin/repack.c             | 67 ++++++++++++++++++++++++++++++++++++
 t/t7700-repack.sh            | 16 +++++++++
 3 files changed, 92 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..d702553033 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,15 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 4e5afee8d8..e2661b956c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -54,6 +54,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -174,6 +175,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -734,6 +737,57 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct string_list *names,
+			       struct string_list *existing_packs,
+			       struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	/* We need to copy 'args' to modify it */
+	struct pack_objects_args new_args = *args;
+
+	/* No need to filter again */
+	new_args.filter = NULL;
+
+	prepare_pack_objects(&cmd, &new_args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the
+	 * filtered pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "^%s.pack\n", item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -866,6 +920,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -1105,6 +1161,17 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(),
+					  &names,
+					  &existing_nonkept_packs,
+					  &existing_kept_packs);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index af79266c58..66589e4217 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -293,6 +293,22 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.41.0.244.g8cb3faa74c



* [PATCH v2 6/8] gc: add `gc.repackFilter` config option
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
                     ` (4 preceding siblings ...)
  2023-07-05  6:08   ` [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05  6:08   ` [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a separate pack.

Users might want to perform such a cleanup regularly at the same time as
they perform other repacks and cleanups, that is, as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack and a non-empty <filter-spec> is
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.
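
The "configured and not empty" rule can be sketched in standalone C as
below. The helper name repack_filter_arg_sketch() and its
buffer-based interface are hypothetical; the patch pushes the argument
onto a strvec with strvec_pushf() instead.

```c
#include <stdio.h>

/*
 * Format the extra `git repack` argument for a configured filter
 * into 'buf'. Returns 1 if an argument was produced, or 0 if the
 * filter is unset or empty, in which case 'buf' is left untouched.
 */
static int repack_filter_arg_sketch(const char *repack_filter,
				    char *buf, size_t buflen)
{
	if (!repack_filter || !*repack_filter)
		return 0;
	snprintf(buf, buflen, "--filter=%s", repack_filter);
	return 1;
}
```

Treating an empty value like an unset one means that a setting such as
`gc.repackFilter=` in a more specific config file can be used to switch
the filter off again.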

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 12 ++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index ca47eb2008..2153bde7ac 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -145,6 +145,11 @@ Multiple hooks are supported, but all must exit successfully, else the
 operation (either generating a cruft pack or unpacking unreachable
 objects) will be halted.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 91eec7703a..046147fdcc 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -171,6 +172,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -356,6 +359,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..5b89faf505 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,18 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.41.0.244.g8cb3faa74c


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
                     ` (5 preceding siblings ...)
  2023-07-05  6:08   ` [PATCH v2 6/8] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-05 18:26     ` Junio C Hamano
  2023-07-05  6:08   ` [PATCH v2 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
  8 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new, separate pack.

It would be nice if this separate pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage. Even in a different directory this pack can be
accessible if, for example, the Git alternates mechanism is used to
point to it.

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>

---
 Documentation/git-repack.txt |  6 ++++
 builtin/repack.c             | 11 +++++-
 t/t7700-repack.sh            | 66 ++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index d702553033..396a91b9ac 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -152,6 +152,12 @@ depth is 4095.
 	this option.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. This can be used for putting the pack on a
+	separate object directory that is accessed through the Git
+	alternates mechanism. Only useful with `--filter`.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index e2661b956c..5695f9734d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -879,6 +879,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -932,6 +933,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1075,6 +1078,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (filter_to && !po_args.filter)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
+
 	if (geometry)
 		cmd.in = -1;
 	else
@@ -1162,8 +1168,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(),
 					  &names,
 					  &existing_nonkept_packs,
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 66589e4217..a96c1635b2 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -309,6 +309,72 @@ test_expect_success 'repacking with a filter works' '
 	test "$blob_pack2" = "$blob_pack"
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -z "$blob_pack" &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) &&
+	test -n "$blob_pack" &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack HEAD:foo) &&
+		bar_pack=$(test-tool find-pack HEAD:bar) &&
+		base_pack=$(test-tool find-pack HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.41.0.244.g8cb3faa74c


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 8/8] gc: add `gc.repackFilterTo` config option
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
                     ` (6 preceding siblings ...)
  2023-07-05  6:08   ` [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-07-05  6:08   ` Christian Couder
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
  8 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit has implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack and a non-empty <dir> is configured
through this option, the repack process is passed a corresponding
`--filter-to=<dir>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  6 ++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 2153bde7ac..0e32007502 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -150,6 +150,12 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects.  See the
+	`--filter-to=<dir>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 046147fdcc..f31167be6d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -63,6 +63,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -173,6 +174,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -362,6 +364,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5b89faf505..37056a824b 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -214,6 +213,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo stores filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.41.0.244.g8cb3faa74c


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [PATCH 1/9] pack-objects: allow `--filter` without `--stdout`
  2023-06-21 10:49   ` Taylor Blau
@ 2023-07-05  6:16     ` Christian Couder
  0 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:16 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 12:49 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jun 14, 2023 at 09:25:33PM +0200, Christian Couder wrote:
> > 9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
> > taught `git pack-objects` to use `--filter`, but required the use of
> > `--stdout` since a partial clone mechanism was not yet in place to
> > handle missing objects. Since then, changes like 9e27beaa23
> > (promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
> > and others added support to dynamically fetch objects that were missing.
> >
> > Even without a promisor remote, filtering out objects can also be useful
> > if we can put the filtered out objects in a separate pack, and in this
> > case it also makes sense for pack-objects to write the packfile directly
> > to an actual file rather than on stdout.
> >
> > Remove the `--stdout` requirement when using `--filter`, so that in a
> > follow-up commit, repack can pass `--filter` to pack-objects to omit
> > certain objects from the resulting packfile.
>
> Makes sense.
>
> Is there any situation in which using --stdout with --filter would be a
> potential foot-gun? I am not as familiar with the partial clone
> mechanism as others CC'd, so I have no idea one way or the other.

This patch allows `--filter` without `--stdout`, so `--stdout` with
`--filter` was already allowed before this patch and is still allowed
after it.

Besides, whether or not `--stdout` is used is just a matter of
convenience. It doesn't much change what users can do with
`git pack-objects`.

> If it is unsafe in certain situations (or, at the very least, could
> produce surprising behavior), it may be worthwhile to only allow
> `--filter=<filter> --stdout` with some kind of
> `--filter-to-stdout-is-ok` flag to indicate that the caller knows what
> they are doing.

`git pack-objects` with `--filter` can be unsafe in some cases if
users do stupid things with packfiles afterwards, but allowing it to
run without --stdout doesn't significantly change that. In any case by
itself it doesn't delete any data. It just creates a pack containing
fewer objects.

Anyway I added a small test that checks that --filter can be used
without --stdout.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-06-21 11:49   ` Taylor Blau
  2023-06-21 12:08     ` Christian Couder
@ 2023-07-05  6:19     ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  6:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 1:49 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jun 14, 2023 at 09:25:40PM +0200, Christian Couder wrote:
> > diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> > index aa29c7e648..070dd22610 100644
> > --- a/Documentation/git-repack.txt
> > +++ b/Documentation/git-repack.txt
> > @@ -148,6 +148,12 @@ depth is 4095.
> >       resulting packfile and put them into a separate packfile. See
> >       linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
> >
> > +--filter-to=<dir>::
> > +     Write the pack containing filtered out objects to the
> > +     directory `<dir>`. This can be used for putting the pack on a
> > +     separate object directory that is accessed through the Git
> > +     alternates mechanism. Only useful with `--filter`.
>
> Here you say "only useful with --filter", but...
>
> > @@ -1073,8 +1077,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >               strvec_push(&cmd.args, "--incremental");
> >       }
> >
> > -     if (po_args.filter)
> > -             prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, packtmp);
> > +     if (po_args.filter) {
> > +             if (!filter_to)
> > +                     filter_to = packtmp;
> > +             prepare_pack_filtered_cmd(&pack_filtered_cmd, &po_args, filter_to);
> > +     }
>
> Would you want an "} else if (filter_to)" here to die and show the usage
> message, since --filter-to needs --filter? Or maybe it should imply
> --filter-to.

I have added such a check in the version 2 I just sent. We now die()
if --filter-to is used without --filter.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option
  2023-06-21 11:17   ` Taylor Blau
@ 2023-07-05  7:18     ` Christian Couder
  0 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-05  7:18 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jun 21, 2023 at 1:17 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jun 14, 2023 at 09:25:38PM +0200, Christian Couder wrote:
> > ---
> >  Documentation/git-repack.txt |  5 +++
> >  builtin/repack.c             | 75 ++++++++++++++++++++++++++++++++++--
> >  t/t7700-repack.sh            | 16 ++++++++
> >  3 files changed, 93 insertions(+), 3 deletions(-)
>
> Having read through the implementation in the repack builtin, I am
> almost certain that my suggestion earlier in the thread to implement
> this in terms of 'git pack-objects --filter' and 'git pack-objects
> --stdin-packs' would work.

Yeah, it works, thanks. That's what is used in the version 2 I just sent.

> > diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> > index 4017157949..aa29c7e648 100644
> > --- a/Documentation/git-repack.txt
> > +++ b/Documentation/git-repack.txt
> > @@ -143,6 +143,11 @@ depth is 4095.
> >       a larger and slower repository; see the discussion in
> >       `pack.packSizeLimit`.
> >
> > +--filter=<filter-spec>::
> > +     Remove objects matching the filter specification from the
> > +     resulting packfile and put them into a separate packfile. See
> > +     linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
> > +
>
> This documentation leaves me with a handful of questions about how it
> interacts with other options.

I have improved this doc in version 2. It now says:

--filter=<filter-spec>::
       Remove objects matching the filter specification from the
       resulting packfile and put them into a separate packfile. Note
       that objects used in the working directory are not filtered
       out. So for the split to fully work, it's best to perform it
       in a bare repo and to use the `-a` and `-d` options along with
       this option.  See linkgit:git-rev-list[1] for valid
       `<filter-spec>` forms.

For comparison the doc about --cruft is:

--cruft::
       Same as `-a`, unless `-d` is used. Then any unreachable objects
       are packed into a separate cruft pack. Unreachable objects can
       be pruned using the normal expiry rules with the next `git gc`
       invocation (see linkgit:git-gc[1]). Incompatible with `-k`.

> Here are some:
>
>   - What happens when you pass it with "-d"? Does it delete objects that
>     didn't match the filter? Leave them alone?

With or without "-d", no object is deleted; objects are just put into
2 different packfiles depending on whether or not they match the
filter.

> If the latter, should
>     this combination be declared invalid instead of silently ignoring
>     the user's request to delete redundant packs?

"-d" is indeed about removing redundant packs. If you want to split
the objects into different packs according to a filter, then you
probably don't want to keep packs that contain both kinds of objects
that you want to separate into different packs, so the doc now
recommends using "-d" (along with "-a") when using "--filter". That's
also what the tests are using.

I am not sure it's worth erroring out when "-d", or "-a", is not used.
Perhaps there are some use cases where it's interesting to keep old
packs, or, in case of not using "-a", to split only loose objects into
separate packs, but not objects from existing packs.
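
As a rough sketch of the recommended invocation, the following shell
helper only assembles and prints the command line. The helper name
`filtered_repack_cmd` and the example destination path are assumptions
made for illustration; the options themselves (`-a`, `-d`, `--filter`,
`--filter-to` and `repack.writebitmaps=false`) are the ones used in this
series and its tests.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): print the repack invocation
# recommended in this thread for splitting a bare repository by a filter.
#   $1: filter spec, e.g. blob:none
#   $2: optional pack prefix to pass to --filter-to
filtered_repack_cmd () {
	filter=$1
	dest=${2:-}
	# Bitmap writing is disabled, as in the tests of this series.
	cmd="git -c repack.writebitmaps=false repack -a -d --filter=$filter"
	if test -n "$dest"
	then
		cmd="$cmd --filter-to=$dest"
	fi
	printf '%s\n' "$cmd"
}

# Print the command without and with a separate destination.
filtered_repack_cmd blob:none
filtered_repack_cmd blob:none ../filtered.git/objects/pack/pack
```

Printing rather than executing keeps the sketch harmless on Git
versions that do not have `--filter` in `git repack`.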

>   - What happens with --max-pack-size? Does the filtered pack get split
>     into multiple packs (as I think we would expect from such a
>     combination)?

Yes, in version 2 I have added a test (in the commit introducing
--filter-to) that checks that objects that are filtered out and that
are larger than --max-pack-size get split into their own packs.

>   - What about with `--cruft`? Does it split the cruft pack into two
>     based on whether or not the unreachable object(s) matched or didn't
>     match the filter?

I am not sure, as I don't know the cruft pack code well. I would need
to check the code and/or test it.

>   - What happens when passed with "--geometric"? I don't think there is
>     a sensible interpretation (at least, I can't think of what it would
>     mean to do "--filter=<spec> --geometric=<factor>" off the top of my
>     head).

I am not sure either, as I don't know the --geometric code well.

>   - What about with "--write-bitmap-index"? Do we write one bitmap
>     index? Two? If the latter, do we combine the packs into a MIDX
>     before writing the bitmap? Should we?

In the tests, repack --filter is used with "-c
repack.writebitmaps=false" as otherwise the bitmap writing code tends
to complain that a pack is not complete or something like that. I
think it would make sense to have options that would make the bitmap
writing code write a regular bitmap index if all the objects are in a
single pack, but a MIDX if they are in multiple packs. I don't know
what --cruft or --max-pack-size are doing about this, but I think such
a feature could be useful in case those options are used.

In the meantime I would be OK with disallowing the combination of
--write-bitmap-index and --filter. The issue is that writing a bitmap
index is the default behavior, even without --write-bitmap-index, so
it might be a bit more complex, but I plan to do something about it in
version 3.

> I think it may be worth spelling out answers to some of these questions
> in the documentation, and codifying those answers in the form of tests.

In version 2, the doc has been improved and now it seems comparable to
the --cruft doc. About the tests, I added one in version 2 related to
--max-pack-size, and I am open to adding a few others, but I don't
think it's a good idea to add a lot of them. It just doesn't scale
when commands have more than a few options.

The new code is reusing existing features (like --stdin-packs) that
are already used by other code (like --cruft code). It is not adding
new intricate mechanisms, nor a lot of new code. It is not changing
code used by other features except for a few clean refactorings. This
should give a good indication that it will work well with other
features.

> This makes me wonder whether or not this option should belong in repack
> at all, or whether there should be some new special-purpose builtin that
> is designed to split existing pack(s) based on whether or not they meet
> some filter criteria.

--cruft is also splitting existing packs based on some criteria. So if
we go this way, perhaps --cruft code should be moved first to this new
special purpose builtin?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-05  6:08   ` [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-07-05 17:53     ` Junio C Hamano
  2023-07-24  9:01       ` Christian Couder
  2023-07-05 18:12     ` Junio C Hamano
  1 sibling, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-07-05 17:53 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> This could be useful if, for example, some large blobs take a lot of
> precious space on fast storage while they are rarely accessed. It could
> make sense to move them into a separate cheaper, though slower, storage.
>
> In other use cases it might make sense to put all the blobs into
> separate storage.

Minor nit.  Aren't the above two the same use case?

> This is done by running two `git pack-objects` commands. The first one
> is run with `--filter=<filter-spec>`, using the specified filter. It
> packs objects while omitting the objects specified by the filter.
> Then another `git pack-objects` command is launched using
> `--stdin-packs`. We pass it all the previously existing packs into its
> stdin, so that it will pack all the objects in the previously existing
> packs. But we also pass into its stdin, the pack created by the previous
> `git pack-objects --filter=<filter-spec>` command as well as the kept
> packs, all prefixed with '^', so that the objects in these packs will be
> omitted from the resulting pack.

When I started reading the paragraph, the first question that came
to my mind was if these two pack-objects processes can and should be
run in parallel, which is answered in the part near the end of the
paragraph.  It may be a good idea to start the paragraph with "by
running the `git pack-objects` command twice in a row" or something to
make it clear that the second one should not (and cannot) be run
before the first one completes.

In fact, isn't the call site of write_filtered_pack() in this patch
a bit too early?  The subprocess that runs with "--stdin-packs" is
started and told about the names of the pack we are going to create,
and it does not start processing until it reads everything (i.e. we
run fclose(in) in the write_filtered_pack() function), but the loop
over "names" string list in the caller that moves the tempfiles to
their final filenames comes after the call to close_object_store()
we see in the post context of the call to write_filtered_pack() that
is new in this patch.

The "--stdin-packs" one is told to exclude objects that appear in
these packs, so if the main process is a bit slow to finalize the
packfiles it created (and told the "--stdin-packs" process about),
it will not lead to repository corruption---just some objects are
included in the packfiles "--stdin-packs" one creates even though
they do not have to.  So it does not sound like a huge problem to
me, but still it somehow looks wrong.  Am I misreading the code?
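
For reference, the input that the quoted commit message describes for
the `--stdin-packs` process can be sketched in shell as follows. This is
only an illustration of the described list format: the function name
`stdin_packs_input` and the pack names are placeholders, and the real
list is built in C inside `git repack`.

```shell
#!/bin/sh
# Print the lines that would be fed to `git pack-objects --stdin-packs`.
#   $1: space-separated existing non-kept packs (their objects are packed)
#   $2: space-separated packs written by the `--filter` run (excluded)
#   $3: space-separated kept packs (excluded)
stdin_packs_input () {
	# Packs listed as-is contribute their objects to the new pack.
	for p in $1
	do
		printf '%s\n' "$p"
	done
	# Packs prefixed with '^' have their objects omitted from it.
	for p in $2 $3
	do
		printf '^%s\n' "$p"
	done
}

# Placeholder pack names, for illustration only.
stdin_packs_input "pack-old1.pack pack-old2.pack" "pack-new.pack" "pack-kept.pack"
```

Because the second process needs the names of the packs written by the
first one, the two runs are necessarily sequential.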

Thanks.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-05  6:08   ` [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
  2023-07-05 17:53     ` Junio C Hamano
@ 2023-07-05 18:12     ` Junio C Hamano
  2023-07-24  9:02       ` Christian Couder
  1 sibling, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-07-05 18:12 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> +--filter=<filter-spec>::
> +	Remove objects matching the filter specification from the
> +	resulting packfile and put them into a separate packfile. Note
> +	that objects used in the working directory are not filtered
> +	out. So for the split to fully work, it's best to perform it
> +	in a bare repo and to use the `-a` and `-d` options along with
> +	this option.  See linkgit:git-rev-list[1] for valid
> +	`<filter-spec>` forms.

After running the command with this option once, we will have two
packfiles, one with objects that match and the other with objects
that do not match the filter spec.  Then what is the next step for
the user of this feature?  Moving the former to a slower storage
was cited as a motivation for the feature, but can the user tell
which one of these two packfiles is the one that consists of the
filtered out objects?  If there is no mechanism to do so, shouldn't
we have one to make this feature more usable?

At the level of "pack-objects" command, we report the new packfiles
so that the user does not have to run "ls .git/objects/pack" before
and after the operation to compare and learn which ones are new.
I do not think "repack" that is a Porcelain should do such a
reporting on its standard output, but that means either the feature
should probably be done at the plumbing level (i.e. "pack-objects"),
or the marking of the new packfiles needs to be done in a way that
tools can later find them out, e.g. on the filesystem, similar to
the way ".keep" marker tells which ones are not to be repacked, etc.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-05  6:08   ` [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-07-05 18:26     ` Junio C Hamano
  2023-07-24  9:00       ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-07-05 18:26 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> A previous commit has implemented `git repack --filter=<filter-spec>` to
> allow users to filter out some objects from the main pack and move them
> into a new different pack.

OK, this sidesteps the question I had on an earlier step rather
nicely.  Instead of having to find out which ones are to be moved
away, just generating them in a separate location would be more
straightforward.

The implementation does not seem to restrict where --filter-to
directory can be placed, but shouldn't it make sure that it is one
of the already specified alternates directories?  Otherwise the user
will end up corrupting the repository, no?
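
For context, registering the `--filter-to` object directory as an
alternate could look like the sketch below, which mirrors what the test
in this patch does by writing to `objects/info/alternates`. The helper
name `add_alternate` is hypothetical, and the sketch assumes that the
first argument is a bare repository directory whose object directory is
`<repo>/objects`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): record an object directory as an
# alternate of a bare repository, unless it is already listed.
#   $1: bare repository directory, e.g. bare.git
#   $2: absolute path of the alternate object directory
add_alternate () {
	repo=$1
	alt=$2
	file=$repo/objects/info/alternates
	mkdir -p "$repo/objects/info" &&
	# -x matches whole lines, -F treats the path as a fixed string.
	if ! grep -qxF "$alt" "$file" 2>/dev/null
	then
		printf '%s\n' "$alt" >>"$file"
	fi
}

# Example usage in a scratch directory; the paths are placeholders.
tmp=$(mktemp -d) &&
add_alternate "$tmp/bare.git" "$tmp/filtered.git/objects" &&
cat "$tmp/bare.git/objects/info/alternates"
rm -rf "$tmp"
```

A check inside `git repack` itself, as asked about above, would be a
separate matter; this sketch only shows the manual setup.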

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v3 0/8] Repack objects into separate packfiles based on a filter
  2023-07-05  6:08 ` [PATCH v2 0/8] " Christian Couder
                     ` (7 preceding siblings ...)
  2023-07-05  6:08   ` [PATCH v2 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-07-24  8:59   ` Christian Couder
  2023-07-24  8:59     ` [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
                       ` (9 more replies)
  8 siblings, 10 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` removed the
filtered out objects altogether which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     objects can be manually removed after checking that all the objects
     it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 2

Thanks to Junio who reviewed both version 1 and 2, and to Taylor who
reviewed version 1! The changes are the following:

- In patch 5/8, which introduces `--filter=<filter-spec>` option, some
  explanations about how to find which new packfile contains the
  filtered out objects have been added to the commit message following
  Junio's comments.

- In patch 5/8, it was clarified in the commit message that `git
  pack-objects` is run twice in a row (and not in parallel) to implement
  the new option according to Junio's comments.

- In patch 5/8 also, the documentation of the new option says that
  `--no-write-bitmap-index` (or the `repack.writebitmaps` config
  option set to `false`) should be used along with the option as
  otherwise writing bitmap index will fail. And a corresponding new
  test called '--filter fails with --write-bitmap-index' has been
  added to t/t7700-repack.sh. This should address Taylor's comments
  about v1 that were not addressed by v2.

- In patch 7/8, which implements the `--filter-to=<dir>` option, the
  commit message now recommends setting up the Git alternates mechanism
  before this option is used, to make sure the directory specified by
  the new option is accessible by the repo, as the repo could otherwise
  be corrupted. It also says that in some cases it might not be
  necessary to use such a mechanism, which is why the feature doesn't
  check that the specified directory is accessible. The documentation
  of the new option also loudly warns that the repo could be corrupted
  if that directory is not accessible, for example through the Git
  alternates mechanism, and has a new link to that mechanism's
  documentation. This is to address Junio's comments.

- In patch 8/8, which implements the `gc.repackFilterTo` config
  option, a similar loud warning has been added, and similar doc
  changes have been made, to the documentation of the new config
  option (which corresponds to the `--filter-to=<dir>` command line
  option).

# Commit overview

* 1/8 pack-objects: allow `--filter` without `--stdout`

  This patch is the same as in v1 and v2. To be able to later repack
  with a filter we need `git pack-objects` to write packfiles when
  it's filtering instead of just writing the pack without the filtered
  out objects to stdout.

* 2/8 t/helper: add 'find-pack' test-tool

  No change in this patch compared to v1 and v2. For testing `git
  repack --filter=...` that we are going to implement, it's useful to
  have a test helper that can tell which packfiles contain a specific
  object.

* 3/8 repack: refactor finishing pack-objects command

  No change in this patch compared to v2. This is a small refactoring
  creating a new useful function, so that `git repack --filter=...`
  will be able to reuse it.

* 4/8 repack: refactor finding pack prefix

  No change in this patch compared to v2. This is another small
  refactoring creating a small function that will be reused in the
  next patch.

* 5/8 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with the `--filter` option. And then
  another `git pack-objects` process with the `--stdin-packs`
  option. Only the commit message, documentation and tests have been
  changed a bit since v2.

* 6/8 gc: add `gc.repackFilter` config option

  No change in this patch compared to v2 and v1. This is a gc config
  option so that `git gc` can also repack using a filter and put the
  filtered out objects into a separate packfile.

* 7/8 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's interesting to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles. Only the
  commit message and the documentation have changed since version
  2. They now explain and discuss the risks of using this option
  without making sure the specified directory is accessible by the
  repo.

* 8/8 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`. As with the
  previous commit, since v2, the doc now explains and discusses the
  risks of using this option without making sure the specified
  directory is accessible by the repo.
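
To make the two-pass approach of patch 5/8 more concrete, here is a toy
model in C. It is only a sketch under stated assumptions: the real
feature runs two `git pack-objects` processes, while this model just
marks objects in an array. All `sketch_*` names are hypothetical, the
filter mimics `blob:limit=<n>` (which omits blobs of at least `<n>`
bytes), and the second pass is assumed to pack every existing object
that the first pass left out.

```c
#include <stddef.h>

/* Hypothetical object kinds; real Git has more (e.g. tags). */
enum sketch_type { SKETCH_COMMIT, SKETCH_TREE, SKETCH_BLOB };

struct sketch_object {
	enum sketch_type type;
	unsigned long size;
	int in_main_pack;	/* set by pass 1 */
	int in_filtered_pack;	/* set by pass 2 */
};

/* Toy version of `blob:limit=<n>`: omit blobs of at least `limit` bytes. */
static int sketch_filtered_out(const struct sketch_object *obj,
			       unsigned long limit)
{
	return obj->type == SKETCH_BLOB && obj->size >= limit;
}

/*
 * Split `nr` objects between a main pack and a pack of filtered out
 * objects. Returns the number of objects placed in the second pack.
 */
size_t sketch_split(struct sketch_object *objs, size_t nr,
		    unsigned long limit)
{
	size_t i, filtered = 0;

	/* Pass 1: pack everything the filter lets through. */
	for (i = 0; i < nr; i++)
		objs[i].in_main_pack = !sketch_filtered_out(&objs[i], limit);

	/* Pass 2: pack what exists but did not land in the first pack. */
	for (i = 0; i < nr; i++) {
		objs[i].in_filtered_pack = !objs[i].in_main_pack;
		filtered += objs[i].in_filtered_pack;
	}
	return filtered;
}
```

With a commit, a tree, a 10-byte blob and a 5000-byte blob, and a limit
of 1000, only the 5000-byte blob ends up in the second pack. No object
is dropped: each one lands in exactly one of the two packs.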

# Range-diff since v2

1:  0bd1ad3071 = 1:  4d75a1d7c3 pack-objects: allow `--filter` without `--stdout`
2:  e49cd723c7 = 2:  fdf9b6e8cc t/helper: add 'find-pack' test-tool
3:  3f87772ea6 = 3:  e7cfdebc78 repack: refactor finishing pack-objects command
4:  9997efaf33 = 4:  9c51063795 repack: refactor finding pack prefix
5:  da27ecb91b ! 5:  a90e8045c3 repack: add `--filter=<filter-spec>` option
    @@ Commit message
         This new option puts the objects specified by `<filter-spec>` into a
         separate packfile.
     
    -    This could be useful if, for example, some large blobs take a lot of
    +    This could be useful if, for example, some large blobs take up a lot of
         precious space on fast storage while they are rarely accessed. It could
         make sense to move them into a separate cheaper, though slower, storage.
     
         In other use cases it might make sense to put all the blobs into
         separate storage.
     
    -    This is done by running two `git pack-objects` commands. The first one
    -    is run with `--filter=<filter-spec>`, using the specified filter. It
    -    packs objects while omitting the objects specified by the filter.
    -    Then another `git pack-objects` command is launched using
    +    It's possible to find which new packfile contains the filtered out
    +    objects using one of the following:
    +
    +      - `git verify-pack -v ...`,
    +      - `test-tool find-pack ...`, which a previous commit added,
    +      - `--filter-to=<dir>`, which a following commit will add to specify
    +        where the pack containing the filtered out objects will be.
    +
    +    This feature is implemented by running `git pack-objects` twice in a
    +    row. The first command is run with `--filter=<filter-spec>`, using the
    +    specified filter. It packs objects while omitting the objects specified
    +    by the filter. Then another `git pack-objects` command is launched using
         `--stdin-packs`. We pass it all the previously existing packs into its
         stdin, so that it will pack all the objects in the previously existing
         packs. But we also pass into its stdin, the pack created by the previous
    @@ Documentation/git-repack.txt: depth is 4095.
     +  that objects used in the working directory are not filtered
     +  out. So for the split to fully work, it's best to perform it
     +  in a bare repo and to use the `-a` and `-d` options along with
    -+  this option.  See linkgit:git-rev-list[1] for valid
    -+  `<filter-spec>` forms.
    ++  this option.  Also `--no-write-bitmap-index` (or the
    ++  `repack.writebitmaps` config option set to `false`) should be
    ++  used otherwise writing bitmap index will fail, as it supposes
    ++  a single packfile containing all the objects. See
    ++  linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
     +
      -b::
      --write-bitmap-index::
    @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavaila
     +  blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
     +  test "$blob_pack2" = "$blob_pack"
     +'
    ++
    ++test_expect_success '--filter fails with --write-bitmap-index' '
    ++  test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
    ++          --filter=blob:none &&
    ++
    ++  git -C bare.git repack -a -d --no-write-bitmap-index \
    ++          --filter=blob:none
    ++'
     +
      objdir=.git/objects
      midx=$objdir/pack/multi-pack-index
6:  49e4a184b4 = 6:  335b7f614d gc: add `gc.repackFilter` config option
7:  243c93aad3 ! 7:  b1be7f60b7 repack: implement `--filter-to` for storing filtered out objects
    @@ Commit message
         It would be nice if this new different pack could be created in a
         different directory than the regular pack. This would make it possible
         to move large blobs into a pack on a different kind of storage, for
    -    example cheaper storage. Even in a different directory this pack can be
    -    accessible if, for example, the Git alternates mechanism is used to
    -    point to it.
    +    example cheaper storage.
    +
    +    Even in a different directory, this pack can be accessible if, for
    +    example, the Git alternates mechanism is used to point to it. In fact
    +    not using the Git alternates mechanism can corrupt a repo as the
    +    generated pack containing the filtered objects might not be accessible
    +    from the repo any more. So setting up the Git alternates mechanism
    +    should be done before using this feature if the user wants the repo to
    +    be fully usable while this feature is used.
    +
    +    In some cases, like when a repo has just been cloned or when there is no
    +    other activity in the repo, it's Ok to setup the Git alternates
    +    mechanism afterwards though. It's also Ok to just inspect the generated
    +    packfile containing the filtered objects and then just move it into the
    +    '.git/objects/pack/' directory manually. That's why it's not necessary
    +    for this command to check that the Git alternates mechanism has been
    +    already setup.
     
         While at it, as an example to show that `--filter` and `--filter-to`
         work well with other options, let's also add a test to check that these
    @@ Commit message
     
      ## Documentation/git-repack.txt ##
     @@ Documentation/git-repack.txt: depth is 4095.
    -   this option.  See linkgit:git-rev-list[1] for valid
    -   `<filter-spec>` forms.
    +   a single packfile containing all the objects. See
    +   linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
      
     +--filter-to=<dir>::
     +  Write the pack containing filtered out objects to the
    -+  directory `<dir>`. This can be used for putting the pack on a
    -+  separate object directory that is accessed through the Git
    -+  alternates mechanism. Only useful with `--filter`.
    ++  directory `<dir>`. Only useful with `--filter`. This can be
    ++  used for putting the pack on a separate object directory that
    ++  is accessed through the Git alternates mechanism. **WARNING:**
    ++  If the packfile containing the filtered out objects is not
    ++  accessible, the repo could be considered corrupt by Git as it
    ++  might not be able to access the objects in that packfile. See
    ++  the `objects` and `objects/info/alternates` sections of
    ++  linkgit:gitrepository-layout[5].
     +
      -b::
      --write-bitmap-index::
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
                                          &existing_nonkept_packs,
     
      ## t/t7700-repack.sh ##
    -@@ t/t7700-repack.sh: test_expect_success 'repacking with a filter works' '
    -   test "$blob_pack2" = "$blob_pack"
    +@@ t/t7700-repack.sh: test_expect_success '--filter fails with --write-bitmap-index' '
    +           --filter=blob:none
      '
      
     +test_expect_success '--filter-to stores filtered out objects' '
8:  8cb3faa74c ! 8:  ed66511823 gc: add `gc.repackFilterTo` config option
    @@ Documentation/config/gc.txt: gc.repackFilter::
     +gc.repackFilterTo::
     +  When repacking and using a filter, see `gc.repackFilter`, the
     +  specified location will be used to create the packfile
    -+  containing the filtered out objects.  See the
    -+  `--filter-to=<dir>` option of linkgit:git-repack[1].
    ++  containing the filtered out objects. **WARNING:** The
    ++  specified location should be accessible, using for example the
    ++  Git alternates mechanism, otherwise the repo could be
    ++  considered corrupt by Git as it might not be able to access the
    ++  objects in that packfile. See the `--filter-to=<dir>` option
    ++  of linkgit:git-repack[1] and the `objects/info/alternates`
    ++  section of linkgit:gitrepository-layout[5].
     +
      gc.rerereResolved::
        Records of conflicted merge you resolved earlier are


Christian Couder (8):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 +++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 ++++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 162 ++++++++++++++++++-------
 t/helper/test-find-pack.c              |  35 ++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t5317-pack-objects-filter-objects.sh |   8 ++
 t/t6500-gc.sh                          |  23 ++++
 t/t7700-repack.sh                      |  90 ++++++++++++++
 13 files changed, 332 insertions(+), 50 deletions(-)
 create mode 100644 t/helper/test-find-pack.c

-- 
2.41.0.384.ged66511823



* [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout`
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 22:38       ` Taylor Blau
  2023-07-24  8:59     ` [PATCH v3 2/8] t/helper: add 'find-pack' test-tool Christian Couder
                       ` (8 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932c..583270a85f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -298,8 +298,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 06b33d49e9..7fca27ffbe 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4387,12 +4387,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.41.0.384.ged66511823



* [PATCH v3 2/8] t/helper: add 'find-pack' test-tool
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
  2023-07-24  8:59     ` [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 22:44       ` Taylor Blau
  2023-07-24  8:59     ` [PATCH v3 3/8] repack: refactor finishing pack-objects command Christian Couder
                       ` (7 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects in
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 35 +++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 4 files changed, 38 insertions(+)
 create mode 100644 t/helper/test-find-pack.c

diff --git a/Makefile b/Makefile
index fb541dedc9..14ee0c45d4 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..1928fe7329
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,35 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ */
+
+static const char *find_pack_usage = "\n"
+"  test-tool find-pack <object>";
+
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(find_pack_usage);
+
+	if (repo_get_oid(the_repository, argv[1], &oid))
+		die("cannot parse %s as an object name", argv[1]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (find_pack_entry_one(oid.hash, p))
+			printf("%s\n", p->pack_name);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb..41da40c296 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c..411dbf2db4 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
-- 
2.41.0.384.ged66511823



* [PATCH v3 3/8] repack: refactor finishing pack-objects command
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
  2023-07-24  8:59     ` [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-07-24  8:59     ` [PATCH v3 2/8] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 22:45       ` Taylor Blau
  2023-07-24  8:59     ` [PATCH v3 4/8] repack: refactor finding pack prefix Christian Couder
                       ` (6 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting it into a string_list, as well as
calling finish_command().

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
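For readers without the Git source tree at hand, the control flow of
the new helper can be sketched in standalone C. This is an
approximation, not the code from the patch: `strbuf`, `string_list`
and `finish_command()` are replaced by a fixed-size buffer and array,
errors are reported with a return value instead of die(), and all
`sketch_*` / `SKETCH_*` names are hypothetical. The hash length is
hard-coded to the SHA-1 hex size, where the patch uses
`the_hash_algo->hexsz`.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_HEXSZ 40		/* SHA-1 hex length (assumption) */
#define SKETCH_MAX_NAMES 16	/* arbitrary cap for this sketch */

struct sketch_names {
	char items[SKETCH_MAX_NAMES][SKETCH_HEXSZ + 1];
	size_t nr;
};

/*
 * Read newline-terminated pack hashes from `out`. Return -1 if a line
 * is not exactly SKETCH_HEXSZ characters long or if the list is full,
 * 0 otherwise. Every line is validated, but names are recorded only
 * when `local` is non-zero, so packs written outside of the
 * repository stay out of the list.
 */
int sketch_collect_pack_names(FILE *out, struct sketch_names *names,
			      int local)
{
	char line[256];

	while (fgets(line, sizeof(line), out)) {
		size_t len = strcspn(line, "\n");

		line[len] = '\0';
		if (len != SKETCH_HEXSZ)
			return -1;
		if (!local)
			continue;
		if (names->nr == SKETCH_MAX_NAMES)
			return -1;
		memcpy(names->items[names->nr++], line, len + 1);
	}
	return 0;
}
```

The length check runs before the `local` check, mirroring the order in
the patch: malformed output is rejected even when the names would not
be stored.
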
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index aea5ca9d44..96af2d1caf 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -696,6 +696,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -705,9 +735,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct string_list *existing_kept_packs)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -751,27 +780,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -782,10 +791,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1016,18 +1023,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.41.0.384.ged66511823



* [PATCH v3 4/8] repack: refactor finding pack prefix
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (2 preceding siblings ...)
  2023-07-24  8:59     ` [PATCH v3 3/8] repack: refactor finishing pack-objects command Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 22:47       ` Taylor Blau
  2023-07-24  8:59     ` [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
                       ` (5 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
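The prefix computation can be tried outside of Git with the standalone
sketch below. It is an illustration under assumptions, not the patch
itself: `packdir` and `packtmp` are passed as parameters instead of
being globals, a NULL return replaces die(), and
`sketch_skip_prefix()` is a hypothetical stand-in for Git's
skip_prefix().

```c
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for Git's skip_prefix(): if `str` starts with `prefix`,
 * store a pointer to the remainder in `*out` and return 1; otherwise
 * return 0 and leave `*out` untouched.
 */
static int sketch_skip_prefix(const char *str, const char *prefix,
			      const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Return the part of `packtmp` that follows `packdir`, without a
 * leading slash, or NULL when `packtmp` does not begin with `packdir`
 * (the helper in the patch calls die() in that case).
 */
const char *sketch_find_pack_prefix(const char *packdir,
				    const char *packtmp)
{
	const char *pack_prefix;

	if (!sketch_skip_prefix(packtmp, packdir, &pack_prefix))
		return NULL;
	if (*pack_prefix == '/')
		pack_prefix++;
	return pack_prefix;
}
```

For example, with a `packdir` of ".git/objects/pack" and a `packtmp`
of ".git/objects/pack/.tmp-123-pack", the sketch returns
".tmp-123-pack".
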
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 96af2d1caf..21e3b89f27 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -783,6 +783,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(void)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1031,12 +1042,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix();
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.41.0.384.ged66511823



* [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (3 preceding siblings ...)
  2023-07-24  8:59     ` [PATCH v3 4/8] repack: refactor finding pack prefix Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 23:04       ` Taylor Blau
  2023-07-24  8:59     ` [PATCH v3 6/8] gc: add `gc.repackFilter` config option Christian Couder
                       ` (4 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some large blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.

In other use cases it might make sense to put all the blobs into
separate storage.

It's possible to find which new packfile contains the filtered out
objects using one of the following:

  - `git verify-pack -v ...`,
  - `test-tool find-pack ...`, which a previous commit added,
  - `--filter-to=<dir>`, which a following commit will add to specify
    where the pack containing the filtered out objects will be.

This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass all the previously existing packs to its
stdin, so that it will pack all the objects in the previously existing
packs. We also pass to its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
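The input fed to the second `git pack-objects --stdin-packs` process, as
described in the commit message above, can be sketched standalone as
follows. This is only an illustration: plain string arrays stand in for
Git's `struct string_list`, and write_stdin_packs_sketch() is a
hypothetical name, not a function from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the pack list read by `git pack-objects --stdin-packs`.
 * A leading '^' asks for that pack's objects to be left out of the
 * pack being written.
 */
static void write_stdin_packs_sketch(FILE *in, const char *pack_prefix,
				     const char **new_names, size_t new_nr,
				     const char **existing, size_t existing_nr,
				     const char **kept, size_t kept_nr)
{
	size_t i;

	assert(in && pack_prefix);

	/* Packs just written by the filtered run: exclude their objects. */
	for (i = 0; i < new_nr; i++)
		fprintf(in, "^%s-%s.pack\n", pack_prefix, new_names[i]);
	/* Previously existing non-kept packs: include their objects. */
	for (i = 0; i < existing_nr; i++)
		fprintf(in, "%s.pack\n", existing[i]);
	/* Kept packs: exclude their objects as well. */
	for (i = 0; i < kept_nr; i++)
		fprintf(in, "^%s.pack\n", kept[i]);
}
```

What is left after the exclusions is the set of objects that the first,
filtered run omitted, which is what ends up in the second pack.
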
 Documentation/git-repack.txt | 12 +++++++
 builtin/repack.c             | 67 ++++++++++++++++++++++++++++++++++++
 t/t7700-repack.sh            | 24 +++++++++++++
 3 files changed, 103 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..6d5bec7716 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,18 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  Also `--no-write-bitmap-index` (or the
+	`repack.writebitmaps` config option set to `false`) should be
+	used, otherwise writing the bitmap index will fail, as it assumes
+	a single packfile containing all the objects. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 21e3b89f27..2c81b7738e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -53,6 +53,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -166,6 +167,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -726,6 +729,57 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct string_list *names,
+			       struct string_list *existing_packs,
+			       struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	/* We need to copy 'args' to modify it */
+	struct pack_objects_args new_args = *args;
+
+	/* No need to filter again */
+	new_args.filter = NULL;
+
+	prepare_pack_objects(&cmd, &new_args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the
+	 * filtered pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "^%s.pack\n", item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -858,6 +912,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -1097,6 +1153,17 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(),
+					  &names,
+					  &existing_nonkept_packs,
+					  &existing_kept_packs);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 27b66807cd..0a2c73bca7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -327,6 +327,30 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
+test_expect_success '--filter fails with --write-bitmap-index' '
+	test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
+		--filter=blob:none &&
+
+	git -C bare.git repack -a -d --no-write-bitmap-index \
+		--filter=blob:none
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.41.0.384.ged66511823



* [PATCH v3 6/8] gc: add `gc.repackFilter` config option
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (4 preceding siblings ...)
  2023-07-24  8:59     ` [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 23:07       ` Taylor Blau
  2023-07-24  8:59     ` [PATCH v3 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                       ` (3 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

Users might want to perform such a cleanup regularly at the same time as
they perform other repacks and cleanups, so as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack with a non-empty <filter-spec>
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 12 ++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index ca47eb2008..2153bde7ac 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -145,6 +145,11 @@ Multiple hooks are supported, but all must exit successfully, else the
 operation (either generating a cruft pack or unpacking unreachable
 objects) will be halted.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 19d73067aa..9b0984f301 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -61,6 +61,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -170,6 +171,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -355,6 +358,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..5b89faf505 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,18 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.41.0.384.ged66511823



* [PATCH v3 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (5 preceding siblings ...)
  2023-07-24  8:59     ` [PATCH v3 6/8] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-24  8:59     ` [PATCH v3 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

It would be nice if this new different pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage.

Even in a different directory, this pack can be accessible if, for
example, the Git alternates mechanism is used to point to it. In fact
not using the Git alternates mechanism can corrupt a repo as the
generated pack containing the filtered objects might not be accessible
from the repo any more. So setting up the Git alternates mechanism
should be done before using this feature if the user wants the repo to
be fully usable while this feature is used.

In some cases, like when a repo has just been cloned or when there is no
other activity in the repo, it's OK to set up the Git alternates
mechanism afterwards though. It's also OK to just inspect the generated
packfile containing the filtered objects and then just move it into the
'.git/objects/pack/' directory manually. That's why it's not necessary
for this command to check that the Git alternates mechanism has
already been set up.

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>

---
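The option handling added to cmd_repack() by this patch boils down to
one check and one default. A minimal standalone sketch, assuming a
hypothetical helper name and an error return where the real code calls
die():

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide where the pack of filtered out objects is written.
 * Returns -1 if --filter-to was given without --filter (cmd_repack()
 * dies in that case), 0 otherwise. On success *dest is NULL when no
 * filter is in use, else filter_to, falling back to packtmp.
 */
static int filtered_pack_destination_sketch(const char *filter,
					    const char *filter_to,
					    const char *packtmp,
					    const char **dest)
{
	assert(packtmp && dest);

	if (filter_to && !filter)
		return -1;
	if (!filter) {
		/* No filter: no separate pack is written at all. */
		*dest = NULL;
		return 0;
	}
	*dest = filter_to ? filter_to : packtmp;
	return 0;
}
```

Note that the value is a pack prefix such as
`../filtered.git/objects/pack/pack` (as in the tests below), not a bare
directory name.
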
 Documentation/git-repack.txt | 11 ++++++
 builtin/repack.c             | 11 +++++-
 t/t7700-repack.sh            | 66 ++++++++++++++++++++++++++++++++++++
 3 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 6d5bec7716..c0fbb0ed0c 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,17 @@ depth is 4095.
 	a single packfile containing all the objects. See
 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. Only useful with `--filter`. This can be
+	used for putting the pack on a separate object directory that
+	is accessed through the Git alternates mechanism. **WARNING:**
+	If the packfile containing the filtered out objects is not
+	accessible, the repo could be considered corrupt by Git as it
+	might not be able to access the objects in that packfile. See
+	the `objects` and `objects/info/alternates` sections of
+	linkgit:gitrepository-layout[5].
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 2c81b7738e..626284191b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -871,6 +871,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -924,6 +925,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1067,6 +1070,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (filter_to && !po_args.filter)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
+
 	if (geometry)
 		cmd.in = -1;
 	else
@@ -1154,8 +1160,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(),
 					  &names,
 					  &existing_nonkept_packs,
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 0a2c73bca7..2bf237ba3a 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -351,6 +351,72 @@ test_expect_success '--filter fails with --write-bitmap-index' '
 		--filter=blob:none
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -z "$blob_pack" &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) &&
+	test -n "$blob_pack" &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack HEAD:foo) &&
+		bar_pack=$(test-tool find-pack HEAD:bar) &&
+		base_pack=$(test-tool find-pack HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.41.0.384.ged66511823



* [PATCH v3 8/8] gc: add `gc.repackFilterTo` config option
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (6 preceding siblings ...)
  2023-07-24  8:59     ` [PATCH v3 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-07-24  8:59     ` Christian Couder
  2023-07-25 23:10     ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Taylor Blau
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-24  8:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit has implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack with a non-empty <dir> configured
through this option, the repack process is passed a corresponding
`--filter-to=<dir>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
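How `git gc` turns the two config values into `git repack` arguments
can be sketched standalone as below. This is an illustration only: the
fixed-size buffers and the function name are assumptions made for the
sketch, whereas builtin/gc.c pushes onto a strvec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_ARG_LEN 256

/*
 * Format the extra repack arguments derived from gc.repackFilter and
 * gc.repackFilterTo, skipping unset or empty values. Returns how many
 * arguments were written (0, 1 or 2). Values that do not fit in
 * SKETCH_ARG_LEN bytes are silently truncated in this sketch.
 */
static size_t repack_filter_args_sketch(char args[2][SKETCH_ARG_LEN],
					const char *repack_filter,
					const char *repack_filter_to)
{
	size_t nr = 0;

	assert(args);

	if (repack_filter && *repack_filter)
		snprintf(args[nr++], SKETCH_ARG_LEN, "--filter=%s",
			 repack_filter);
	if (repack_filter_to && *repack_filter_to)
		snprintf(args[nr++], SKETCH_ARG_LEN, "--filter-to=%s",
			 repack_filter_to);
	return nr;
}
```

The two values are handled independently, so setting only
`gc.repackFilterTo` still passes `--filter-to` and `git repack` then
refuses to run because `--filter` is missing.
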
 Documentation/config/gc.txt | 11 +++++++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 2153bde7ac..466466d6cc 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -150,6 +150,17 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects. **WARNING:** The
+	specified location should be accessible, using for example the
+	Git alternates mechanism, otherwise the repo could be
+	considered corrupt by Git as it might not be able to access the
+	objects in that packfile. See the `--filter-to=<dir>` option
+	of linkgit:git-repack[1] and the `objects/info/alternates`
+	section of linkgit:gitrepository-layout[5].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 9b0984f301..1b7c775d94 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -172,6 +173,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -361,6 +363,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5b89faf505..37056a824b 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -214,6 +213,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo stores filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.41.0.384.ged66511823



* Re: [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-05 18:26     ` Junio C Hamano
@ 2023-07-24  9:00       ` Christian Couder
  2023-07-24 18:18         ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  9:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 5, 2023 at 8:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > A previous commit has implemented `git repack --filter=<filter-spec>` to
> > allow users to filter out some objects from the main pack and move them
> > into a new different pack.
>
> OK, this sidesteps the question I had on an earlier step rather
> nicely.  Instead of having to find out which ones are to be moved
> away, just generating them in a separate location would be more
> straight forward.
>
> The implementation does not seem to restrict where --filter-to
> directory can be placed, but shouldn't it make sure that it is one
> of the already specified alternates directories?  Otherwise the user
> will end up corrupting the repository, no?

I don't think the implementation should restrict where the
--filter-to directory can be placed.

In version 3, which I just sent, I have written the following in the
commit message to explain this:

"
   Even in a different directory, this pack can be accessible if, for
   example, the Git alternates mechanism is used to point to it. In fact
   not using the Git alternates mechanism can corrupt a repo as the
   generated pack containing the filtered objects might not be accessible
   from the repo any more. So setting up the Git alternates mechanism
   should be done before using this feature if the user wants the repo to
   be fully usable while this feature is used.

   In some cases, like when a repo has just been cloned or when there is no
   other activity in the repo, it's OK to set up the Git alternates
   mechanism afterwards though. It's also OK to just inspect the generated
   packfile containing the filtered objects and then just move it into the
   '.git/objects/pack/' directory manually. That's why it's not necessary
   for this command to check that the Git alternates mechanism has
   already been set up.
"

I haven't mentioned cases related to promisor remotes, but I think in
some of those cases the feature can be very useful too while there is
no need to check that the Git alternates mechanism has been set up.

In version 3, the doc for the --filter-to option and the corresponding
gc.repackFilterTo config flag look like this:

+--filter-to=<dir>::
+       Write the pack containing filtered out objects to the
+       directory `<dir>`. Only useful with `--filter`. This can be
+       used for putting the pack on a separate object directory that
+       is accessed through the Git alternates mechanism. **WARNING:**
+       If the packfile containing the filtered out objects is not
+       accessible, the repo could be considered corrupt by Git as it
+       might not be able to access the objects in that packfile. See
+       the `objects` and `objects/info/alternates` sections of
+       linkgit:gitrepository-layout[5].

+gc.repackFilterTo::
+       When repacking and using a filter, see `gc.repackFilter`, the
+       specified location will be used to create the packfile
+       containing the filtered out objects. **WARNING:** The
+       specified location should be accessible, using for example the
+       Git alternates mechanism, otherwise the repo could be
+       considered corrupt by Git as it might not be able to access the
+       objects in that packfile. See the `--filter-to=<dir>` option
+       of linkgit:git-repack[1] and the `objects/info/alternates`
+       section of linkgit:gitrepository-layout[5].

So they warn about possible issues with the feature and link to some
relevant doc.

Now if we think that's not enough, I could implement a check in the
code that would warn users loudly if the directory specified by those
options is not accessible using the Git alternates mechanism. I think
it would be too restrictive to error out in that case though.
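
To make the idea concrete, here is a rough standalone sketch of what the
core of such a check might look like. It is not part of the series and
the function name is hypothetical. It only compares lines verbatim; a
real check would likely also need to resolve relative paths, skip
comment lines, and derive the object directory from the pack prefix
given to `--filter-to`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if objdir appears verbatim as a line of the given
 * alternates file, 0 otherwise. Lines longer than the buffer are
 * read in pieces and therefore never match in this sketch.
 */
static int listed_in_alternates_sketch(FILE *alternates, const char *objdir)
{
	char line[4096];

	assert(alternates && objdir);

	while (fgets(line, sizeof(line), alternates)) {
		/* Strip the trailing newline, if any. */
		line[strcspn(line, "\n")] = '\0';
		if (!strcmp(line, objdir))
			return 1;
	}
	return 0;
}
```

A caller could emit a warning, rather than die, when this returns 0.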


* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-05 17:53     ` Junio C Hamano
@ 2023-07-24  9:01       ` Christian Couder
  2023-07-24 18:28         ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-24  9:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 5, 2023 at 7:53 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > This could be useful if, for example, some large blobs take a lot of
> > precious space on fast storage while they are rarely accessed. It could
> > make sense to move them into a separate cheaper, though slower, storage.
> >
> > In other use cases it might make sense to put all the blobs into
> > separate storage.
>
> Minor nit.  Aren't the above two the same use case?

In the first case only some large blobs are moved to slower storage
and in the other case all the blobs are moved to slower storage. So
yeah the use cases are very similar. Not sure if and how I can improve
the above wording though.

> > This is done by running two `git pack-objects` commands. The first one
> > is run with `--filter=<filter-spec>`, using the specified filter. It
> > packs objects while omitting the objects specified by the filter.
> > Then another `git pack-objects` command is launched using
> > `--stdin-packs`. We pass it all the previously existing packs into its
> > stdin, so that it will pack all the objects in the previously existing
> > packs. But we also pass into its stdin, the pack created by the previous
> > `git pack-objects --filter=<filter-spec>` command as well as the kept
> > packs, all prefixed with '^', so that the objects in these packs will be
> > omitted from the resulting pack.
>
> When I started reading the paragraph, the first question that came
> to my mind was if these two pack-objects processes can and should be
> run in parallel, which is answered in the part near the end of the
> paragraph.  It may be a good idea to start the paragraph with "by
> running `git pack-objects` command twice in a row" or something to
> make it clear that one should not (and cannot) be run before the other
> completes.

Ok, in version 3 that I just sent, that paragraph starts with:

"
   This is done by running `git pack-objects` twice in a row. The first
   command is run with `--filter=<filter-spec>`, using the specified
   filter.
"

> In fact, isn't the call site of write_filtered_pack() in this patch
> a bit too early?  The subprocess that runs with "--stdin-packs" is
> started and told about the names of the pack we are going to create,
> and it does not start processing until it reads everything (i.e. we
> run fclose(in) in the write_filtered_pack() function), but the loop
> over "names" string list in the caller that moves the tempfiles to
> their final filenames comes after the call to close_object_store()
> we see in the post context of the call to write_filtered_pack() that
> is new in this patch.

I think it can work if the call to write_filtered_pack() is either
before the call to close_object_store() or after it. It would just use
the tempfiles with their temporary name in the first case and with
their final name in the second case.

write_filtered_pack() is very similar to write_cruft_pack() which is
called before the call to close_object_store(), so I prefer to keep it
before that call too, if possible, for consistency.

> The "--stdin-packs" one is told to exclude objects that appear in
> these packs, so if the main process is a bit slow to finalize the
> packfiles it created (and told the "--stdin-packs" process about),
> it will not lead to repository corruption---just some objects are
> included in the packfiles "--stdin-packs" one creates even though
> they do not have to.  So it does not sound like a huge problem to
> me, but still it somehow looks wrong.  Am I misreading the code?

I would have thought that as finish_pack_objects_cmd() calls
finish_command() the first pack-objects command (which is called
without --stdout) should be completely finished and the packfiles
fully created when write_filtered_pack() (or write_cruft_pack()) is
called, even if the object store is not closed, but you might be
right.

Perhaps this could be dealt with separately though, as I think we
might want to fix write_cruft_pack() first then.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-05 18:12     ` Junio C Hamano
@ 2023-07-24  9:02       ` Christian Couder
  0 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-24  9:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 5, 2023 at 8:12 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > +--filter=<filter-spec>::
> > +     Remove objects matching the filter specification from the
> > +     resulting packfile and put them into a separate packfile. Note
> > +     that objects used in the working directory are not filtered
> > +     out. So for the split to fully work, it's best to perform it
> > +     in a bare repo and to use the `-a` and `-d` options along with
> > +     this option.  See linkgit:git-rev-list[1] for valid
> > +     `<filter-spec>` forms.
>
> After running the command with this option once, we will have two
> packfiles, one with objects that match and the other with objects
> that do not match the filter spec.  Then what is the next step for
> the user of this feature?  Moving the former to a slower storage
> was cited as a motivation for the feature, but can the user tell
> which one of these two packfiles is the one that consists of the
> filtered out objects?  If there is no mechanism to do so, shouldn't
> we have one to make this feature more usable?
>
> At the level of "pack-objects" command, we report the new packfiles
> so that the user does not have to take "ls .git/objects/pack" before
> and after the operation to compare and learn which ones are new.
> I do not think "repack" that is a Porcelain should do such a
> reporting on its standard output, but that means either the feature
> should probably be done at the plumbing level (i.e. "pack-objects"),
> or the marking of the new packfiles needs to be done in a way that
> tools can later find them out, e.g. on the filesystem, similar to
> the way ".keep" marker tells which ones are not to be repacked, etc.

I think commands like `git verify-pack -v ...` can already tell a bit
about the content of a packfile.

Also this patch series adds `test-tool find-pack` which can help too.
It could maybe be converted into a new `git verify-pack --find-pack`
option if users want it.

Then, as you later found out, there is the --filter-to=<dir> option
added later by this series.

To clarify this, I have added the following to the commit message in
the version 3 I just sent:

"
   It's possible to find which new packfile contains the filtered out
   objects using one of the following:

     - `git verify-pack -v ...`,
     - `test-tool find-pack ...`, which a previous commit added,
     - `--filter-to=<dir>`, which a following commit will add to specify
       where the pack containing the filtered out objects will be.
"
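
As a rough illustration of the first option, the sketch below builds a
throwaway repository, repacks it into a single pack, and counts the
blobs that `git verify-pack -v` reports for that pack. It assumes `git`
is installed; the file name, commit message and identity are made up,
and no filter is applied here, so it only shows how the pack content
can be inspected.

```shell
repo=$(mktemp -d) &&
git init -q "$repo" &&
echo "hello" >"$repo/file.txt" &&
git -C "$repo" add file.txt &&
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
	commit -q -m "initial" &&

# Put every object into a single pack so there is one .idx to inspect.
git -C "$repo" repack -a -d -q &&
idx=$(ls "$repo"/.git/objects/pack/pack-*.idx) &&

# Each object line of `verify-pack -v` carries the object type.
verify_result=$(git -C "$repo" verify-pack -v "$idx") &&
blob_count=$(printf '%s\n' "$verify_result" | grep -c ' blob ')
```

A pack that holds only the objects kept by a `blob:none` filter would
show no blob lines at all, which is what the test added in patch 1/8
checks with `! grep blob verify_result`.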

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-24  9:00       ` Christian Couder
@ 2023-07-24 18:18         ` Junio C Hamano
  2023-07-25 13:41           ` Robert Coup
  2023-07-25 15:45           ` Christian Couder
  0 siblings, 2 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-07-24 18:18 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> In version 3, the doc for the --filter-to option and the corresponding
> gc.repackFilterTo config flag look like this:
>
> +--filter-to=<dir>::
> +       Write the pack containing filtered out objects to the
> +       directory `<dir>`. Only useful with `--filter`. This can be
> +       used for putting the pack on a separate object directory that
> +       is accessed through the Git alternates mechanism. **WARNING:**
> +       If the packfile containing the filtered out objects is not
> +       accessible, the repo could be considered corrupt by Git as it

"could be considered" -> "can become".

> +       migh not be able to access the objects in that packfile. See

"migh" -> "might".

> +       the `objects` and `objects/info/alternates` sections of
> +       linkgit:gitrepository-layout[5].
>
> +gc.repackFilterTo::
> +       When repacking and using a filter, see `gc.repackFilter`, the
> +       specified location will be used to create the packfile
> +       containing the filtered out objects. **WARNING:** The
> +       specified location should be accessible, using for example the
> +       Git alternates mechanism, otherwise the repo could be
> +       considered corrupt by Git as it might not be able to access the
> +       objects in that packfile. See the `--filter-to=<dir>` option
> +       of linkgit:git-repack[1] and the `objects/info/alternates`
> +       section of linkgit:gitrepository-layout[5].
>
> So they warn about possible issues with the feature and link to some
> relevant doc.

In all other parts of the system, we tend to avoid such an "unsafe
by default" design, especially when the risk is known before there
is an implementation, and instead allow an explicit end-user action
(ranging from command line option to interactive confirmation) to
opt-into more risky behaviour.  Should we consider --filter-to as
such an "always risky and prone to repository corruption" option
(just like "--hard" to "reset" always loses changes in the
working tree without warning)?

I am OK with that myself, but others may disagree.

Come to think of it, we haven't seen much reviews from those other
than Taylor.  Are folks content with the direction this series is
going in general?

Thanks.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-24  9:01       ` Christian Couder
@ 2023-07-24 18:28         ` Junio C Hamano
  2023-07-25 15:22           ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-07-24 18:28 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

>> Minor nit.  Aren't the above two the same use case?
>
> In the first case only some large blobs are moved to slower storage
> and in the other case all the blobs are moved to slower storage. So
> yeah the use cases are very similar. Not sure if and how I can improve
> the above wording though.

Just by removing one or the other, it would be quite improved, no?
Moving away some blobs could move away all or just a selected
subset.

> I think it can work if the call to write_filtered_pack() is either
> before the call to close_object_store() or after it. It would just use
> the tempfiles with their temporary name in the first case and with
> their final name in the second case.
>
> write_filtered_pack() is very similar to write_cruft_pack() which is
> called before the call to close_object_store(), so I prefer to keep it
> before that call too, if possible, for consistency.

As long as the set-up is not racy, either would be OK, as the names
are not recorded in the end result.

If the upstream tells the downstream the temporary's name and then
finalizes the temporary to the final name before the downstream
reacts to the input, however, then by the time downstream starts
working on the file, the file may not exist under its original,
temporary name, and that kind of race was what I was worried about.

> Perhaps this could be dealt with separately though, as I think we
> might want to fix write_cruft_pack() first then.

Sorry, I am not understanding this quite well.  Do you mean we
should add one more known-to-be-racy-and-incorrect codepath because
there is already a codepath that needs to be fixed anyway?

If write_cruft_pack() has a similar issue, then yeah, let's fix that
first (testing might be tricky for any racy bugs).  And let's use
the same technique as used to fix it in this series, too, so that we
do not reintroduce a similar bug due to racy setup.

Thanks.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-24 18:18         ` Junio C Hamano
@ 2023-07-25 13:41           ` Robert Coup
  2023-07-25 16:50             ` Junio C Hamano
  2023-07-25 15:45           ` Christian Couder
  1 sibling, 1 reply; 161+ messages in thread
From: Robert Coup @ 2023-07-25 13:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Hi Junio,

On Mon, 24 Jul 2023 at 19:18, Junio C Hamano <gitster@pobox.com> wrote:
> Come to think of it, we haven't seen much reviews from those other
> than Taylor.  Are folks content with the direction this series is
> going in general?

For what it's worth I have a medium-term plan similar to Gitlab's with
respect to moving chunks of repositories onto lower cost storage media
& to promisor remotes. Like others, I wasn't at all sure about the
original approach (and commented at the time). What Christian is
proposing here seems much cleaner, is usable without complex
gymnastics or safety equipment, and provides a better building block
for future work.

Rob :)

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-24 18:28         ` Junio C Hamano
@ 2023-07-25 15:22           ` Christian Couder
  2023-07-25 17:25             ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-07-25 15:22 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Mon, Jul 24, 2023 at 8:28 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> >> Minor nit.  Aren't the above two the same use case?
> >
> > In the first case only some large blobs are moved to slower storage
> > and in the other case all the blobs are moved to slower storage. So
> > yeah the use cases are very similar. Not sure if and how I can improve
> > the above wording though.
>
> Just by removing one or the other, it would be quite improved, no?
> Moving away some blobs could move away all or just a selected
> subset.

Ok, I have done that in my current version.

> > I think it can work if the call to write_filtered_pack() is either
> > before the call to close_object_store() or after it. It would just use
> > the tempfiles with their temporary name in the first case and with
> > their final name in the second case.
> >
> > write_filtered_pack() is very similar to write_cruft_pack() which is
> > called before the call to close_object_store(), so I prefer to keep it
> > before that call too, if possible, for consistency.
>
> As long as the set-up is not racy, either would be OK, as the names
> are not recorded in the end result.
>
> If the upstream tells the downstream the temporary's name and then
> finalizes the temporary to the final name before the downstream
> reacts to the input,

It doesn't seem to me that it's what happens.

We have the following order:

  - finish_pack_objects_cmd() is called for the first pack-objects
process. It populates the 'names' string_list with the temporary name
of the packfile it generated (which doesn't contain the filtered out
objects) and calls finish_command() to finish the first pack-objects
process. So as far as I understand nothing can be written anymore to
the packfile when finish_pack_objects_cmd() returns.

  - write_filtered_pack() is called. It starts the second pack-objects
process and passes it the temporary name of the packfile that was just
written, taking it from the 'names' string_list. It then calls
finish_pack_objects_cmd() for the second process which populates the
'names' string_list with the temporary name of the packfile created by
the second process and finishes the second process. So nothing can
then be written in the second packfile anymore.

  - close_object_store() is called which renames the packfiles from
the 'names' string_list giving them their final name.

So the final names are given only once both processes are finished and
both packfiles have been fully written.
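
The ordering above can be sketched with plain files standing in for the
two pack-objects processes. This is only a model of the sequencing, not
what `git repack` runs: the `.tmp-pack-*` names, the `aaa`/`bbb` hashes
and the file contents are all made up.

```shell
work=$(mktemp -d)
names=""

# Step 1: stand-in for the first pack-objects process. It runs
# synchronously, so its output is complete before the next command.
echo "objects kept by the filter" >"$work/.tmp-pack-aaa.pack"
names="$names aaa"

# Step 2: stand-in for the second process. It only needs the first
# pack to exist under its temporary name, which is still the case.
test -f "$work/.tmp-pack-aaa.pack" &&
echo "objects removed by the filter" >"$work/.tmp-pack-bbb.pack"
names="$names bbb"

# Step 3: only now are the temporary names replaced by final names.
for name in $names
do
	mv "$work/.tmp-pack-$name.pack" "$work/pack-$name.pack"
done
```

Because each step finishes before the next one starts, nothing reads a
temporary name after it has been renamed.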

> however, then by the time downstream starts
> working on the file, the file may not exist under its original,
> temporary name, and that kind of race was what I was worried about.
>
> > Perhaps this could be dealt with separately though, as I think we
> > might want to fix write_cruft_pack() first then.
>
> Sorry, I am not understanding this quite well.  Do you mean we
> should add one more known-to-be-racy-and-incorrect codepath because
> there is already a codepath that needs to be fixed anyway?

No.

> If write_cruft_pack() has a similar issue, then yeah, let's fix that
> first (testing might be tricky for any racy bugs).  And let's use
> the same technique as used to fix it in this series, too, so that we
> do not reintroduce a similar bug due to racy setup.

Yeah, that's what I mean.

I am not sure the race actually exists though. I have tried to explain
why it seems to me that things look correct, but from previous
experience I know that you are very often right, and I might have
missed something.

Thanks.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-24 18:18         ` Junio C Hamano
  2023-07-25 13:41           ` Robert Coup
@ 2023-07-25 15:45           ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-07-25 15:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Mon, Jul 24, 2023 at 8:18 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > In version 3, the doc for the --filter-to option and the corresponding
> > gc.repackFilterTo config flag look like this:
> >
> > +--filter-to=<dir>::
> > +       Write the pack containing filtered out objects to the
> > +       directory `<dir>`. Only useful with `--filter`. This can be
> > +       used for putting the pack on a separate object directory that
> > +       is accessed through the Git alternates mechanism. **WARNING:**
> > +       If the packfile containing the filtered out objects is not
> > +       accessible, the repo could be considered corrupt by Git as it
>
> "could be considered" -> "can become".
>
> > +       migh not be able to access the objects in that packfile. See
>
> "migh" -> "might".
>
> > +       the `objects` and `objects/info/alternates` sections of
> > +       linkgit:gitrepository-layout[5].

Thanks for catching these, they are fixed in my current version.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-07-25 13:41           ` Robert Coup
@ 2023-07-25 16:50             ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-07-25 16:50 UTC (permalink / raw)
  To: Robert Coup
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Robert Coup <robert.coup@koordinates.com> writes:

> ... Like others, I wasn't at all sure about the
> original approach (and commented at the time). What Christian is
> proposing here seems much cleaner, is usable without complex
> gymnastics or safety equipment, and provides a better building block
> for future work.

Nice to hear a positive feedback [*].

Thanks.

[Footnote]

 * Of course negative ones as long as they are constructive are also
   welcome;-)

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-25 15:22           ` Christian Couder
@ 2023-07-25 17:25             ` Junio C Hamano
  2023-07-25 23:08               ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-07-25 17:25 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> We have the following order:
>
>   - finish_pack_objects_cmd() is called for the first pack-objects
> process. It populates the 'names' string_list with the temporary name
> of the packfile it generated (which doesn't contain the filtered out
> objects) and calls finish_command() to finish the first pack-objects
> process. So as far as I understand nothing can be written anymore to
> the packfile when finish_pack_objects_cmd() returns.
>
>   - write_filtered_pack() is called. It starts the second pack-objects
> process and passes it the temporary name of the packfile that was just
> written, taking it from the 'names' string_list. It then calls
> finish_pack_objects_cmd() for the second process which populates the
> 'names' string_list with the temporary name of the packfile created by
> the second process and finishes the second process. So nothing can
> then be written in the second packfile anymore.
>
>   - close_object_store() is called which renames the packfiles from
> the 'names' string_list giving them their final name.

"which renames" -> "and then we enter a loop to rename" and
close_object_store() itself and its callees do not do much, but yes,
you are right.  As finish_pack_objects_cmd() is synchronous, there
cannot be such race as I feared (if we were feeding the pack objects
process that collects the objects that would have filtered out with
the final packfile paths, and if we were only renaming them to the
final paths after that close_object_store() call, then the process
would want to see the final names that are not there yet, but that's
not a race but a bug that would reliably trigger).

> So the final names are given only once both processes are finished and
> both packfiles have been fully written.

Thanks for walking through the codepaths involved.  We are good
then.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout`
  2023-07-24  8:59     ` [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-07-25 22:38       ` Taylor Blau
  2023-07-25 23:51         ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 22:38 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Mon, Jul 24, 2023 at 10:59:02AM +0200, Christian Couder wrote:
> diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
> index b26d476c64..2ff3eef9a3 100755
> --- a/t/t5317-pack-objects-filter-objects.sh
> +++ b/t/t5317-pack-objects-filter-objects.sh
> @@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
>  	! grep blob verify_result
>  '
>
> +test_expect_success 'verify blob:none packfile without --stdout' '
> +	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
> +	HEAD
> +	EOF
> +	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
> +	! grep blob verify_result
> +'

Just a couple of style nits here. It's a little strange (for me, at
least) to see the heredoc into a git process. I wonder if it might be
clearer to write something like:

    echo HEAD >in &&
    git -C r1 pack-objects --revs --filter=blob:none $packdir/pack <in

, but I could certainly go either way on that one. I am less certain
about redirecting the output into a file "packhash", only to cat it back
out.

Do later tests depend on the existence of this file? If so, then what
you have makes sense. If not, I would recommend storing the output in a
variable, which avoids both the I/O operation, and the unnecessary "cat"
sub-process.
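
To make the comparison concrete, here is a small sketch of both styles.
`fake_pack_objects` is a hypothetical stand-in for the real
`git pack-objects` call, and the hash it prints is made up; only the
way the output is captured matters here.

```shell
# Hypothetical stand-in: like pack-objects, it prints the hash of the
# pack it wrote on its standard output.
fake_pack_objects () {
	echo 1234abcd
}

scratch=$(mktemp -d)

# Style used in the patch: redirect to a file, then cat it back.
fake_pack_objects >"$scratch/packhash" &&
pack_from_file="mypackname-$(cat "$scratch/packhash").pack"

# Alternative: capture the output directly in a variable.
packhash=$(fake_pack_objects) &&
pack_from_var="mypackname-$packhash.pack"
```

Both forms yield the same pack name; the second one simply skips the
intermediate file and the extra `cat`.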

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 2/8] t/helper: add 'find-pack' test-tool
  2023-07-24  8:59     ` [PATCH v3 2/8] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-07-25 22:44       ` Taylor Blau
  2023-08-08  8:28         ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 22:44 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Mon, Jul 24, 2023 at 10:59:03AM +0200, Christian Couder wrote:
> ---
>  Makefile                  |  1 +
>  t/helper/test-find-pack.c | 35 +++++++++++++++++++++++++++++++++++
>  t/helper/test-tool.c      |  1 +
>  t/helper/test-tool.h      |  1 +
>  4 files changed, 38 insertions(+)
>  create mode 100644 t/helper/test-find-pack.c

Everything that you wrote here seems reasonable to me, and the
implementation of the new test tool is very straightforward.

I'm pretty sure that everything here is correct, and we'll implicitly
test the behavior of the new helper in following patches.

That said, I think that it might be prudent here to "test the tests" and
write a simple test script that exercises this test helper over a more
trivial case. There is definitely prior art for testing our helpers
directly in the t00?? tests.

Among the test helpers that I can think of off the top of my head, I
think a good handful of them have tests:

  - t0011-hashmap.sh
  - t0015-hash.sh
  - t0016-oidmap.sh
  - t0019-json-writer.sh
  - t0052-simple-ipc.sh
  - t0060-path-utils.sh
  - t0061-run-command.sh
  - t0063-string-list.sh
  - t0064-oid-array.sh
  - t0066-dir-iterator.sh
  - t0095-bloom.sh

I would definitely recommend adding a test here, too. Like I said
earlier, I think that you are implicitly testing the new behavior here,
but it's going to happen in much more complicated environments than
something you could construct synthetically here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 3/8] repack: refactor finishing pack-objects command
  2023-07-24  8:59     ` [PATCH v3 3/8] repack: refactor finishing pack-objects command Christian Couder
@ 2023-07-25 22:45       ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 22:45 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Mon, Jul 24, 2023 at 10:59:04AM +0200, Christian Couder wrote:
> Create a new finish_pack_objects_cmd() to refactor duplicated code
> that handles reading the packfile names from the output of a
> `git pack-objects` command and putting it into a string_list, as well as
> calling finish_command().
>
> While at it, beautify a code comment a bit in the new function.

Everything here looks good to me. Thanks for cleaning this up into its
own function and DRY-ing things up a little bit.

> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
>  1 file changed, 33 insertions(+), 37 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index aea5ca9d44..96af2d1caf 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -696,6 +696,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
>  	strbuf_release(&path);
>  }
>
> +static int finish_pack_objects_cmd(struct child_process *cmd,
> +				   struct string_list *names,
> +				   int local)

I'm glad to see "local" in the arguments list ;-). I think that the
implementation came out nice and clean here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 4/8] repack: refactor finding pack prefix
  2023-07-24  8:59     ` [PATCH v3 4/8] repack: refactor finding pack prefix Christian Couder
@ 2023-07-25 22:47       ` Taylor Blau
  2023-08-08  8:29         ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 22:47 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Mon, Jul 24, 2023 at 10:59:05AM +0200, Christian Couder wrote:
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 96af2d1caf..21e3b89f27 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -783,6 +783,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
>  	return finish_pack_objects_cmd(&cmd, names, local);
>  }
>
> +static const char *find_pack_prefix(void)
> +{
> +	const char *pack_prefix;
> +	if (!skip_prefix(packtmp, packdir, &pack_prefix))

I wonder if this might be a good opportunity to pass "packtmp" and
"packdir" as arguments to the function. I know that these are globals,
but it at least nudges us in the right direction away from adding more
global variables.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-24  8:59     ` [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-07-25 23:04       ` Taylor Blau
  2023-08-08  8:34         ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 23:04 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Mon, Jul 24, 2023 at 10:59:06AM +0200, Christian Couder wrote:
> This feature is implemented by running `git pack-objects` twice in a
> row. The first command is run with `--filter=<filter-spec>`, using the
> specified filter. It packs objects while omitting the objects specified
> by the filter. Then another `git pack-objects` command is launched using
> `--stdin-packs`. We pass it all the previously existing packs into its
> stdin, so that it will pack all the objects in the previously existing
> packs. But we also pass into its stdin, the pack created by the previous
> `git pack-objects --filter=<filter-spec>` command as well as the kept
> packs, all prefixed with '^', so that the objects in these packs will be
> omitted from the resulting pack. The result is that only the objects
> filtered out by the first `git pack-objects` command are in the pack
> resulting from the second `git pack-objects` command.

Very nice. I appreciate you taking my suggestion here; I'm hopeful that
it simplified things and resulted in fewer new lines of code.
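
For readers following along, the input handed to the second command can
be sketched as below. The helper `stdin_packs_input` and the pack names
are hypothetical, and the order of the lines is just the one this
sketch happens to use; what it shows is that packs to include are
listed as-is while packs to exclude are prefixed with '^'.

```shell
# Hypothetical helper: print the lines to feed to
# `git pack-objects --stdin-packs`.
#   $1: packs that existed before the repack (objects to include)
#   $2: pack just written by the --filter run (objects to exclude)
#   $3: kept packs (objects to exclude)
stdin_packs_input () {
	for p in $1
	do
		printf '%s\n' "$p.pack"
	done
	for p in $2 $3
	do
		printf '%s\n' "^$p.pack"
	done
}

input=$(stdin_packs_input "pack-1111 pack-2222" "pack-3333" "pack-4444")
```

In a real repository these lines would be piped into
`git pack-objects --stdin-packs <base-name>`, so that the resulting
pack contains only the objects that the filter removed.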

> Signed-off-by: John Cai <johncai86@gmail.com>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-repack.txt | 12 +++++++
>  builtin/repack.c             | 67 ++++++++++++++++++++++++++++++++++++
>  t/t7700-repack.sh            | 24 +++++++++++++
>  3 files changed, 103 insertions(+)
>
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index 4017157949..6d5bec7716 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -143,6 +143,18 @@ depth is 4095.
>  	a larger and slower repository; see the discussion in
>  	`pack.packSizeLimit`.
>
> +--filter=<filter-spec>::
> +	Remove objects matching the filter specification from the
> +	resulting packfile and put them into a separate packfile. Note
> +	that objects used in the working directory are not filtered
> +	out. So for the split to fully work, it's best to perform it
> +	in a bare repo and to use the `-a` and `-d` options along with
> +	this option.  Also `--no-write-bitmap-index` (or the
> +	`repack.writebitmaps` config option set to `false`) should be
> +	used otherwise writing bitmap index will fail, as it supposes
> +	a single packfile containing all the objects. See
> +	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
> +
>  -b::
>  --write-bitmap-index::
>  	Write a reachability bitmap index as part of the repack. This
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 21e3b89f27..2c81b7738e 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -53,6 +53,7 @@ struct pack_objects_args {
>  	const char *depth;
>  	const char *threads;
>  	const char *max_pack_size;
> +	const char *filter;
>  	int no_reuse_delta;
>  	int no_reuse_object;
>  	int quiet;
> @@ -166,6 +167,8 @@ static void prepare_pack_objects(struct child_process *cmd,
>  		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
>  	if (args->max_pack_size)
>  		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
> +	if (args->filter)
> +		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
>  	if (args->no_reuse_delta)
>  		strvec_pushf(&cmd->args, "--no-reuse-delta");
>  	if (args->no_reuse_object)
> @@ -726,6 +729,57 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
>  	return finish_command(cmd);
>  }
>
> +static int write_filtered_pack(const struct pack_objects_args *args,
> +			       const char *destination,
> +			       const char *pack_prefix,
> +			       struct string_list *names,
> +			       struct string_list *existing_packs,
> +			       struct string_list *existing_kept_packs)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct string_list_item *item;
> +	FILE *in;
> +	int ret;
> +	const char *scratch;
> +	int local = skip_prefix(destination, packdir, &scratch);
> +
> +	/* We need to copy 'args' to modify it */
> +	struct pack_objects_args new_args = *args;
> +
> +	/* No need to filter again */
> +	new_args.filter = NULL;
> +
> +	prepare_pack_objects(&cmd, &new_args, destination);
> +
> +	strvec_push(&cmd.args, "--stdin-packs");
> +
> +	cmd.in = -1;
> +
> +	ret = start_command(&cmd);
> +	if (ret)
> +		return ret;


> +	/*
> +	 * names has a confusing double use: it both provides the list
> +	 * of just-written new packs, and accepts the name of the
> +	 * filtered pack we are writing.
> +	 *
> +	 * By the time it is read here, it contains only the pack(s)
> +	 * that were just written, which is exactly the set of packs we
> +	 * want to consider kept.
> +	 */

I think that this comment partially comes from the cruft pack code,
where we use the `names` string list both to reference existing packs at
the start of the repack, and to keep track of the pack we just wrote (to
exclude its contents from the cruft pack).

But I think we only write into "names" via finish_pack_objects_cmd() to
record the name of the pack we just wrote containing objects which
didn't meet the filter's conditions.

So I think that leaving this comment in is OK, but TBH I was on the
fence when I wrote that back in f9825d1cf75 (builtin/repack.c: support
generating a cruft pack, 2022-05-20), so I would just as soon drop it.

> +	in = xfdopen(cmd.in, "w");
> +	for_each_string_list_item(item, names)
> +		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
> +	for_each_string_list_item(item, existing_packs)
> +		fprintf(in, "%s.pack\n", item->string);

> +	for_each_string_list_item(item, existing_kept_packs)
> +		fprintf(in, "^%s.pack\n", item->string);

I think we may only want to do this if `honor_pack_keep` is zero.
Otherwise we'd avoid packing objects that appear in kept packs, even if
the caller told us to include objects found in kept packs.

> +	fclose(in);
> +
> +	return finish_pack_objects_cmd(&cmd, names, local);
> +}
> +
>  static int write_cruft_pack(const struct pack_objects_args *args,
>  			    const char *destination,
>  			    const char *pack_prefix,
> @@ -858,6 +912,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  				N_("limits the maximum number of threads")),
>  		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
>  				N_("maximum size of each packfile")),
> +		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
> +				N_("object filtering")),

I suppose we're storing the filter as a string here because we're just
going to pass it down to pack-objects directly. That part makes sense,
but I think we are producing subtly inconsistent behavior when
specifying multiple --filter options.

IIRC, passing --filter more than once down to pack-objects produces a
filter whose objects match all of the individually specified
sub-filters. But IIUC, using OPT_STRING here means that later
`--filter`'s override earlier ones.

So I think at minimum we'd want to store the filter arguments in a
strvec. But I would probably just as soon parse them into a bona-fide
list_objects_filter_options struct, and then reconstruct the arguments
to pack-objects based on that.
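
To make the difference concrete, here is a minimal sketch in plain shell.
Both functions are hypothetical illustrations, not git code:
filter_last_wins mimics a single string slot, and filter_combined
accumulates every value and joins them with the "combine:<a>+<b>" form
described in git-rev-list(1). The sketch assumes simple filter specs and
does not do the escaping that the combine form needs for reserved
characters.

```shell
#!/bin/sh
# Sketch only: two ways a parser could treat repeated --filter options.
# Neither function is git code; the names are made up for illustration.

# "Last one wins", which is how a single string slot behaves.
filter_last_wins () {
	result=
	for arg
	do
		case "$arg" in
		--filter=*) result=${arg#--filter=} ;;
		esac
	done
	echo "$result"
}

# Keep every filter and join them as "combine:<a>+<b>" when there are several.
# No escaping of reserved characters is attempted here.
filter_combined () {
	result=
	count=0
	for arg
	do
		case "$arg" in
		--filter=*)
			spec=${arg#--filter=}
			if test -z "$result"
			then
				result=$spec
			else
				result="$result+$spec"
			fi
			count=$((count + 1))
			;;
		esac
	done
	if test "$count" -gt 1
	then
		echo "combine:$result"
	else
		echo "$result"
	fi
}

filter_last_wins --filter=blob:none --filter=tree:1	# prints "tree:1"
filter_combined --filter=blob:none --filter=tree:1	# prints "combine:blob:none+tree:1"
```

With the first behavior the earlier blob:none is silently dropped, which
is the inconsistency described above.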

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 27b66807cd..0a2c73bca7 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -327,6 +327,30 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>  	test_must_be_empty actual
>  '
>
> +test_expect_success 'repacking with a filter works' '
> +	git -C bare.git repack -a -d &&
> +	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&

Huh! I never knew about the test_stdout_line_count function, I thought
that we always just had test_line_count. Neat!

> +	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
> +	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
> +	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
> +	test -n "$commit_pack" &&

I wonder if the test-tool itself should exit with a non-zero code if it
can't find the given object in any pack. It would at least allow us to
drop the "test -n $foo" after every invocation of the test-helper in
this test.

Arguably callers may want to ensure that an object doesn't exist in any
pack, and this would be inconvenient for them, since they'd have to
write something like:

    test_must_fail test-tool find-pack $obj

but I think a more direct test like

    test_must_fail git cat-file -t $obj

would do just as well.

> +	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
> +	test -n "$blob_pack" &&
> +	test "$commit_pack" != "$blob_pack" &&
> +	tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) &&
> +	test "$tree_pack" = "$commit_pack" &&
> +	blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
> +	test "$blob_pack2" = "$blob_pack"
> +'

This all looks good, but I think there are a couple of more things that
we'd want to test for here:

  - That the list of all objects appears the same before and after all
    of the repacking. I think that this is tested implicitly already in
    your test, but having it written down explicitly would harden this
    against regressions that cause us to inadvertently delete an object
    we shouldn't have.

    (FWIW, I think this would be limited to running something like "git
    cat-file --batch-check='%(objectname)' --batch-all-objects" before
    and after all of the repacking, and ensuring that test_cmp on the
    two outputs succeeds).

  - Another thing that I don't think we're testing here is that objects
    that *don't* match the filter don't appear in one of the filtered
    packs. I think we'd probably want to assert on the exact contents of
    the pack by dumping the list of objects into a file like "expect",
    and then dumping the actual set of objects with "git show-index
    <$idx | cut -d' ' -f2" or something.
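
A rough standalone sketch of both checks follows. It is only an
illustration under some assumptions: it runs outside the test suite, so
it uses cmp instead of test_cmp; it uses a plain "git repack -a -d" as a
stand-in for the filtered repack; and the throwaway repository, the
list_all_objects helper and the file names are all made up for the
example.

```shell
#!/bin/sh
# Sketch only: check that a repack neither loses nor invents objects, and
# that the resulting pack holds exactly the expected objects.
# A plain "git repack -a -d" stands in for the filtered repack.

list_all_objects () {
	# Print the name of every object in the repository, loose or packed.
	git -C "$1" cat-file --batch-check='%(objectname)' --batch-all-objects |
	LC_ALL=C sort
}

repo=$(mktemp -d) &&
git init -q "$repo" &&
echo content1 >"$repo/file1" &&
git -C "$repo" add file1 &&
git -C "$repo" -c user.name=A -c user.email=a@example.com \
	commit -q -m initial &&

list_all_objects "$repo" >"$repo.before" &&
git -C "$repo" repack -a -d -q &&
list_all_objects "$repo" >"$repo.after" &&

# Dump the object names recorded in each pack index.
for idx in "$repo"/.git/objects/pack/pack-*.idx
do
	git show-index <"$idx" | cut -d' ' -f2
done | LC_ALL=C sort >"$repo.packed" &&

cmp "$repo.before" "$repo.after" &&
cmp "$repo.before" "$repo.packed" &&
echo "object list unchanged"
```

In the real test the "expect" file for each pack would of course list
only the objects that should land in that pack, rather than everything.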

Another thought from the OPT_STRING business above is that we probably
want to test this with non-trivial filter arguments. There are probably
a handful of interesting cases here, like passing `--no-filter`, passing
`--filter` multiple times, passing invalid values for `--filter`, etc.

> +test_expect_success '--filter fails with --write-bitmap-index' '
> +	test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
> +		--filter=blob:none &&

Do we want to ensure that we get the exit code corresponding with
showing the usage text? I could go either way, but I do think that we
should grep through the output on stderr to ensure that we get the
appropriate error message.

> +	git -C bare.git repack -a -d --no-write-bitmap-index \
> +		--filter=blob:none

I don't think that this test is adding anything that the above
"repacking with a filter works" test isn't covering already.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 6/8] gc: add `gc.repackFilter` config option
  2023-07-24  8:59     ` [PATCH v3 6/8] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-07-25 23:07       ` Taylor Blau
  2023-08-08  8:38         ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 23:07 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Mon, Jul 24, 2023 at 10:59:07AM +0200, Christian Couder wrote:
> A previous commit has implemented `git repack --filter=<filter-spec>` to
> allow users to filter out some objects from the main pack and move them
> into a new different pack.
>
> Users might want to perform such a cleanup regularly at the same time as
> they perform other repacks and cleanups, so as part of `git gc`.
>
> Let's allow them to configure a <filter-spec> for that purpose using a
> new gc.repackFilter config option.

Makes sense.

> Now when `git gc` will perform a repack with a <filter-spec> configured
> through this option and not empty, the repack process will be passed a
> corresponding `--filter=<filter-spec>` argument.

I may be missing something, but what happens if the user has configured
gc.repackFilter, but passes additional filters over the command-line
arguments? I'm not sure whether these should be AND'd with the existing
filters in config, or if they should reset them to zero, or something
else.

Regardless, I think it would be beneficial to users if we spelled this
out in git-gc(1) instead of just this patch message here.

> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index 69509d0c11..5b89faf505 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -202,6 +202,18 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
>  	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
>  '
>
> +test_expect_success 'gc.repackFilter launches repack with a filter' '
> +	test_when_finished "rm -rf bare.git" &&
> +	git clone --no-local --bare . bare.git &&
> +
> +	git -C bare.git -c gc.cruftPacks=false gc &&
> +	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
> +
> +	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&

Nit: can we wrap this across multiple lines?

> +	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
> +	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
> +'

I think the `test_subcommand` helper might work here, and it would allow
you to avoid writing a long grep invocation.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-25 17:25             ` Junio C Hamano
@ 2023-07-25 23:08               ` Junio C Hamano
  2023-08-08  8:45                 ` Christian Couder
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-07-25 23:08 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Junio C Hamano <gitster@pobox.com> writes:

> Thanks for walking through the codepaths involved.  We are good
> then.

Sorry, but not so fast.

https://github.com/git/git/actions/runs/5661445152 (seen with this topic)
https://github.com/git/git/actions/runs/5662517690 (seen w/o this topic)

The former fails t7700 in the linux-TEST-vars job, while the latter
passes the same job.

Thanks.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 0/8] Repack objects into separate packfiles based on a filter
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (7 preceding siblings ...)
  2023-07-24  8:59     ` [PATCH v3 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-07-25 23:10     ` Taylor Blau
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
  9 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-07-25 23:10 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Mon, Jul 24, 2023 at 10:59:01AM +0200, Christian Couder wrote:
> # Changes since version 2
>
> Thanks to Junio who reviewed both version 1 and 2, and to Taylor who
> reviewed version 1! The changes are the following:

Apologies for not getting to the second version sooner! It fell off of
my post-vacation to-do list, and I'm only just getting to the third
round.

Overall I am happy with the direction here and think that this is on
the right track. The major points (using --stdin-packs to implement the
complementary pack, the behavior of --filter-to, etc.) all look good to
me.

But I think there are a handful of smaller issues that we may want to at
least discuss first. If so, I think that a handful of them merit a
reroll. But I imagine that that rerolled version would be ready to get
picked up.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/8] pack-objects: allow `--filter` without `--stdout`
  2023-07-25 22:38       ` Taylor Blau
@ 2023-07-25 23:51         ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-07-25 23:51 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Taylor Blau <me@ttaylorr.com> writes:

> On Mon, Jul 24, 2023 at 10:59:02AM +0200, Christian Couder wrote:
>> diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
>> index b26d476c64..2ff3eef9a3 100755
>> --- a/t/t5317-pack-objects-filter-objects.sh
>> +++ b/t/t5317-pack-objects-filter-objects.sh
>> @@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
>>  	! grep blob verify_result
>>  '
>>
>> +test_expect_success 'verify blob:none packfile without --stdout' '
>> +	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
>> +	HEAD
>> +	EOF
>> +	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
>> +	! grep blob verify_result
>> +'
>
> Just a couple of style nits here. It's a little strange (for me, at
> least) to see the heredoc into a git process.

FWIW, "git" is after all no different from any other command and
redirecting a here-doc into it, especially when an option like
"--stdin" is in effect, is perfectly sensible, I think.  Preparing an
input in a file (e.g. "cat >file <<EOF" followed by "git cmd <file")
might give you slightly better debuggability, but I do not sense that
it is what you are worried about.

> ... I am less certain
> about redirecting the output into a file "packhash", only to cat it back
> out.

But that would make the syntax awkward.  Do you mean something along
this line?

	var=$(git ... <<-EOF
		here text
	EOF
	) &&
	git ... mypackname-$var.pack &&
	...

Somehow here-doc and $(command substitution) do not visually mix
well.

Also, $var will not be inspectable when running this test under "-i
-v", so it hurts debuggability compared to taking the output in a
temporary file.  You could do "-x", of course, but that would make
everything ultra verbose, so...
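
For reference, the two styles side by side, as a sketch that runs
without git. fake_pack_objects is a hypothetical stand-in for "git
pack-objects" that just prints a made-up token, and the scratch
directory is only there to keep the example self-contained.

```shell
#!/bin/sh
# Sketch only: compare "output to a file" with "command substitution
# around a here-doc".  fake_pack_objects is a made-up stand-in for
# "git pack-objects"; it reads stdin and prints one fixed token.

fake_pack_objects () {
	cat >/dev/null &&
	echo 0123abcd
}

tmp=$(mktemp -d) &&

# Style 1: keep the output in a file, which stays around for inspection.
fake_pack_objects >"$tmp/packhash" <<EOF &&
HEAD
EOF
name_from_file="mypackname-$(cat "$tmp/packhash").pack" &&

# Style 2: capture the output in a variable; nothing is left on disk.
var=$(fake_pack_objects <<EOF
HEAD
EOF
) &&
name_from_var="mypackname-$var.pack" &&

echo "$name_from_file" &&
echo "$name_from_var"
```

Both produce the same name; the difference is only in readability and
in what is left behind to look at when the test fails.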

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v4 0/8] Repack objects into separate packfiles based on a filter
  2023-07-24  8:59   ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Christian Couder
                       ` (8 preceding siblings ...)
  2023-07-25 23:10     ` [PATCH v3 0/8] Repack objects into separate packfiles based on a filter Taylor Blau
@ 2023-08-08  8:26     ` Christian Couder
  2023-08-08  8:26       ` [PATCH v4 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
                         ` (9 more replies)
  9 siblings, 10 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` removed the
filtered out objects altogether which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     objects can be manually removed after checking that all the
     objects it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also, since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 3

Thanks to Junio who reviewed versions 1, 2 and 3, and to Taylor
who reviewed versions 1 and 3! The changes are the following:

- In patch 2/8, which introduces `test-tool find-pack`, a new
  `--check-count <n>` option has been added to check the number of
  packfiles an object is in. To keep things simple and extendable, the
  parse-options API is now used to parse arguments and options.

- Also in patch 2/8, a test script 't0080-find-pack.sh' has been
  introduced to test `test-tool find-pack`, as suggested by Taylor.

- In patch 4/8, which refactors code into a find_pack_prefix()
  function, this function has been changed to accept a `packdir` and a
  `packtmp` argument, instead of using the global variables with the
  same names, as suggested by Taylor.

- In patch 5/8, which introduces `--filter=<filter-spec>` option, a
  `struct list_objects_filter_option` and some related functions and
  macros are now used to handle these options, instead of a character
  string. This allows more than one `--filter=<filter-spec>` option to
  be passed, and a new test has been added to check that this works,
  as suggested by Taylor.

- In patch 5/8, some changes have been made to better handle kept
  packfiles and related tests have been added to check that this works
  well, as suggested by Taylor.

- In patch 5/8, a comment about the 'names' variable has been
  shortened a lot and improved a bit with additional useful
  information, as suggested by Taylor.

- Also in patch 5/8, tests have been improved and shortened by using
  the new `--check-count <n>` option of `test-tool find-pack`.

- Also in patch 5/8, the test that checks that `--filter=...` fails
  with `--write-bitmap-index` has been changed to use
  GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 which should fix a CI test
  that sets this variable to 1. This test has also been simplified by
  removing a useless call to `repack --filter=...` as suggested by
  Taylor.

- Also in patch 5/8, the commit message has been improved a bit and
  now only talks about the use case of moving some blobs that take up
  precious space to a cheaper storage, as suggested by Junio.

- In patch 6/8, which implements the `gc.repackFilter` config option,
  a line in the tests that was too long has been split over 2 lines,
  as suggested by Taylor.

- In patch 7/8, which implements the `--filter-to=<dir>` option, the
  documentation of that option talking about possible corruption has
  been clarified a bit, as suggested by Junio.

- Also in patch 7/8, tests have been improved and shortened by using
  the new `--check-count <n>` option of `test-tool find-pack`.

# Commit overview

* 1/8 pack-objects: allow `--filter` without `--stdout`

  This patch is the same as in v1, v2 and v3. To be able to later
  repack with a filter we need `git pack-objects` to write packfiles
  when it's filtering instead of just writing the pack without the
  filtered out objects to stdout.

* 2/8 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object. Since v3 the new
  `--check-count <n>` option has been added, and tests have been added
  in a new 't0080-find-pack.sh' test script.

* 3/8 repack: refactor finishing pack-objects command

  No change in this patch compared to v2 and v3. This is a small
  refactoring creating a new useful function, so that `git repack
  --filter=...` will be able to reuse it.

* 4/8 repack: refactor finding pack prefix

  This is another small refactoring creating a small function that
  will be reused in the next patch. Since v3 the new function
  introduced in this patch has been changed to accept a `packdir` and
  a `packtmp` argument, instead of using the global variables with the
  same names.

* 5/8 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with the `--filter` option. And then
  another `git pack-objects` process with the `--stdin-packs`
  option. A lot of changes have been made since v3:

    - The `list_objects_filter_option` struct and some related
      functions and macros are used to handle the new
      `--filter=<filter-spec>` option. A new test has been added to
      check that using multiple such options works.

    - Handling of kept packfiles has been improved and related tests
      have been added.

    - A comment about the 'names' variable has been shortened a lot
      and improved a bit.

    - Tests have been improved and shortened by using the new
      `--check-count <n>` option of `test-tool find-pack`.

    - The test that checks that `--filter=...` fails with
      `--write-bitmap-index` has been improved to pass a CI test and
      shortened.

    - The commit message has been improved a bit.

* 6/8 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate
  packfile. Since v3, a line in the tests that was too long has been
  split over 2 lines.

* 7/8 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's interesting to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles. Since v3,
  documentation of that option talking about possible corruption has
  been clarified a bit, and tests have been improved and shortened by
  using the new `--check-count <n>` option of `test-tool find-pack`.

* 8/8 gc: add `gc.repackFilterTo` config option

  No change in this patch compared to v3. This allows specifying the
  location of the packfile that contains the filtered out objects when
  using `gc.repackFilter`.

# Range-diff since v3

(Sorry, but the range-diff doesn't show changes in patches 2/8 and 5/8
as there have been a lot of changes in them. Instead it shows that the
old commit has been removed and a new one added.)

1:  4d75a1d7c3 = 1:  4d75a1d7c3 pack-objects: allow `--filter` without `--stdout`
2:  fdf9b6e8cc < -:  ---------- t/helper: add 'find-pack' test-tool
-:  ---------- > 2:  0bf9f53158 t/helper: add 'find-pack' test-tool
3:  e7cfdebc78 = 3:  54060d775e repack: refactor finishing pack-objects command
4:  9c51063795 ! 4:  948ea541ae repack: refactor finding pack prefix
    @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *ar
        return finish_pack_objects_cmd(&cmd, names, local);
      }
      
    -+static const char *find_pack_prefix(void)
    ++static const char *find_pack_prefix(char *packdir, char *packtmp)
     +{
     +  const char *pack_prefix;
     +  if (!skip_prefix(packtmp, packdir, &pack_prefix))
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     -                      packtmp, packdir);
     -          if (*pack_prefix == '/')
     -                  pack_prefix++;
    -+          const char *pack_prefix = find_pack_prefix();
    ++          const char *pack_prefix = find_pack_prefix(packdir, packtmp);
      
                if (!cruft_po_args.window)
                        cruft_po_args.window = po_args.window;
5:  a90e8045c3 < -:  ---------- repack: add `--filter=<filter-spec>` option
-:  ---------- > 5:  0635425289 repack: add `--filter=<filter-spec>` option
6:  335b7f614d ! 6:  bf8be2c812 gc: add `gc.repackFilter` config option
    @@ t/t6500-gc.sh: test_expect_success 'one of gc.reflogExpire{Unreachable,}=never d
     +  git -C bare.git -c gc.cruftPacks=false gc &&
     +  test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
     +
    -+  GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
    ++  GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none \
    ++          -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
     +  test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
     +  grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
     +'
7:  b1be7f60b7 ! 7:  abe7526222 repack: implement `--filter-to` for storing filtered out objects
    @@ Commit message
     
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    -    repack: add test with --max-pack-size
    -
      ## Documentation/git-repack.txt ##
     @@ Documentation/git-repack.txt: depth is 4095.
        a single packfile containing all the objects. See
    @@ Documentation/git-repack.txt: depth is 4095.
     +  used for putting the pack on a separate object directory that
     +  is accessed through the Git alternates mechanism. **WARNING:**
     +  If the packfile containing the filtered out objects is not
    -+  accessible, the repo could be considered corrupt by Git as it
    -+  migh not be able to access the objects in that packfile. See
    -+  the `objects` and `objects/info/alternates` sections of
    ++  accessible, the repo can become corrupt as it might not be
    ++  possible to access the objects in that packfile. See the
    ++  `objects` and `objects/info/alternates` sections of
     +  linkgit:gitrepository-layout[5].
     +
      -b::
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
        };
      
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    -           strvec_push(&cmd.args, "--incremental");
    -   }
    - 
    -+  if (filter_to && !po_args.filter)
    +   if (po_args.filter_options.choice)
    +           strvec_pushf(&cmd.args, "--filter=%s",
    +                        expand_list_objects_filter_spec(&po_args.filter_options));
    ++  else if (filter_to)
     +          die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
    -+
    + 
        if (geometry)
                cmd.in = -1;
    -   else
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
        }
      
    -   if (po_args.filter) {
    +   if (po_args.filter_options.choice) {
     +          if (!filter_to)
     +                  filter_to = packtmp;
     +
                ret = write_filtered_pack(&po_args,
     -                                    packtmp,
     +                                    filter_to,
    -                                     find_pack_prefix(),
    +                                     find_pack_prefix(packdir, packtmp),
    +                                     &keep_pack_list,
                                          &names,
    -                                     &existing_nonkept_packs,
     
      ## t/t7700-repack.sh ##
    -@@ t/t7700-repack.sh: test_expect_success '--filter fails with --write-bitmap-index' '
    -           --filter=blob:none
    +@@ t/t7700-repack.sh: test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
    +   )
      '
      
     +test_expect_success '--filter-to stores filtered out objects' '
    @@ t/t7700-repack.sh: test_expect_success '--filter fails with --write-bitmap-index
     +  test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
     +  test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
     +
    -+  commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
    -+  test -n "$commit_pack" &&
    -+  blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
    -+  test -z "$blob_pack" &&
    ++  commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
    ++  blob_pack=$(test-tool -C bare.git find-pack -c 0 HEAD:file1) &&
     +  blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
     +  test -n "$blob_hash" &&
    -+  blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) &&
    -+  test -n "$blob_pack" &&
    ++  blob_pack=$(test-tool -C filtered.git find-pack -c 1 $blob_hash) &&
     +
     +  echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
    -+  blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
    -+  test -n "$blob_pack" &&
    ++  blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
     +  blob_content=$(git -C bare.git show $blob_hash) &&
     +  test "$blob_content" = "content1"
     +'
    @@ t/t7700-repack.sh: test_expect_success '--filter fails with --write-bitmap-index
     +          # Check that the 3 blobs are in different packfiles in filtered.git
     +          test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
     +          test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
    -+          foo_pack=$(test-tool find-pack HEAD:foo) &&
    -+          bar_pack=$(test-tool find-pack HEAD:bar) &&
    -+          base_pack=$(test-tool find-pack HEAD:base.t) &&
    ++          foo_pack=$(test-tool find-pack -c 1 HEAD:foo) &&
    ++          bar_pack=$(test-tool find-pack -c 1 HEAD:bar) &&
    ++          base_pack=$(test-tool find-pack -c 1 HEAD:base.t) &&
     +          test "$foo_pack" != "$bar_pack" &&
     +          test "$foo_pack" != "$base_pack" &&
     +          test "$bar_pack" != "$base_pack" &&
8:  ed66511823 = 8:  ccdc858f73 gc: add `gc.repackFilterTo` config option


Christian Couder (8):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 ++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 +++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 169 +++++++++++++++------
 t/helper/test-find-pack.c              |  50 +++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t0080-find-pack.sh                   |  82 +++++++++++
 t/t5317-pack-objects-filter-objects.sh |   8 +
 t/t6500-gc.sh                          |  24 +++
 t/t7700-repack.sh                      | 196 +++++++++++++++++++++++++
 14 files changed, 543 insertions(+), 50 deletions(-)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

-- 
2.42.0.rc0.8.g76fac86b0e


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v4 1/8] pack-objects: allow `--filter` without `--stdout`
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-08  8:26       ` [PATCH v4 2/8] t/helper: add 'find-pack' test-tool Christian Couder
                         ` (8 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than to stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932c..583270a85f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -298,8 +298,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d2a162d528..000ebec7ab 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4400,12 +4400,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.42.0.rc0.8.g76fac86b0e


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v4 2/8] t/helper: add 'find-pack' test-tool
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
  2023-08-08  8:26       ` [PATCH v4 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-09 21:18         ` Taylor Blau
  2023-08-08  8:26       ` [PATCH v4 3/8] repack: refactor finishing pack-objects command Christian Couder
                         ` (7 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects into
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Let's also make it possible to check if a given object is in the
expected number of packfiles with a `--check-count <n>` option.
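
As a rough illustration of the `--check-count` semantics described above,
here is a minimal standalone sketch. It does not use Git's internals: the
`toy_pack` type and both function names are hypothetical, and object IDs
are plain strings. It only mirrors the idea of counting the packs that
contain an object and comparing against an optional expected count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy pack: a name plus the object IDs it contains (hypothetical type). */
struct toy_pack {
	const char *name;
	const char *const *oids;
	size_t nr;
};

/* Count the packs that contain `oid`. */
static int count_packs_with(const struct toy_pack *packs, size_t nr_packs,
			    const char *oid)
{
	int found = 0;
	size_t i, j;

	assert(oid);
	for (i = 0; i < nr_packs; i++)
		for (j = 0; j < packs[i].nr; j++)
			if (!strcmp(packs[i].oids[j], oid)) {
				found++;
				break;
			}
	return found;
}

/*
 * Mirror of the final check: an expected count of -1 means that
 * '--check-count' was not given. Returns 0 on success, -1 on mismatch.
 */
static int check_count(int expected, int actual)
{
	if (expected > -1 && expected != actual)
		return -1;
	return 0;
}
```

With this shape, `--check-count 0` is a way to assert that an object is
not packed at all, which is how the tests use `-c 0` for loose objects.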

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 50 ++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/t0080-find-pack.sh      | 82 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

diff --git a/Makefile b/Makefile
index fb541dedc9..14ee0c45d4 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..4b9e09ce25
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,50 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ *
+ * If '--check-count <n>' is passed, then error out if the number of
+ * packfiles containing the object is not <n>.
+ */
+
+static const char *find_pack_usage[] = {
+	"test-tool find-pack [--check-count <n>] <object>",
+	NULL
+};
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+	int count = -1, actual_count = 0;
+	const char *prefix = setup_git_directory();
+
+	struct option options[] = {
+		OPT_INTEGER('c', "check-count", &count, "expected number of packs"),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, prefix, options, find_pack_usage, 0);
+	if (argc != 1)
+		usage(find_pack_usage[0]);
+
+	if (repo_get_oid(the_repository, argv[0], &oid))
+		die("cannot parse %s as an object name", argv[0]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		if (find_pack_entry_one(oid.hash, p)) {
+			printf("%s\n", p->pack_name);
+			actual_count++;
+		}
+
+	if (count > -1 && count != actual_count)
+		die("bad packfile count %d instead of %d", actual_count, count);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb..41da40c296 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c..411dbf2db4 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
diff --git a/t/t0080-find-pack.sh b/t/t0080-find-pack.sh
new file mode 100755
index 0000000000..67b11216a3
--- /dev/null
+++ b/t/t0080-find-pack.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='test `test-tool find-pack`'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	test_commit three &&
+	test_commit four &&
+	test_commit five
+'
+
+test_expect_success 'repack everything into a single packfile' '
+	git repack -a -d --no-write-bitmap-index &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+	head_tree_pack=$(test-tool find-pack HEAD^{tree}) &&
+	one_pack=$(test-tool find-pack HEAD:one.t) &&
+	three_pack=$(test-tool find-pack HEAD:three.t) &&
+	old_commit_pack=$(test-tool find-pack HEAD~4) &&
+
+	test-tool find-pack --check-count 1 HEAD &&
+	test-tool find-pack --check-count=1 HEAD^{tree} &&
+	! test-tool find-pack --check-count=0 HEAD:one.t &&
+	! test-tool find-pack -c 2 HEAD:one.t &&
+	test-tool find-pack -c 1 HEAD:three.t &&
+
+	# Packfile exists at the right path
+	case "$head_commit_pack" in
+		".git/objects/pack/pack-"*".pack") true ;;
+		*) false ;;
+	esac &&
+	test -f "$head_commit_pack" &&
+
+	# Everything is in the same pack
+	test "$head_commit_pack" = "$head_tree_pack" &&
+	test "$head_commit_pack" = "$one_pack" &&
+	test "$head_commit_pack" = "$three_pack" &&
+	test "$head_commit_pack" = "$old_commit_pack"
+'
+
+test_expect_success 'add more packfiles' '
+	git rev-parse HEAD^{tree} HEAD:two.t HEAD:four.t >objects &&
+	git pack-objects .git/objects/pack/mypackname1 >packhash1 <objects &&
+
+	git rev-parse HEAD~ HEAD~^{tree} HEAD:five.t >objects &&
+	git pack-objects .git/objects/pack/mypackname2 >packhash2 <objects &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+
+	# HEAD^{tree} is in 2 packfiles
+	test-tool find-pack HEAD^{tree} >head_tree_packs &&
+	grep "$head_commit_pack" head_tree_packs &&
+	grep mypackname1 head_tree_packs &&
+	! grep mypackname2 head_tree_packs &&
+	test-tool find-pack --check-count 2 HEAD^{tree} &&
+	! test-tool find-pack --check-count 1 HEAD^{tree} &&
+
+	# HEAD:five.t is also in 2 packfiles
+	test-tool find-pack HEAD:five.t >five_packs &&
+	grep "$head_commit_pack" five_packs &&
+	! grep mypackname1 five_packs &&
+	grep mypackname2 five_packs &&
+	test-tool find-pack -c 2 HEAD:five.t &&
+	! test-tool find-pack --check-count=0 HEAD:five.t
+'
+
+test_expect_success 'add more commits (as loose objects)' '
+	test_commit six &&
+	test_commit seven &&
+
+	test -z "$(test-tool find-pack HEAD)" &&
+	test -z "$(test-tool find-pack HEAD:six.t)" &&
+	test-tool find-pack --check-count 0 HEAD &&
+	test-tool find-pack -c 0 HEAD:six.t &&
+	! test-tool find-pack -c 1 HEAD:seven.t
+'
+
+test_done
-- 
2.42.0.rc0.8.g76fac86b0e


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v4 3/8] repack: refactor finishing pack-objects command
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
  2023-08-08  8:26       ` [PATCH v4 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-08-08  8:26       ` [PATCH v4 2/8] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-08  8:26       ` [PATCH v4 4/8] repack: refactor finding pack prefix Christian Couder
                         ` (6 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting it into a string_list, as well as
calling finish_command().
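
For readers unfamiliar with this code, the loop being factored out can be
sketched in isolation roughly as follows. This is only an approximation
under stated assumptions: `collect_pack_names` and `TOY_HEXSZ` are
hypothetical names, a fixed 40-character hash length is assumed, and an
error code is returned where the real code calls die().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_HEXSZ 40	/* SHA-1 hex length; an assumption of this sketch */

/*
 * Read one pack hash per line from `out`. Returns the number of names
 * stored (0 when `local` is false, since packs written outside the
 * repository are not recorded), or -1 if a line is not TOY_HEXSZ long.
 */
static int collect_pack_names(FILE *out, int local,
			      char names[][TOY_HEXSZ + 1], int max_names)
{
	char line[256];
	int nr = 0;

	assert(out);
	while (fgets(line, sizeof(line), out)) {
		size_t len = strcspn(line, "\n");

		line[len] = '\0';
		if (len != TOY_HEXSZ)
			return -1;
		if (local && nr < max_names)
			strcpy(names[nr++], line);
	}
	return nr;
}
```

Note that the length check applies even when `local` is false, which
matches the order of the checks in the refactored function.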

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index aea5ca9d44..96af2d1caf 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -696,6 +696,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -705,9 +735,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct string_list *existing_kept_packs)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -751,27 +780,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -782,10 +791,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1016,18 +1023,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.42.0.rc0.8.g76fac86b0e


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v4 4/8] repack: refactor finding pack prefix
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (2 preceding siblings ...)
  2023-08-08  8:26       ` [PATCH v4 3/8] repack: refactor finishing pack-objects command Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-09 21:20         ` Taylor Blau
  2023-08-08  8:26       ` [PATCH v4 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
                         ` (5 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.
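
The prefix computation itself can be sketched without Git's helpers as
below. The name `toy_find_pack_prefix` is hypothetical, strncmp() stands
in for skip_prefix(), and NULL is returned where the real helper dies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the part of `packtmp` that follows `packdir`, minus one leading
 * '/', or NULL if `packtmp` does not start with `packdir`.
 */
static const char *toy_find_pack_prefix(const char *packdir,
					const char *packtmp)
{
	size_t len;

	assert(packdir && packtmp);
	len = strlen(packdir);
	if (strncmp(packtmp, packdir, len))
		return NULL;
	packtmp += len;
	if (*packtmp == '/')
		packtmp++;
	return packtmp;
}
```

For example, with the illustrative values `objects/pack` and
`objects/pack/.tmp-123-pack`, the returned prefix is `.tmp-123-pack`.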

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 96af2d1caf..4e40f4c04e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -783,6 +783,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(char *packdir, char *packtmp)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1031,12 +1042,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix(packdir, packtmp);
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.42.0.rc0.8.g76fac86b0e


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v4 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (3 preceding siblings ...)
  2023-08-08  8:26       ` [PATCH v4 4/8] repack: refactor finding pack prefix Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-09 21:40         ` Taylor Blau
  2023-08-08  8:26       ` [PATCH v4 6/8] gc: add `gc.repackFilter` config option Christian Couder
                         ` (4 subsequent siblings)
  9 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them to separate storage that is cheaper, though
slower.

It's possible to find which new packfile contains the filtered out
objects using one of the following:

  - `git verify-pack -v ...`,
  - `test-tool find-pack ...`, which a previous commit added,
  - `--filter-to=<dir>`, which a following commit will add to specify
    where the pack containing the filtered out objects will be.

This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass all the previously existing packs on its
stdin, so that it packs all the objects they contain. We also pass on
its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs are
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
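
To make the shape of that stdin input concrete, here is a small
standalone sketch of how the list could be written. The function name
`write_stdin_packs` and the array-based parameters are hypothetical
(the patch uses string_list); only the line formats and the handling of
the '^' prefix follow write_filtered_pack() below.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the pack list for the second pack-objects run to `in`.
 * A leading '^' excludes the objects of that pack from the result.
 */
static void write_stdin_packs(FILE *in, const char *pack_prefix,
			      const char *const *new_hashes, size_t nr_new,
			      const char *const *existing, size_t nr_existing,
			      const char *const *kept, size_t nr_kept,
			      int pack_kept_objects)
{
	const char *caret = pack_kept_objects ? "" : "^";
	size_t i;

	assert(in && pack_prefix);
	/* Packs just written by the filtered run: always excluded. */
	for (i = 0; i < nr_new; i++)
		fprintf(in, "^%s-%s.pack\n", pack_prefix, new_hashes[i]);
	/* Previously existing non-kept packs: included. */
	for (i = 0; i < nr_existing; i++)
		fprintf(in, "%s.pack\n", existing[i]);
	/* Kept packs: excluded unless --pack-kept-objects is in effect. */
	for (i = 0; i < nr_kept; i++)
		fprintf(in, "%s%s.pack\n", caret, kept[i]);
}
```

The one case where kept packs are not excluded is `--pack-kept-objects`,
which is why an object can then end up in both the kept pack and the new
pack of filtered out objects.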

As the interactions with kept packs are a bit tricky, a few related
tests are added.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  12 ++++
 builtin/repack.c             |  75 ++++++++++++++++++++
 t/t7700-repack.sh            | 134 +++++++++++++++++++++++++++++++++++
 3 files changed, 221 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..6d5bec7716 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,18 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  Also `--no-write-bitmap-index` (or the
+	`repack.writebitmaps` config option set to `false`) should be
+	used; otherwise writing the bitmap index will fail, as it assumes
+	a single packfile containing all the objects. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 4e40f4c04e..876c115cdc 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -21,6 +21,7 @@
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "refs.h"
+#include "list-objects-filter-options.h"
 
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
@@ -57,6 +58,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	struct list_objects_filter_options filter_options;
 };
 
 static int repack_config(const char *var, const char *value,
@@ -726,6 +728,57 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct string_list *keep_pack_list,
+			       struct string_list *names,
+			       struct string_list *existing_packs,
+			       struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret, i;
+	const char *caret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	prepare_pack_objects(&cmd, args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	if (!pack_kept_objects)
+		strvec_push(&cmd.args, "--honor-pack-keep");
+	for (i = 0; i < keep_pack_list->nr; i++)
+		strvec_pushf(&cmd.args, "--keep-pack=%s",
+			     keep_pack_list->items[i].string);
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * Here 'names' contains only the pack(s) that were just
+	 * written, which is exactly the packs we want to keep. Also
+	 * 'existing_kept_packs' already contains the packs in
+	 * 'keep_pack_list'.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	caret = pack_kept_objects ? "" : "^";
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s%s.pack\n", caret, item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -858,6 +911,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_PARSE_LIST_OBJECTS_FILTER(&po_args.filter_options),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -871,6 +925,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
+	list_objects_filter_init(&po_args.filter_options);
+	list_objects_filter_init(&cruft_po_args.filter_options);
+
 	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
@@ -1011,6 +1068,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (po_args.filter_options.choice)
+		strvec_pushf(&cmd.args, "--filter=%s",
+			     expand_list_objects_filter_spec(&po_args.filter_options));
+
 	if (geometry)
 		cmd.in = -1;
 	else
@@ -1097,6 +1158,18 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter_options.choice) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(packdir, packtmp),
+					  &keep_pack_list,
+					  &names,
+					  &existing_nonkept_packs,
+					  &existing_kept_packs);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
@@ -1231,6 +1304,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&existing_nonkept_packs, 0);
 	string_list_clear(&existing_kept_packs, 0);
 	clear_pack_geometry(geometry);
+	list_objects_filter_release(&po_args.filter_options);
+	list_objects_filter_release(&cruft_po_args.filter_options);
 
 	return ret;
 }
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 27b66807cd..5d3e53134c 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -327,6 +327,140 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack -c 1 HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack -c 1 HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
+test_expect_success '--filter fails with --write-bitmap-index' '
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
+		-a -d --write-bitmap-index --filter=blob:none
+'
+
+test_expect_success 'repacking with two filters works' '
+	git init two-filters &&
+	(
+		cd two-filters &&
+		mkdir subdir &&
+		test_commit foo &&
+		test_commit subdir_bar subdir/bar &&
+		test_commit subdir_baz subdir/baz
+	) &&
+	git clone --no-local --bare two-filters two-filters.git &&
+	(
+		cd two-filters.git &&
+		test_stdout_line_count = 1 ls objects/pack/*.pack &&
+		git -c repack.writebitmaps=false repack -a -d \
+			--filter=blob:none --filter=tree:1 &&
+		test_stdout_line_count = 2 ls objects/pack/*.pack &&
+		commit_pack=$(test-tool find-pack -c 1 HEAD) &&
+		blob_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		root_tree_pack=$(test-tool find-pack -c 1 HEAD^{tree}) &&
+		subdir_tree_hash=$(git ls-tree --object-only HEAD -- subdir) &&
+		subdir_tree_pack=$(test-tool find-pack -c 1 "$subdir_tree_hash") &&
+
+		# Root tree and subdir tree are not in the same packfiles
+		test "$commit_pack" != "$blob_pack" &&
+		test "$commit_pack" = "$root_tree_pack" &&
+		test "$blob_pack" = "$subdir_tree_pack"
+	)
+'
+
+prepare_for_keep_packs () {
+	git init keep-packs &&
+	(
+		cd keep-packs &&
+		test_commit foo &&
+		test_commit bar
+	) &&
+	git clone --no-local --bare keep-packs keep-packs.git &&
+	(
+		cd keep-packs.git &&
+
+		# Create two packs
+		# The first pack will contain all of the objects except one blob
+		git rev-list --objects --all >objs &&
+		grep -v "bar.t" objs | git pack-objects pack &&
+		# The second pack will contain the excluded object and be kept
+		packid=$(grep "bar.t" objs | git pack-objects pack) &&
+		>pack-$packid.keep &&
+
+		# Replace the existing pack with the 2 new ones
+		rm -f objects/pack/pack* &&
+		mv pack-* objects/pack/
+	)
+}
+
+test_expect_success '--filter works with .keep packs' '
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack_1=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		# Object bar is still only in the old .keep pack
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$bar_pack_1" = "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		test "$foo_pack_1" != "$bar_pack_1" &&
+		test "$foo_pack_1" != "$head_pack_1" &&
+		test "$bar_pack_1" != "$head_pack_1"
+	)
+'
+
+test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
+	rm -rf keep-packs keep-packs.git &&
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--pack-kept-objects &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		test-tool find-pack -c 2 HEAD:bar.t >bar_pack_1 &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$foo_pack_1" != "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		# Object bar is in both the old .keep pack and the new
+		# pack that contained the filtered out objects
+		grep "$bar_pack" bar_pack_1 &&
+		grep "$foo_pack_1" bar_pack_1 &&
+		test "$foo_pack_1" != "$head_pack_1"
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.rc0.8.g76fac86b0e


* [PATCH v4 6/8] gc: add `gc.repackFilter` config option
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (4 preceding siblings ...)
  2023-08-08  8:26       ` [PATCH v4 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-08  8:26       ` [PATCH v4 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                         ` (3 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

Users might want to perform such a cleanup regularly at the same time as
they perform other repacks and cleanups, so as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack and a non-empty <filter-spec> is
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 13 +++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index ca47eb2008..2153bde7ac 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -145,6 +145,11 @@ Multiple hooks are supported, but all must exit successfully, else the
 operation (either generating a cruft pack or unpacking unreachable
 objects) will be halted.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 19d73067aa..9b0984f301 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -61,6 +61,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -170,6 +171,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -355,6 +358,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..232e403b66 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,19 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.rc0.8.g76fac86b0e


* [PATCH v4 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (5 preceding siblings ...)
  2023-08-08  8:26       ` [PATCH v4 6/8] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-08  8:26       ` [PATCH v4 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
                         ` (2 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

It would be nice if this new different pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage.

Even in a different directory, this pack can be accessible if, for
example, the Git alternates mechanism is used to point to it. In fact
not using the Git alternates mechanism can corrupt a repo as the
generated pack containing the filtered objects might not be accessible
from the repo any more. So setting up the Git alternates mechanism
should be done before using this feature if the user wants the repo to
be fully usable while this feature is used.

In some cases, like when a repo has just been cloned or when there is no
other activity in the repo, it's OK to set up the Git alternates
mechanism afterwards though. It's also OK to just inspect the generated
packfile containing the filtered objects and then just move it into the
'.git/objects/pack/' directory manually. That's why it's not necessary
for this command to check that the Git alternates mechanism has
already been set up.

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt | 11 +++++++
 builtin/repack.c             | 10 +++++-
 t/t7700-repack.sh            | 62 ++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 6d5bec7716..8545a32667 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,17 @@ depth is 4095.
 	a single packfile containing all the objects. See
 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. Only useful with `--filter`. This can be
+	used for putting the pack on a separate object directory that
+	is accessed through the Git alternates mechanism. **WARNING:**
+	If the packfile containing the filtered out objects is not
+	accessible, the repo can become corrupt as it might not be
+	possible to access the objects in that packfile. See the
+	`objects` and `objects/info/alternates` sections of
+	linkgit:gitrepository-layout[5].
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 876c115cdc..f5bc650c1e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -870,6 +870,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -922,6 +923,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1071,6 +1074,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (po_args.filter_options.choice)
 		strvec_pushf(&cmd.args, "--filter=%s",
 			     expand_list_objects_filter_spec(&po_args.filter_options));
+	else if (filter_to)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
 
 	if (geometry)
 		cmd.in = -1;
@@ -1159,8 +1164,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter_options.choice) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(packdir, packtmp),
 					  &keep_pack_list,
 					  &names,
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 5d3e53134c..9b1e189a62 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -461,6 +461,68 @@ test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
 	)
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 0 HEAD:file1) &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack -c 1 $blob_hash) &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar) &&
+		base_pack=$(test-tool find-pack -c 1 HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.rc0.8.g76fac86b0e


* [PATCH v4 8/8] gc: add `gc.repackFilterTo` config option
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (6 preceding siblings ...)
  2023-08-08  8:26       ` [PATCH v4 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-08-08  8:26       ` Christian Couder
  2023-08-09 21:45       ` [PATCH v4 0/8] Repack objects into separate packfiles based on a filter Taylor Blau
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:26 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit has implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack and a non-empty <dir> is
configured through this option, the repack process is passed a
corresponding `--filter-to=<dir>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt | 11 +++++++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 2153bde7ac..466466d6cc 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -150,6 +150,17 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects. **WARNING:** The
+	specified location should be accessible, using for example the
+	Git alternates mechanism, otherwise the repo could be
+	considered corrupt by Git as it might not be able to access the
+	objects in that packfile. See the `--filter-to=<dir>` option
+	of linkgit:git-repack[1] and the `objects/info/alternates`
+	section of linkgit:gitrepository-layout[5].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 9b0984f301..1b7c775d94 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -172,6 +173,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -361,6 +363,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 232e403b66..e412cf8daf 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -215,6 +214,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo stores filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.rc0.8.g76fac86b0e


* Re: [PATCH v3 2/8] t/helper: add 'find-pack' test-tool
  2023-07-25 22:44       ` Taylor Blau
@ 2023-08-08  8:28         ` Christian Couder
  0 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:28 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 26, 2023 at 12:44 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Jul 24, 2023 at 10:59:03AM +0200, Christian Couder wrote:
> > ---
> >  Makefile                  |  1 +
> >  t/helper/test-find-pack.c | 35 +++++++++++++++++++++++++++++++++++
> >  t/helper/test-tool.c      |  1 +
> >  t/helper/test-tool.h      |  1 +
> >  4 files changed, 38 insertions(+)
> >  create mode 100644 t/helper/test-find-pack.c
>
> Everything that you wrote here seems reasonable to me, and the
> implementation of the new test tool is very straightforward.
>
> I'm pretty sure that everything here is correct, and we'll implicitly
> test the behavior of the new helper in following patches.
>
> That said, I think that it might be prudent here to "test the tests" and
> write a simple test script that exercises this test helper over a more
> trivial case. There is definitely prior art for testing our helpers
> directly in the t00?? tests.

OK, I have written a new t0080-find-pack.sh test script for this in
version 4, which I just sent.

I have also changed `test-tool find-pack` so that it now accepts a
`--check-count <n>` option. This addresses some of your comments on
another patch in the previous version of this series. As the code is
now a bit more complex, there is more justification for a test script.

Thanks.

* Re: [PATCH v3 4/8] repack: refactor finding pack prefix
  2023-07-25 22:47       ` Taylor Blau
@ 2023-08-08  8:29         ` Christian Couder
  0 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:29 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Wed, Jul 26, 2023 at 12:47 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Jul 24, 2023 at 10:59:05AM +0200, Christian Couder wrote:
> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 96af2d1caf..21e3b89f27 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -783,6 +783,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
> >       return finish_pack_objects_cmd(&cmd, names, local);
> >  }
> >
> > +static const char *find_pack_prefix(void)
> > +{
> > +     const char *pack_prefix;
> > +     if (!skip_prefix(packtmp, packdir, &pack_prefix))
>
> I wonder if this might be a good opportunity to pass "packtmp" and
> "packdir" as arguments to the function. I know that these are globals,
> but it at least nudges us in the right direction away from adding more
> global variables.

I have changed this in version 4, which I just sent. Now "packtmp" and
"packdir" are passed as arguments to the function, as you suggested.
Thanks.

* Re: [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-25 23:04       ` Taylor Blau
@ 2023-08-08  8:34         ` Christian Couder
  2023-08-09 21:12           ` Taylor Blau
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:34 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 26, 2023 at 1:04 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Jul 24, 2023 at 10:59:06AM +0200, Christian Couder wrote:

> > +     /*
> > +      * names has a confusing double use: it both provides the list
> > +      * of just-written new packs, and accepts the name of the
> > +      * filtered pack we are writing.
> > +      *
> > +      * By the time it is read here, it contains only the pack(s)
> > +      * that were just written, which is exactly the set of packs we
> > +      * want to consider kept.
> > +      */
>
> I think that this comment partially comes from the cruft pack code,
> where we use the `names` string list both to reference existing packs at
> the start of the repack, and to keep track of the pack we just wrote (to
> exclude its contents from the cruft pack).
>
> But I think we only write into "names" via finish_pack_objects_cmd() to
> record the name of the pack we just wrote containing objects which
> didn't meet the filter's conditions.
>
> So I think that leaving this comment in is OK, but TBH I was on the
> fence when I wrote that back in f9825d1cf75 (builtin/repack.c: support
> generating a cruft pack, 2022-05-20), so I would just as soon drop it.

I made the comment smaller in version 4.

> > +     in = xfdopen(cmd.in, "w");
> > +     for_each_string_list_item(item, names)
> > +             fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
> > +     for_each_string_list_item(item, existing_packs)
> > +             fprintf(in, "%s.pack\n", item->string);
>
> > +     for_each_string_list_item(item, existing_kept_packs)
> > +             fprintf(in, "^%s.pack\n", item->string);
>
> I think we may only want to do this if `honor_pack_keep` is zero.
> Otherwise we'd avoid packing objects that appear in kept packs, even if
> the caller told us to include objects found in kept packs.

In version 4 I have made changes to better support kept packfiles and
related options, including adding tests.

> > @@ -858,6 +912,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >                               N_("limits the maximum number of threads")),
> >               OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
> >                               N_("maximum size of each packfile")),
> > +             OPT_STRING(0, "filter", &po_args.filter, N_("args"),
> > +                             N_("object filtering")),
>
> I suppose we're storing the filter as a string here because we're just
> going to pass it down to pack-objects directly. That part makes sense,
> but I think we are producing subtly inconsistent behavior when
> specifying multiple --filter options.
>
> IIRC, passing --filter more than once down to pack-objects produces a
> filter whose objects match all of the individually specified
> sub-filters. But IIUC, using OPT_STRING here means that later
> `--filter`'s override earlier ones.
>
> So I think at minimum we'd want to store the filter arguments in a
> strvec. But I would probably just as soon parse them into a bona-fide
> list_objects_filter_options struct, and then reconstruct the arguments
> to pack-objects based on that.

In version 4 a `list_objects_filter_options` struct is now used, and
there is a test to check that more than one `--filter=<filter-spec>`
option is supported.

> > +     git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
> > +     test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
> > +     commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
> > +     test -n "$commit_pack" &&
>
> I wonder if the test-tool itself should exit with a non-zero code if it
> can't find the given object in any pack. It would at least allow us to
> drop the "test -n $foo" after every invocation of the test-helper in
> this test.
>
> Arguably callers may want to ensure that an object doesn't exist in any
> pack, and this would be inconvenient for them, since they'd have to
> write something like:
>
>     test_must_fail test-tool find-pack $obj
>
> but I think a more direct test like
>
>     test_must_fail git cat-file -t $obj
>
> would do just as well.

Thanks for these suggestions, but I preferred to add the `--check-count
<n>` option to `test-tool find-pack` in version 4.

This way `--check-count 0` or `-c 0` for short can be used to check
that an object is in no packfile, though it could be for example in a
promisor remote or a loose object file. It's also nice to be able to
check that an object is in exactly 2 packfiles in some cases.
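
For illustration, the kind of count that `--check-count` checks can be
approximated outside the test suite with plain Git commands. This is only a
sketch, not the helper's implementation: `count_packs_with_object` is a
hypothetical name, and it scans only the local `objects/pack` directory, so
packs reachable through alternates are not counted.

```shell
# Hypothetical helper: print how many local packfiles contain the object
# named by revision "$1". Only the local objects/pack directory is scanned.
count_packs_with_object () {
	oid=$(git rev-parse --verify "$1") || return 1
	packdir=$(git rev-parse --git-path objects/pack) || return 1
	count=0
	for idx in "$packdir"/pack-*.idx
	do
		# Skip the unexpanded glob when there is no pack at all
		test -f "$idx" || continue
		# show-index prints "<offset> <oid> (<crc>)" for each object
		if git show-index <"$idx" | cut -d" " -f2 | grep -q -x "$oid"
		then
			count=$((count + 1))
		fi
	done
	echo "$count"
}
```

A caller could then compare the printed number with 0, 1 or 2, much like
`-c 0`, `-c 1` or `-c 2` do in the tests.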

> > +     blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
> > +     test -n "$blob_pack" &&
> > +     test "$commit_pack" != "$blob_pack" &&
> > +     tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) &&
> > +     test "$tree_pack" = "$commit_pack" &&
> > +     blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
> > +     test "$blob_pack2" = "$blob_pack"
> > +'
>
> This all looks good, but I think there are a couple of more things that
> we'd want to test for here:
>
>   - That the list of all objects appears the same before and after all
>     of the repacking. I think that this is tested implicitly already in
>     your test, but having it written down explicitly would harden this
>     against regressions that cause us to inadvertently delete an object
>     we shouldn't have.

I don't think we need to test this. `git pack-objects
--filter=<filter-spec>` already existed before this series and is
tested elsewhere. We can trust that command and its tests, and just
check that we used it correctly by checking that only a few objects
are in the right packfiles.

>     (FWIW, I think this would be limited to running something like "git
>     cat-file --batch-check='%(objectname)' --batch-all-objects" before
>     and after all of the repacking, and ensuring that the two test_cmp
>     without failure).

I agree that it would not be difficult to do. I just think it's not necessary.
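
For reference, the before/after comparison discussed above could be sketched
roughly as follows, using only commands from stock Git. The function name
`repack_preserves_objects` and the file names `before` and `after` are
arbitrary, and plain `cmp` stands in for the test suite's `test_cmp`.

```shell
# Hypothetical helper: succeed if "git repack -a -d" (plus any extra
# options given as arguments) leaves the list of objects unchanged.
repack_preserves_objects () {
	# --batch-all-objects lists every object in the repository, sorted by hash
	git cat-file --batch-check="%(objectname)" --batch-all-objects >before &&
	git repack -a -d -q "$@" &&
	git cat-file --batch-check="%(objectname)" --batch-all-objects >after &&
	cmp before after
}
```

Extra options such as `--filter=blob:none` can be passed as arguments when the
installed Git supports them.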

>   - Another thing that I don't think we're testing here is that objects
>     that *don't* match the filter don't appear in one of the filtered
>     packs.

In version 4 we do test that for some objects, as `test-tool find-pack
-c 1 $object` would error out if the object is in more than one
packfile.

> I think we'd probably want to assert on the exact contents of
>     the pack by dumping the list of objects into a file like "expect",
>     and then dumping the actual set of objects with "git show-index
>     <$idx | cut -d' ' -f2" or something.
>
> Another thought from the OPT_STRING business above is that we probably
> want to test this with non-trivial filter arguments. There are probably
> a handful of interesting cases here, like passing `--no-filter`, passing
> `--filter` multiple times, passing invalid values for `--filter`, etc.

In version 4 there is one test passing `--filter=...` multiple times.
I think this is enough, as the `list_objects_filter_options` struct
and related functions and mechanisms are tested elsewhere already.

> > +test_expect_success '--filter fails with --write-bitmap-index' '
> > +     test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
> > +             --filter=blob:none &&
>
> Do we want to ensure that we get the exit code corresponding with
> showing the usage text? I could go either way, but I do think that we
> should grep through the output on stderr to ensure that we get the
> appropriate error message.

I am not sure that testing the exit code and the stderr output is
always needed. Here I think that this test is more for documentation
purposes than for really enforcing something important. In fact, if the
behavior changed so that `--write-bitmap-index` understood that
it should write an MIDX instead of a regular index, that behavior
change could in some ways be considered an improvement, and we would
only need to remove 'test_must_fail' here.

> > +     git -C bare.git repack -a -d --no-write-bitmap-index \
> > +             --filter=blob:none
>
> I don't think that this test is adding anything that the above
> "repacking with a filter works" test isn't covering already.

Ok, I have removed it in version 4.

Thanks!

* Re: [PATCH v3 6/8] gc: add `gc.repackFilter` config option
  2023-07-25 23:07       ` Taylor Blau
@ 2023-08-08  8:38         ` Christian Couder
  2023-08-09 21:15           ` Taylor Blau
  0 siblings, 1 reply; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:38 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 26, 2023 at 1:07 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Jul 24, 2023 at 10:59:07AM +0200, Christian Couder wrote:
> > A previous commit has implemented `git repack --filter=<filter-spec>` to
> > allow users to filter out some objects from the main pack and move them
> > into a new different pack.
> >
> > Users might want to perform such a cleanup regularly at the same time as
> > they perform other repacks and cleanups, so as part of `git gc`.
> >
> > Let's allow them to configure a <filter-spec> for that purpose using a
> > new gc.repackFilter config option.
>
> Makes sense.
>
> > Now when `git gc` will perform a repack with a <filter-spec> configured
> > through this option and not empty, the repack process will be passed a
> > corresponding `--filter=<filter-spec>` argument.
>
> I may be missing something, but what happens if the user has configured
> gc.repackFilter, but passes additional filters over the command-line
> arguments? I'm not sure whether these should be AND'd with the existing
> filters in config, or if they should reset them to zero, or something
> else.

`git gc` doesn't recognize `--filter=<...>` arguments; only `git
repack` is being taught to recognize them in this patch series. So I
don't see how there could be multiple such arguments on the command
line when `git gc` is used.

Also in version 4 `git repack` can be passed many such arguments
anyway. So I think we are good.

We could support multiple gc.repackFilter config options, but on the
other hand using something like
`combine:<filter1>+<filter2>+...<filterN>` should work, as the content
of the option is passed as-is to the command line. So we can leave
that improvement for later if people don't like the `combine:...` and
are interested in it.
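
To illustrate the `combine:...` form, here is a minimal sketch. It
assumes a Git version that includes this series (gc.repackFilter is
introduced here); the throwaway repository and the filter values are
only illustrative:

```shell
# Throwaway repository so that no real configuration is touched
repo=$(mktemp -d) &&
git init -q "$repo" &&

# One config value carrying two filters; `git gc` would pass it as-is
# to `git repack --filter=...` (the filter values are illustrative)
git -C "$repo" config gc.repackFilter "combine:blob:limit=1m+tree:3" &&

# Read the value back to see what would be passed along
filter=$(git -C "$repo" config --get gc.repackFilter) &&
echo "$filter"
```

As the value is a single string, no multi-valued config handling is
needed for this to work.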

> Regardless, I think it would be beneficial to users if we spelled this
> out in git-gc(1) instead of just this patch message here.

I am not sure what should be spelled out. I think we refer people to
the `repack --filter=...` option which in turn refers to the `rev-list
--filter=...` which contains a good amount of documentation about how
`--filter=...` works, including the fact that `combine:...` can be
used and that multiple `--filter=...` options can be passed.

> > diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> > index 69509d0c11..5b89faf505 100755
> > --- a/t/t6500-gc.sh
> > +++ b/t/t6500-gc.sh
> > @@ -202,6 +202,18 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
> >       grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
> >  '
> >
> > +test_expect_success 'gc.repackFilter launches repack with a filter' '
> > +     test_when_finished "rm -rf bare.git" &&
> > +     git clone --no-local --bare . bare.git &&
> > +
> > +     git -C bare.git -c gc.cruftPacks=false gc &&
> > +     test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
> > +
> > +     GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
>
> Nit: can we wrap this across multiple lines?

Done in version 4.

> > +     test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
> > +     grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
> > +'
>
> I think the `test_subcommand` helper might work here, and it would allow
> you to avoid writing a long grep invocation.

Other tests related to gc.reflogExpire above use a grep invocation
similar to this one, while `test_subcommand` isn't used in the test
script, so I think the grep invocation makes the whole script a bit
easier to understand.

Thanks.

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-07-25 23:08               ` Junio C Hamano
@ 2023-08-08  8:45                 ` Christian Couder
  2023-08-09 20:38                   ` Taylor Blau
  2023-08-09 22:50                   ` Junio C Hamano
  0 siblings, 2 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-08  8:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Jul 26, 2023 at 1:09 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Thanks for walking through the codepaths involved.  We are good
> > then.
>
> Sorry, but not so fast.
>
> https://github.com/git/git/actions/runs/5661445152 (seen with this topic)
> https://github.com/git/git/actions/runs/5662517690 (seen w/o this topic)
>
> The former fails t7700 in the linux-TEST-vars job, while the latter
> passes the same job.

I think this was because I added the following test:

+test_expect_success '--filter fails with --write-bitmap-index' '
+    test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
+        --filter=blob:none &&
+
+    git -C bare.git repack -a -d --no-write-bitmap-index \
+        --filter=blob:none
+'

which fails because in the linux-TEST-vars job the
GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP env variable is set to 1 and
this counteracts the `--write-bitmap-index` option.

I have tried to fix it like this:

+test_expect_success '--filter fails with --write-bitmap-index' '
+    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
+        -a -d --write-bitmap-index --filter=blob:none
+'

but I haven't been able to check that this works on CI as all the jobs
seem to fail these days before they even start:

https://github.com/chriscool/git/actions/runs/5791544404/job/15696524676

Thanks!

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-08  8:45                 ` Christian Couder
@ 2023-08-09 20:38                   ` Taylor Blau
  2023-08-09 22:50                   ` Junio C Hamano
  1 sibling, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 20:38 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Tue, Aug 08, 2023 at 10:45:48AM +0200, Christian Couder wrote:
> On Wed, Jul 26, 2023 at 1:09 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> > > Thanks for walking through the codepaths involved.  We are good
> > > then.
> >
> > Sorry, but not so fast.
> >
> > https://github.com/git/git/actions/runs/5661445152 (seen with this topic)
> > https://github.com/git/git/actions/runs/5662517690 (seen w/o this topic)
> >
> > The former fails t7700 in the linux-TEST-vars job, while the latter
> > passes the same job.
>
> I think this was because I added the following test:
>
> +test_expect_success '--filter fails with --write-bitmap-index' '
> +    test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
> +        --filter=blob:none &&
> +
> +    git -C bare.git repack -a -d --no-write-bitmap-index \
> +        --filter=blob:none
> +'
>
> which fails because in the linux-TEST-vars job the
> GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP env variable is set to 1 and
> this counteracts the `--write-bitmap-index` option.

Makes sense. That linux-TEST-vars job always seems to get me, too.

(As an aside, and definitely not related to your patch here, I wonder if
we should consider dropping some of the older TEST variables that belong
to features that we no longer consider experimental).

> I have tried to fix it like this:
>
> +test_expect_success '--filter fails with --write-bitmap-index' '
> +    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
> +        -a -d --write-bitmap-index --filter=blob:none
> +'
>
> but I haven't been able to check that this works on CI as all the jobs
> seem to fail these days before they even start:

I think the canonical way to do this is with env, like so:

    test_must_fail env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
      git -C bare.git repack -ad --write-bitmap-index --filter=blob:none 2>err

Thanks,
Taylor

* Re: [PATCH v3 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-08  8:34         ` Christian Couder
@ 2023-08-09 21:12           ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 21:12 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Tue, Aug 08, 2023 at 10:34:25AM +0200, Christian Couder wrote:
> > > +     git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
> > > +     test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
> > > +     commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
> > > +     test -n "$commit_pack" &&
> >
> > I wonder if the test-tool itself should exit with a non-zero code if it
> > can't find the given object in any pack. It would at least allow us to
> > drop the "test -n $foo" after every invocation of the test-helper in
> > this test.
> >
> > Arguably callers may want to ensure that an object doesn't exist in any
> > pack, and this would be inconvenient for them, since they'd have to
> > write something like:
> >
> >     test_must_fail test-tool find-pack $obj
> >
> > but I think a more direct test like
> >
> >     test_must_fail git cat-file -t $obj
> >
> > would do just as well.
>
> Thanks for these suggestions, but I preferred to add the `--check-count
> <n>` option to `test-tool find-pack` in version 4.
>
> This way `--check-count 0` or `-c 0` for short can be used to check
> that an object is in no packfile, though it could be for example in a
> promisor remote or a loose object file. It's also nice to be able to
> check that an object is in exactly 2 packfiles in some cases.

"--check-count 0" is a nice approach, thanks!
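
Outside of Git's own build tree, where `test-tool` is not available, a
similar count can be approximated with plumbing commands. This is only
a sketch of the idea and not how `find-pack` is implemented; the
throwaway repository and the variable names are assumptions:

```shell
# Throwaway repository with a single commit, packed into one packfile
tmp=$(mktemp -d) &&
git init -q "$tmp/repo" &&
git -C "$tmp/repo" -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m initial &&
git -C "$tmp/repo" repack -a -d -q &&
oid=$(git -C "$tmp/repo" rev-parse HEAD) &&

# Count the packfiles whose index lists the object
count=0
for idx in "$tmp/repo"/.git/objects/pack/*.idx
do
	if git verify-pack -v "$idx" | grep -q "^$oid "
	then
		count=$((count + 1))
	fi
done
echo "$count"
```

Comparing the resulting count against an expected number gives roughly
what `--check-count <n>` checks, including the "in no packfile" case
when the expected number is 0.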

> > This all looks good, but I think there are a couple of more things that
> > we'd want to test for here:
> >
> >   - That the list of all objects appears the same before and after all
> >     of the repacking. I think that this is tested implicitly already in
> >     your test, but having it written down explicitly would harden this
> >     against regressions that cause us to inadvertently delete an object
> >     we shouldn't have.
>
> I don't think we need to test this. `git pack-objects
> --filter=<filter-spec>` already existed before this series and is
> tested elsewhere. We can trust that command and its tests, and just
> check that we used it correctly by checking that only a few objects
> are in the right packfiles.

Yeah, I don't think we should be worried about whether or not
pack-objects is doing the right thing here: I agree that we have
sufficient coverage for that elsewhere throughout the test suite. I was
more concerned with catching bugs or regressions at the 'repack' layer.

But you're more familiar with these changes than I am, so I trust your
judgement.

> > > +test_expect_success '--filter fails with --write-bitmap-index' '
> > > +     test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
> > > +             --filter=blob:none &&
> >
> > Do we want to ensure that we get the exit code corresponding with
> > showing the usage text? I could go either way, but I do think that we
> > should grep through the output on stderr to ensure that we get the
> > appropriate error message.
>
> I am not sure that testing the exit code and the stderr output is
> always needed. Here I think that this test is more for documentation
> purposes than really enforcing something important. In fact if the
> behavior would change and `--write-bitmap-index` would understand that
> it should write an MIDX instead of a regular index, that behavior
> change could be considered in some ways as an improvement and we would
> only need to remove 'test_must_fail' here.

I don't feel that strongly about it, TBH, I think I was more commenting
on the fact that we seem to have many of these tests that go

    test_must_fail git <some arguments that don't go together> 2>err &&
    grep "appropriate error message" err

throughout the suite. I don't feel strongly enough to suggest that we
add more for this specific purpose.

Thanks,
Taylor

* Re: [PATCH v3 6/8] gc: add `gc.repackFilter` config option
  2023-08-08  8:38         ` Christian Couder
@ 2023-08-09 21:15           ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 21:15 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Tue, Aug 08, 2023 at 10:38:26AM +0200, Christian Couder wrote:
> > I may be missing something, but what happens if the user has configured
> > gc.repackFilter, but passes additional filters over the command-line
> > arguments? I'm not sure whether these should be AND'd with the existing
> > filters in config, or if they should reset them to zero, or something
> > else.
>
> `git gc` doesn't recognize `--filter=<...>` arguments, only `git
> repack` is being taught to recognize it in this patch series. So I
> don't see how there could be multiple such arguments on the command
> line when `git gc` is used.
>
> Also in version 4 `git repack` can be passed many such arguments
> anyway. So I think we are good.

Ah, thanks. Sorry for the misunderstanding :-).

> We could support multiple gc.repackFilter config options, but on the
> other hand using something like
> `combine:<filter1>+<filter2>+...<filterN>` should work, as the content
> of the option is passed as-is to the command line. So we can leave
> that improvement for later if people don't like the `combine:...` and
> are interested in it.

I agree. To me it seems like there are probably relatively few people
who would want to specify a multi-valued configuration directly when
they could just use the "combine" trick you suggest. In either case, I
agree that it can be done on top later.

Thanks,
Taylor

* Re: [PATCH v4 2/8] t/helper: add 'find-pack' test-tool
  2023-08-08  8:26       ` [PATCH v4 2/8] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-08-09 21:18         ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 21:18 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Tue, Aug 08, 2023 at 10:26:02AM +0200, Christian Couder wrote:
> +	if (count > -1 && count != actual_count)
> +		die ("bad packfile count %d instead of %d", actual_count, count);

I think there is an extra space between "die" and the opening
parenthesis. But obviously not worth a reroll here.

> diff --git a/t/t0080-find-pack.sh b/t/t0080-find-pack.sh
> new file mode 100755
> index 0000000000..67b11216a3
> --- /dev/null
> +++ b/t/t0080-find-pack.sh
> @@ -0,0 +1,82 @@

These new tests look great, thanks for adding them :-).

Thanks,
Taylor

* Re: [PATCH v4 4/8] repack: refactor finding pack prefix
  2023-08-08  8:26       ` [PATCH v4 4/8] repack: refactor finding pack prefix Christian Couder
@ 2023-08-09 21:20         ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 21:20 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Tue, Aug 08, 2023 at 10:26:04AM +0200, Christian Couder wrote:
> Create a new find_pack_prefix() to refactor code that handles finding
> the pack prefix from the packtmp and packdir global variables, as we are
> going to need this feature again in the following commit.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  builtin/repack.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 96af2d1caf..4e40f4c04e 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -783,6 +783,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
>  	return finish_pack_objects_cmd(&cmd, names, local);
>  }
>
> +static const char *find_pack_prefix(char *packdir, char *packtmp)

I'm definitely nitpicking here, but I think that both of these could be
"const" to indicate that we're not modifying "packdir" or "packtmp".

But again, definitely not worth a reroll.

Thanks,
Taylor

* Re: [PATCH v4 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-08  8:26       ` [PATCH v4 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-08-09 21:40         ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 21:40 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

On Tue, Aug 08, 2023 at 10:26:05AM +0200, Christian Couder wrote:
> @@ -871,6 +925,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		OPT_END()
>  	};
>
> +	list_objects_filter_init(&po_args.filter_options);
> +	list_objects_filter_init(&cruft_po_args.filter_options);

Initializing `po_args`'s `filter_options` makes sense to me, but do we
ever use the cruft_po_args ones? From looking at this patch, I don't
think we do.

Initializing them and then calling list_objects_filter_release() on it
isn't wrong, but I can't tell whether initializing these is necessary
or not in the first place.

> +test_expect_success '--filter fails with --write-bitmap-index' '
> +	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
> +		-a -d --write-bitmap-index --filter=blob:none
> +'

I can't remember off-hand why, but I am pretty sure that we usually
write "test_must_fail env ..." before the command-line arguments instead
of setting the environment variable outside of the test_must_fail call.

I *think* that this is the same issue as the single-shot environment
variable assignment before function call thing that we see in some
shells.
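
A small sketch of the `env` form follows. The `must_fail` function is a
hypothetical stand-in for `test_must_fail`, and DEMO_VAR is a made-up
variable, so this only illustrates the pattern and not the real helper:

```shell
# Hypothetical stand-in for test_must_fail: succeed only if the
# wrapped command fails.
must_fail () {
	if "$@"
	then
		return 1
	fi
	return 0
}

unset DEMO_VAR

# The assignment is an argument to env(1), so it only reaches the child
# process and never depends on how the shell treats a one-shot
# assignment placed in front of a function call.
must_fail env DEMO_VAR=0 sh -c 'test "$DEMO_VAR" = 1'
status=$?

# The variable did not leak into the calling shell
leaked=${DEMO_VAR-unset}
echo "status=$status leaked=$leaked"
```

Because the variable is set by env and not by the shell, the wrapped
command sees it while the rest of the test script is unaffected.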

Regardless, I wonder if we should be catching this --filter +
--write-bitmap-index thing earlier. The error message I get when
running this is:

    warning: Failed to write bitmap index. Packfile doesn't have full closure (object ac3e272b72bbf89def8657766b855d0656630ed4 is missing)
    fatal: failed to write bitmap index

Which comes from deep within the pack-bitmap-write.c internals
(specifically in a failing call to `find_object_pos()`).

I don't think that's wrong per-se, but I wonder if catching the
combination earlier would allow us to carry on writing the pack even if
the caller erroneously specified that they wanted a bitmap, similar to
how we handle that combination with other options (see the comment in
builtin/pack-objects.c that starts with "'hard reasons not to use
bitmaps [...]'").

I doubt we'd see many naturally occurring instances of users running
"git repack" with both the --filter spec option and
--write-bitmap-index. But, I think that it would come up more often in
bare repositories, where writing reachability bitmaps is the default
state during repacking unless specified otherwise.

I suspect doing something like:

--- 8< ---
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 000ebec7ab..d75d122a86 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4431,7 +4431,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_bitmap_index = use_bitmap_index_default;

 	/* "hard" reasons not to use bitmaps; these just won't work at all */
-	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow(the_repository))
+	if (!use_internal_rev_list ||
+	    (!pack_to_stdout && write_bitmap_index) ||
+	    is_repository_shallow(the_repository) ||
+	    filter_options.choice)
 		use_bitmap_index = 0;

 	if (pack_to_stdout || !rev_list_all)
--- >8 ---

would do the trick (perhaps with a warning(), and the corresponding test
modification, but I think I could go either way on the warning() since
there isn't one there currently.)

I looked through the rest of the tests, and they all looked good to me,
thanks.

Thanks,
Taylor

* Re: [PATCH v4 0/8] Repack objects into separate packfiles based on a filter
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (7 preceding siblings ...)
  2023-08-08  8:26       ` [PATCH v4 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-08-09 21:45       ` Taylor Blau
  2023-08-09 21:57         ` Junio C Hamano
  2023-08-12  0:12         ` Christian Couder
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
  9 siblings, 2 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-09 21:45 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Tue, Aug 08, 2023 at 10:26:00AM +0200, Christian Couder wrote:
> # Changes since version 3
>
> Thanks to Junio who reviewed versions 1, 2 and 3, and to Taylor
> who reviewed versions 1 and 3! The changes are the following:

I took a look through the range-diff as well as the patches themselves
again (skimming through the last three, which are much more
straightforward than the preceding ones).

Everything looks good to me here, and I think that this version is ready
to get picked up once we're on the other side of 2.42.

I left a couple of comments throughout, but none of them merit a reroll
on their own. I think there are a couple of things we could easily
ignore (marking parameters as "const", etc.), and a couple of things
that we should probably take a look at after the dust has settled here.

We *may* want to fix up the test_must_fail invocation that has the
environment variable on the left-hand side instead of using
"test_must_fail env", but I don't know for sure.

I do think that we should take another look at disabling the bitmap
machinery when given `--filter`, but I think that, too, can be done in
another series.

Thanks again for being so patient with all of my review comments. I hope
it wasn't too big of a pain; this area feels very fragile (to me, at
least) so I wanted to give it an extra careful set of eyes.

Thanks,
Taylor

* Re: [PATCH v4 0/8] Repack objects into separate packfiles based on a filter
  2023-08-09 21:45       ` [PATCH v4 0/8] Repack objects into separate packfiles based on a filter Taylor Blau
@ 2023-08-09 21:57         ` Junio C Hamano
  2023-08-12  0:12         ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-08-09 21:57 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> We *may* want to fix up the test_must_fail invocation that has the
> environment variable on the left-hand side instead of using
> "test_must_fail env", but I don't know for sure.

Ah, that is a show-stopper bug.  We must fix it, but the necessary
change should be trivial.

> Thanks again for being so patient with all of my review comments. I hope
> it wasn't too big of a pain; this area feels very fragile (to me, at
> least) so I wanted to give it an extra careful set of eyes.

Thanks for writing and reviewing.


* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-08  8:45                 ` Christian Couder
  2023-08-09 20:38                   ` Taylor Blau
@ 2023-08-09 22:50                   ` Junio C Hamano
  2023-08-09 23:38                     ` Junio C Hamano
  1 sibling, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-08-09 22:50 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> but I haven't been able to check that this works on CI as all the jobs
> seem to fail these days before they even start:

This remark made me worried, but (luckily) it does not seem to be
the case for other topics cooking in 'next' or queued in 'seen'.

It does seem that this topic directly queued on 'master' by itself
without any other topics does break the CI quite badly.  Almost
nothing passes:

  https://github.com/git/git/actions/runs/5812873987

but it may be something as silly as failing test lint.  I didn't
check very closely.

Thanks.

* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-09 22:50                   ` Junio C Hamano
@ 2023-08-09 23:38                     ` Junio C Hamano
  2023-08-10  0:10                       ` Jeff King
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-08-09 23:38 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt, Christian Couder

Junio C Hamano <gitster@pobox.com> writes:

> It does seem that this topic directly queued on 'master' by itself
> without any other topics does break the CI quite badly.  Almost
> nothing passes:
>
>   https://github.com/git/git/actions/runs/5812873987
>
> but it may be something as silly as failing test lint.  I didn't
> check very closely.

And a bit further digging reveals that it is the case.

https://github.com/git/git/actions/runs/5812873987/job/15759211568#step:4:787

Locally you should be able to reproduce it by

    make
    make -C t test-lint

before sending your patches out.

I've queued a squashable fix-up on top of the topic.

Thanks.

---
 t/t7700-repack.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 9b1e189a62..48e92aa6f7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -342,8 +342,9 @@ test_expect_success 'repacking with a filter works' '
 '
 
 test_expect_success '--filter fails with --write-bitmap-index' '
-	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
-		-a -d --write-bitmap-index --filter=blob:none
+	test_must_fail \
+		env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
 '
 
 test_expect_success 'repacking with two filters works' '
-- 
2.42.0-rc1


* Re: [PATCH v2 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-09 23:38                     ` Junio C Hamano
@ 2023-08-10  0:10                       ` Jeff King
  0 siblings, 0 replies; 161+ messages in thread
From: Jeff King @ 2023-08-10  0:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

On Wed, Aug 09, 2023 at 04:38:45PM -0700, Junio C Hamano wrote:

> Locally you should be able to reproduce it by
> 
>     make
>     make -C t test-lint
> 
> before sending your patches out.

This shouldn't be necessary. "make test" is supposed to run test-lint
(and does catch this case for me). Provided you haven't overridden that
by setting TEST_LINT to some more limited set.

-Peff

* [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-08  8:26     ` [PATCH v4 " Christian Couder
                         ` (8 preceding siblings ...)
  2023-08-09 21:45       ` [PATCH v4 0/8] Repack objects into separate packfiles based on a filter Taylor Blau
@ 2023-08-12  0:00       ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
                           ` (9 more replies)
  9 siblings, 10 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` removed the
filtered out objects altogether which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     object can be manually removed after checking that all the objects
     it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 4

Thanks to Junio who reviewed versions 1, 2, 3 and 4, and to Taylor who
reviewed versions 1, 3 and 4! Thanks also to Robert Coup who
participated in the discussions related to version 2 and Peff who
participated in the discussions related to version 4. The changes are
the following:

- In patch 2/8, which introduces `test-tool find-pack`, a spurious
  space character has been removed between 'die' and '(', as suggested
  by Taylor.

- In patch 4/8, which refactors code into a find_pack_prefix()
  function, this function has been changed so that the `packdir` and
  `packtmp` arguments are now 'const', as suggested by Taylor.

- In patch 5/8, which introduces `--filter=<filter-spec>` option, the
  `filter_options` member of the 'cruft_po_args' variable is not
  initialized and freed anymore, as this member is actually unused.

- Also in patch 5/8, the '--filter fails with --write-bitmap-index'
  test has been changed to use `test_must_fail env` to fix failures
  with the 'test-lint' Makefile target, as suggested by Junio and
  Taylor. (Junio's 'SQUASH???' patch was squashed into that patch.)

- Also the series was rebased on top of v2.42.0-rc1 as it will likely
  be merged after v2.42.0 is released and Junio's
  cc/repack-sift-filtered-objects-to-separate-pack branch is based on
  top of v2.42.0-rc0.

# Commit overview

* 1/8 pack-objects: allow `--filter` without `--stdout`

  This patch is the same as in v1, v2, v3 and v4. To be able to later
  repack with a filter we need `git pack-objects` to write packfiles
  when it's filtering instead of just writing the pack without the
  filtered out objects to stdout.

* 2/8 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object. Since v4 only a space character
  has been removed between a function name and the following '(' to
  comply with our style guide.

* 3/8 repack: refactor finishing pack-objects command

  No change in this patch compared to v2, v3 and v4. This is a small
  refactoring creating a new useful function, so that `git repack
  --filter=...` will be able to reuse it.

* 4/8 repack: refactor finding pack prefix

  This is another small refactoring creating a small function that
  will be reused in the next patch. Since v4 the new function has been
  changed so that its `packdir` and `packtmp` arguments are now const.

* 5/8 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with the `--filter` option. And then
  another `git pack-objects` process with the `--stdin-packs`
  option. A few changes have been made since v4:

    - The `filter_options` member of the 'cruft_po_args' variable is
      not initialized and freed anymore, as this member is actually
      unused.

    - The test that checks that `--filter=...` fails with
      `--write-bitmap-index` has been changed to use `test_must_fail
      env` to fix failures with the 'test-lint' Makefile target.

* 6/8 gc: add `gc.repackFilter` config option

  No change in this patch compared to v4. This is a gc config option
  so that `git gc` can also repack using a filter and put the filtered
  out objects into a separate packfile.

* 7/8 repack: implement `--filter-to` for storing filtered out objects

  No change in this patch compared to v4. For some use cases, it's
  interesting to create the packfile that contains the filtered out
  objects in a separate location. This is similar to the
  `--expire-to` option for cruft packfiles.

* 8/8 gc: add `gc.repackFilterTo` config option

  No change in this patch compared to v3 and v4. This allows
  specifying the location of the packfile that contains the filtered
  out objects when using `gc.repackFilter`.

# Range-diff since v4

1:  09fd23c7d0 = 1:  bbcc368876 pack-objects: allow `--filter` without `--stdout`
2:  c75010d20c ! 2:  f1b80e5728 t/helper: add 'find-pack' test-tool
    @@ t/helper/test-find-pack.c (new)
     +          }
     +
     +  if (count > -1 && count != actual_count)
    -+          die ("bad packfile count %d instead of %d", actual_count, count);
    ++          die("bad packfile count %d instead of %d", actual_count, count);
     +
     +  return 0;
     +}
3:  28221861a0 = 3:  ffecc73960 repack: refactor finishing pack-objects command
4:  41d4faf62b ! 4:  6c2f381a88 repack: refactor finding pack prefix
    @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *ar
        return finish_pack_objects_cmd(&cmd, names, local);
      }
      
    -+static const char *find_pack_prefix(char *packdir, char *packtmp)
    ++static const char *find_pack_prefix(const char *packdir, const char *packtmp)
     +{
     +  const char *pack_prefix;
     +  if (!skip_prefix(packtmp, packdir, &pack_prefix))
5:  a929572b96 ! 5:  134700c2ce repack: add `--filter=<filter-spec>` option
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
        };
      
     +  list_objects_filter_init(&po_args.filter_options);
    -+  list_objects_filter_init(&cruft_po_args.filter_options);
     +
        git_config(repack_config, &cruft_po_args);
      
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
        string_list_clear(&existing_kept_packs, 0);
        clear_pack_geometry(geometry);
     +  list_objects_filter_release(&po_args.filter_options);
    -+  list_objects_filter_release(&cruft_po_args.filter_options);
      
        return ret;
      }
    @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavaila
     +'
     +
     +test_expect_success '--filter fails with --write-bitmap-index' '
    -+  GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
    -+          -a -d --write-bitmap-index --filter=blob:none
    ++  test_must_fail \
    ++          env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
    ++          git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
     +'
     +
     +test_expect_success 'repacking with two filters works' '
6:  a22a560d74 = 6:  d3365c7b48 gc: add `gc.repackFilter` config option
7:  387b427fed = 7:  9a09382cd1 repack: implement `--filter-to` for storing filtered out objects
8:  76fac86b0e = 8:  a52e3a71db gc: add `gc.repackFilterTo` config option


Christian Couder (8):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 ++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 +++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 167 +++++++++++++++------
 t/helper/test-find-pack.c              |  50 +++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t0080-find-pack.sh                   |  82 ++++++++++
 t/t5317-pack-objects-filter-objects.sh |   8 +
 t/t6500-gc.sh                          |  24 +++
 t/t7700-repack.sh                      | 197 +++++++++++++++++++++++++
 14 files changed, 542 insertions(+), 50 deletions(-)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 1/8] pack-objects: allow `--filter` without `--stdout`
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 2/8] t/helper: add 'find-pack' test-tool Christian Couder
                           ` (8 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932c..583270a85f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -298,8 +298,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d2a162d528..000ebec7ab 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4400,12 +4400,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 2/8] t/helper: add 'find-pack' test-tool
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
  2023-08-12  0:00         ` [PATCH v5 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 3/8] repack: refactor finishing pack-objects command Christian Couder
                           ` (7 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects
into different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Let's also make it possible to check if a given object is in the
expected number of packfiles with a `--check-count <n>` option.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 50 ++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/t0080-find-pack.sh      | 82 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

diff --git a/Makefile b/Makefile
index ace3e5a506..2534c831e8 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..e8bd793e58
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,50 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ *
+ * If '--check-count <n>' is passed, then error out if the number of
+ * packfiles containing the object is not <n>.
+ */
+
+static const char *find_pack_usage[] = {
+	"test-tool find-pack [--check-count <n>] <object>",
+	NULL
+};
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+	int count = -1, actual_count = 0;
+	const char *prefix = setup_git_directory();
+
+	struct option options[] = {
+		OPT_INTEGER('c', "check-count", &count, "expected number of packs"),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, prefix, options, find_pack_usage, 0);
+	if (argc != 1)
+		usage(find_pack_usage[0]);
+
+	if (repo_get_oid(the_repository, argv[0], &oid))
+		die("cannot parse %s as an object name", argv[0]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		if (find_pack_entry_one(oid.hash, p)) {
+			printf("%s\n", p->pack_name);
+			actual_count++;
+		}
+
+	if (count > -1 && count != actual_count)
+		die("bad packfile count %d instead of %d", actual_count, count);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb..41da40c296 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c..411dbf2db4 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
diff --git a/t/t0080-find-pack.sh b/t/t0080-find-pack.sh
new file mode 100755
index 0000000000..67b11216a3
--- /dev/null
+++ b/t/t0080-find-pack.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='test `test-tool find-pack`'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	test_commit three &&
+	test_commit four &&
+	test_commit five
+'
+
+test_expect_success 'repack everything into a single packfile' '
+	git repack -a -d --no-write-bitmap-index &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+	head_tree_pack=$(test-tool find-pack HEAD^{tree}) &&
+	one_pack=$(test-tool find-pack HEAD:one.t) &&
+	three_pack=$(test-tool find-pack HEAD:three.t) &&
+	old_commit_pack=$(test-tool find-pack HEAD~4) &&
+
+	test-tool find-pack --check-count 1 HEAD &&
+	test-tool find-pack --check-count=1 HEAD^{tree} &&
+	! test-tool find-pack --check-count=0 HEAD:one.t &&
+	! test-tool find-pack -c 2 HEAD:one.t &&
+	test-tool find-pack -c 1 HEAD:three.t &&
+
+	# Packfile exists at the right path
+	case "$head_commit_pack" in
+		".git/objects/pack/pack-"*".pack") true ;;
+		*) false ;;
+	esac &&
+	test -f "$head_commit_pack" &&
+
+	# Everything is in the same pack
+	test "$head_commit_pack" = "$head_tree_pack" &&
+	test "$head_commit_pack" = "$one_pack" &&
+	test "$head_commit_pack" = "$three_pack" &&
+	test "$head_commit_pack" = "$old_commit_pack"
+'
+
+test_expect_success 'add more packfiles' '
+	git rev-parse HEAD^{tree} HEAD:two.t HEAD:four.t >objects &&
+	git pack-objects .git/objects/pack/mypackname1 >packhash1 <objects &&
+
+	git rev-parse HEAD~ HEAD~^{tree} HEAD:five.t >objects &&
+	git pack-objects .git/objects/pack/mypackname2 >packhash2 <objects &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+
+	# HEAD^{tree} is in 2 packfiles
+	test-tool find-pack HEAD^{tree} >head_tree_packs &&
+	grep "$head_commit_pack" head_tree_packs &&
+	grep mypackname1 head_tree_packs &&
+	! grep mypackname2 head_tree_packs &&
+	test-tool find-pack --check-count 2 HEAD^{tree} &&
+	! test-tool find-pack --check-count 1 HEAD^{tree} &&
+
+	# HEAD:five.t is also in 2 packfiles
+	test-tool find-pack HEAD:five.t >five_packs &&
+	grep "$head_commit_pack" five_packs &&
+	! grep mypackname1 five_packs &&
+	grep mypackname2 five_packs &&
+	test-tool find-pack -c 2 HEAD:five.t &&
+	! test-tool find-pack --check-count=0 HEAD:five.t
+'
+
+test_expect_success 'add more commits (as loose objects)' '
+	test_commit six &&
+	test_commit seven &&
+
+	test -z "$(test-tool find-pack HEAD)" &&
+	test -z "$(test-tool find-pack HEAD:six.t)" &&
+	test-tool find-pack --check-count 0 HEAD &&
+	test-tool find-pack -c 0 HEAD:six.t &&
+	! test-tool find-pack -c 1 HEAD:seven.t
+'
+
+test_done
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 3/8] repack: refactor finishing pack-objects command
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
  2023-08-12  0:00         ` [PATCH v5 1/8] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 2/8] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 4/8] repack: refactor finding pack prefix Christian Couder
                           ` (6 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting it into a string_list, as well as
calling finish_command().

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index aea5ca9d44..96af2d1caf 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -696,6 +696,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -705,9 +735,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct string_list *existing_kept_packs)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -751,27 +780,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -782,10 +791,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1016,18 +1023,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 4/8] repack: refactor finding pack prefix
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (2 preceding siblings ...)
  2023-08-12  0:00         ` [PATCH v5 3/8] repack: refactor finishing pack-objects command Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
                           ` (5 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 96af2d1caf..825da1caca 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -783,6 +783,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(const char *packdir, const char *packtmp)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1031,12 +1042,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix(packdir, packtmp);
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 5/8] repack: add `--filter=<filter-spec>` option
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (3 preceding siblings ...)
  2023-08-12  0:00         ` [PATCH v5 4/8] repack: refactor finding pack prefix Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 6/8] gc: add `gc.repackFilter` config option Christian Couder
                           ` (4 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them to separate, cheaper, though slower, storage.

It's possible to find which new packfile contains the filtered out
objects using one of the following:

  - `git verify-pack -v ...`,
  - `test-tool find-pack ...`, which a previous commit added,
  - `--filter-to=<dir>`, which a following commit will add to specify
    where the pack containing the filtered out objects will be.

This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.

As the interactions with kept packs are a bit tricky, a few related
tests are added.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  12 ++++
 builtin/repack.c             |  73 +++++++++++++++++++
 t/t7700-repack.sh            | 135 +++++++++++++++++++++++++++++++++++
 3 files changed, 220 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..6d5bec7716 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,18 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  Also `--no-write-bitmap-index` (or the
+	`repack.writebitmaps` config option set to `false`) should be
+	used otherwise writing bitmap index will fail, as it supposes
+	a single packfile containing all the objects. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 825da1caca..c672387ab9 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -21,6 +21,7 @@
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "refs.h"
+#include "list-objects-filter-options.h"
 
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
@@ -57,6 +58,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	struct list_objects_filter_options filter_options;
 };
 
 static int repack_config(const char *var, const char *value,
@@ -726,6 +728,57 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct string_list *keep_pack_list,
+			       struct string_list *names,
+			       struct string_list *existing_packs,
+			       struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret, i;
+	const char *caret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	prepare_pack_objects(&cmd, args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	if (!pack_kept_objects)
+		strvec_push(&cmd.args, "--honor-pack-keep");
+	for (i = 0; i < keep_pack_list->nr; i++)
+		strvec_pushf(&cmd.args, "--keep-pack=%s",
+			     keep_pack_list->items[i].string);
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * Here 'names' contains only the pack(s) that were just
+	 * written, which is exactly the packs we want to keep. Also
+	 * 'existing_kept_packs' already contains the packs in
+	 * 'keep_pack_list'.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	caret = pack_kept_objects ? "" : "^";
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s%s.pack\n", caret, item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -858,6 +911,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_PARSE_LIST_OBJECTS_FILTER(&po_args.filter_options),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -871,6 +925,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
+	list_objects_filter_init(&po_args.filter_options);
+
 	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
@@ -1011,6 +1067,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (po_args.filter_options.choice)
+		strvec_pushf(&cmd.args, "--filter=%s",
+			     expand_list_objects_filter_spec(&po_args.filter_options));
+
 	if (geometry)
 		cmd.in = -1;
 	else
@@ -1097,6 +1157,18 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter_options.choice) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(packdir, packtmp),
+					  &keep_pack_list,
+					  &names,
+					  &existing_nonkept_packs,
+					  &existing_kept_packs);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
@@ -1231,6 +1303,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&existing_nonkept_packs, 0);
 	string_list_clear(&existing_kept_packs, 0);
 	clear_pack_geometry(geometry);
+	list_objects_filter_release(&po_args.filter_options);
 
 	return ret;
 }
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 27b66807cd..39e89445fd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -327,6 +327,141 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack -c 1 HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack -c 1 HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
+test_expect_success '--filter fails with --write-bitmap-index' '
+	test_must_fail \
+		env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
+'
+
+test_expect_success 'repacking with two filters works' '
+	git init two-filters &&
+	(
+		cd two-filters &&
+		mkdir subdir &&
+		test_commit foo &&
+		test_commit subdir_bar subdir/bar &&
+		test_commit subdir_baz subdir/baz
+	) &&
+	git clone --no-local --bare two-filters two-filters.git &&
+	(
+		cd two-filters.git &&
+		test_stdout_line_count = 1 ls objects/pack/*.pack &&
+		git -c repack.writebitmaps=false repack -a -d \
+			--filter=blob:none --filter=tree:1 &&
+		test_stdout_line_count = 2 ls objects/pack/*.pack &&
+		commit_pack=$(test-tool find-pack -c 1 HEAD) &&
+		blob_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		root_tree_pack=$(test-tool find-pack -c 1 HEAD^{tree}) &&
+		subdir_tree_hash=$(git ls-tree --object-only HEAD -- subdir) &&
+		subdir_tree_pack=$(test-tool find-pack -c 1 "$subdir_tree_hash") &&
+
+		# Root tree and subdir tree are not in the same packfiles
+		test "$commit_pack" != "$blob_pack" &&
+		test "$commit_pack" = "$root_tree_pack" &&
+		test "$blob_pack" = "$subdir_tree_pack"
+	)
+'
+
+prepare_for_keep_packs () {
+	git init keep-packs &&
+	(
+		cd keep-packs &&
+		test_commit foo &&
+		test_commit bar
+	) &&
+	git clone --no-local --bare keep-packs keep-packs.git &&
+	(
+		cd keep-packs.git &&
+
+		# Create two packs
+		# The first pack will contain all of the objects except one blob
+		git rev-list --objects --all >objs &&
+		grep -v "bar.t" objs | git pack-objects pack &&
+		# The second pack will contain the excluded object and be kept
+		packid=$(grep "bar.t" objs | git pack-objects pack) &&
+		>pack-$packid.keep &&
+
+		# Replace the existing pack with the 2 new ones
+		rm -f objects/pack/pack* &&
+		mv pack-* objects/pack/
+	)
+}
+
+test_expect_success '--filter works with .keep packs' '
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack_1=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		# Object bar is still only in the old .keep pack
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$bar_pack_1" = "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		test "$foo_pack_1" != "$bar_pack_1" &&
+		test "$foo_pack_1" != "$head_pack_1" &&
+		test "$bar_pack_1" != "$head_pack_1"
+	)
+'
+
+test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
+	rm -rf keep-packs keep-packs.git &&
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--pack-kept-objects &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		test-tool find-pack -c 2 HEAD:bar.t >bar_pack_1 &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$foo_pack_1" != "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		# Object bar is in both the old .keep pack and the new
+		# pack that contained the filtered out objects
+		grep "$bar_pack" bar_pack_1 &&
+		grep "$foo_pack_1" bar_pack_1 &&
+		test "$foo_pack_1" != "$head_pack_1"
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 6/8] gc: add `gc.repackFilter` config option
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (4 preceding siblings ...)
  2023-08-12  0:00         ` [PATCH v5 5/8] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                           ` (3 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new, separate pack.

Users might want to perform such a cleanup regularly, at the same time
as they perform other repacks and cleanups, that is, as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack and a non-empty <filter-spec> is
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
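For illustration only (not part of the commit): a minimal sketch of how
the new option might be used, assuming a Git build that includes this
series. The repository names and the identity values are made up for
the example. A Git without `gc.repackFilter` simply ignores the setting.

```shell
#!/bin/sh
# Sketch: configure gc.repackFilter in a throwaway bare clone, then run gc.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/src"
echo content1 >"$tmp/src/file1"
git -C "$tmp/src" add file1
git -C "$tmp/src" -c user.name=Demo -c user.email=demo@example.com \
	commit -q -m initial
git clone -q --no-local --bare "$tmp/src" "$tmp/bare.git"

# Same settings as the test in this patch: no bitmaps, no cruft packs.
git -C "$tmp/bare.git" config repack.writeBitmaps false
git -C "$tmp/bare.git" config gc.cruftPacks false
git -C "$tmp/bare.git" config gc.repackFilter blob:none
git -C "$tmp/bare.git" gc -q

# With the series applied, the blobs should now sit in a second pack.
ls "$tmp/bare.git/objects/pack/"*.pack
```

Setting the option with `git config` rather than `-c` makes every later
`git gc` in that repo use the same filter.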
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 13 +++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index ca47eb2008..2153bde7ac 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -145,6 +145,11 @@ Multiple hooks are supported, but all must exit successfully, else the
 operation (either generating a cruft pack or unpacking unreachable
 objects) will be halted.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 19d73067aa..9b0984f301 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -61,6 +61,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -170,6 +171,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -355,6 +358,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..232e403b66 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,19 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 7/8] repack: implement `--filter-to` for storing filtered out objects
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (5 preceding siblings ...)
  2023-08-12  0:00         ` [PATCH v5 6/8] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-12  0:00         ` [PATCH v5 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
                           ` (2 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new, separate pack.

It would be nice if this new different pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage.

Even in a different directory, this pack can be accessible if, for
example, the Git alternates mechanism is used to point to it. In fact
not using the Git alternates mechanism can corrupt a repo as the
generated pack containing the filtered objects might not be accessible
from the repo any more. So setting up the Git alternates mechanism
should be done before using this feature if the user wants the repo to
be fully usable while this feature is used.

In some cases, like when a repo has just been cloned or when there is no
other activity in the repo, it's OK to set up the Git alternates
mechanism afterwards though. It's also OK to just inspect the generated
packfile containing the filtered objects and then just move it into the
'.git/objects/pack/' directory manually. That's why it's not necessary
for this command to check that the Git alternates mechanism has
already been set up.

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
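For illustration only (not part of the commit): a minimal sketch of the
order of operations recommended above, that is, registering the
alternate before repacking with `--filter-to`. The repository names and
identity values are made up for the example, and the repack step is
skipped when the installed Git does not list the option in its usage
message.

```shell
#!/bin/sh
# Sketch: register the alternate first, then repack with --filter-to.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/src"
echo content1 >"$tmp/src/file1"
git -C "$tmp/src" add file1
git -C "$tmp/src" -c user.name=Demo -c user.email=demo@example.com \
	commit -q -m initial
git clone -q --no-local --bare "$tmp/src" "$tmp/bare.git"
git init -q --bare "$tmp/filtered.git"

# Make filtered.git's object directory reachable before any object moves.
echo "$tmp/filtered.git/objects" >"$tmp/bare.git/objects/info/alternates"

# Only run the repack if this Git knows the option (i.e. has this series).
if git repack -h 2>&1 | grep -q -e '--filter-to'
then
	git -C "$tmp/bare.git" -c repack.writeBitmaps=false repack -a -d -q \
		--filter=blob:none \
		--filter-to="$tmp/filtered.git/objects/pack/pack"
fi

# The blob stays readable from bare.git wherever its pack ended up.
git -C "$tmp/bare.git" cat-file -p HEAD:file1
```

An absolute path is used for `--filter-to` here only to keep the sketch
independent of the current directory; the tests in this patch use a
relative path.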
 Documentation/git-repack.txt | 11 +++++++
 builtin/repack.c             | 10 +++++-
 t/t7700-repack.sh            | 62 ++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 6d5bec7716..8545a32667 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,17 @@ depth is 4095.
 	a single packfile containing all the objects. See
 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. Only useful with `--filter`. This can be
+	used for putting the pack on a separate object directory that
+	is accessed through the Git alternates mechanism. **WARNING:**
+	If the packfile containing the filtered out objects is not
+	accessible, the repo can become corrupt as it might not be
+	possible to access the objects in that packfile. See the
+	`objects` and `objects/info/alternates` sections of
+	linkgit:gitrepository-layout[5].
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index c672387ab9..c396029ec9 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -870,6 +870,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -922,6 +923,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1070,6 +1073,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (po_args.filter_options.choice)
 		strvec_pushf(&cmd.args, "--filter=%s",
 			     expand_list_objects_filter_spec(&po_args.filter_options));
+	else if (filter_to)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
 
 	if (geometry)
 		cmd.in = -1;
@@ -1158,8 +1163,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter_options.choice) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(packdir, packtmp),
 					  &keep_pack_list,
 					  &names,
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 39e89445fd..48e92aa6f7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -462,6 +462,68 @@ test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
 	)
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 0 HEAD:file1) &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack -c 1 $blob_hash) &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar) &&
+		base_pack=$(test-tool find-pack -c 1 HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.rc1.8.ga52e3a71db



* [PATCH v5 8/8] gc: add `gc.repackFilterTo` config option
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (6 preceding siblings ...)
  2023-08-12  0:00         ` [PATCH v5 7/8] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-08-12  0:00         ` Christian Couder
  2023-08-15  0:51         ` [PATCH v5 0/8] Repack objects into separate packfiles based on a filter Junio C Hamano
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack and a non-empty <dir> is
configured through this option, the repack process is
passed a corresponding `--filter-to=<dir>` argument.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
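For illustration only (not part of the commit): a minimal sketch of the
two config options used together, assuming a Git build that includes
this series. The repository names and identity values are made up for
the example. A Git without these options ignores both settings and
leaves all objects in bare.git.

```shell
#!/bin/sh
# Sketch: let `git gc` write the filtered-out objects into another repo.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/src"
echo content1 >"$tmp/src/file1"
git -C "$tmp/src" add file1
git -C "$tmp/src" -c user.name=Demo -c user.email=demo@example.com \
	commit -q -m initial
git clone -q --no-local --bare "$tmp/src" "$tmp/bare.git"
git init -q --bare "$tmp/filtered.git"

# Register the alternate first so the objects stay reachable.
echo "$tmp/filtered.git/objects" >"$tmp/bare.git/objects/info/alternates"

git -C "$tmp/bare.git" config repack.writeBitmaps false
git -C "$tmp/bare.git" config gc.cruftPacks false
git -C "$tmp/bare.git" config gc.repackFilter blob:none
git -C "$tmp/bare.git" config gc.repackFilterTo \
	"$tmp/filtered.git/objects/pack/pack"
git -C "$tmp/bare.git" gc -q

# The blob is still readable, from whichever object directory holds it.
git -C "$tmp/bare.git" cat-file -p HEAD:file1
```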
 Documentation/config/gc.txt | 11 +++++++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 2153bde7ac..466466d6cc 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -150,6 +150,17 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects. **WARNING:** The
+	specified location should be accessible, using for example the
+	Git alternates mechanism, otherwise the repo could be
+	considered corrupt by Git as it might not be able to access the
+	objects in that packfile. See the `--filter-to=<dir>` option
+	of linkgit:git-repack[1] and the `objects/info/alternates`
+	section of linkgit:gitrepository-layout[5].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 9b0984f301..1b7c775d94 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -172,6 +173,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -361,6 +363,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 232e403b66..e412cf8daf 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -215,6 +214,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo stores filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.rc1.8.ga52e3a71db



* Re: [PATCH v4 0/8] Repack objects into separate packfiles based on a filter
  2023-08-09 21:45       ` [PATCH v4 0/8] Repack objects into separate packfiles based on a filter Taylor Blau
  2023-08-09 21:57         ` Junio C Hamano
@ 2023-08-12  0:12         ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-08-12  0:12 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Wed, Aug 9, 2023 at 11:45 PM Taylor Blau <me@ttaylorr.com> wrote:

> I took a look through the range-diff as well as the patches themselves
> again (skimming through the last three, which are much more
> straightforward than the preceding ones).
>
> Everything looks good to me here, and I think that this version is ready
> to get picked up once we're on the other side of 2.42.

Thanks again for your review!

> I left a couple of comments throughout, but none of them merit a reroll
> on their own. I think there are a couple of things we could easily
> ignore (marking parameters as "const", etc.), and a couple of things
> that we should probably take a look at after the dust has settled here.

The version 5 I just sent should fix all the small things that you
found in your review.

> We *may* want to fix up the test_must_fail invocation that has the
> environment variable on the left-hand side instead of using
> "test_must_fail env", but I don't know for sure.

This is fixed by squashing Junio's 'SQUASH???' commit in version 5.

> I do think that we should take another look at disabling the bitmap
> machinery when given `--filter`, but I think that, too, can be done in
> another series.

I agree. I plan to do it later when this is merged. I think it would
make it easier to use the new --filter feature, but it would require
changes in code, tests and documentation, which can be done later.

Thanks,
Christian.


* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (7 preceding siblings ...)
  2023-08-12  0:00         ` [PATCH v5 8/8] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-08-15  0:51         ` Junio C Hamano
  2023-08-15 21:43           ` Taylor Blau
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
  9 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-08-15  0:51 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt

Christian Couder <christian.couder@gmail.com> writes:

> # Changes since version 4
>
> Thanks to Junio who reviewed versions 1, 2, 3 and 4, and to Taylor who
> reviewed versions 1, 3 and 4! Thanks also to Robert Coup who
> participated in the discussions related to version 2 and Peff who
> participated in the discussions related to version 4. The changes are
> the following:
>
> - In patch 2/8, which introduces `test-tool find-pack`, a spurious
>   space character has been removed between 'die' and '(', as suggested
>   by Taylor.
>
> - In patch 4/8, which refactors code into a find_pack_prefix()
>   function, this function has been changed so that the `packdir` and
>   `packtmp` arguments are now 'const', as suggested by Taylor.
>
> - In patch 5/8, which introduces `--filter=<filter-spec>` option, the
>   `filter_options` member of the 'cruft_po_args' variable is not
>   initialized and freed anymore, as this member is actually unused.
>
> - Also in patch 5/8, the '--filter fails with --write-bitmap-index'
>   test has been changed to use `test_must_fail env` to fix failures
>   with the 'test-lint' Makefile target, as suggested by Junio and
>   Taylor. (Junio's 'SQUASH???' patch was squashed into that patch.)

Thanks.  I do not recall if the previous version with SQUASH??? passed
the tests or not, but this round seems to be breaking the exact test
we had trouble with in the previous round:

  https://github.com/git/git/actions/runs/5850998716/job/15861158252#step:4:1822

The symptom looks like that "test_must_fail env" test is not
failing.  Ring a bell?

Thanks.


* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-15  0:51         ` [PATCH v5 0/8] Repack objects into separate packfiles based on a filter Junio C Hamano
@ 2023-08-15 21:43           ` Taylor Blau
  2023-08-15 22:32             ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-08-15 21:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Mon, Aug 14, 2023 at 05:51:05PM -0700, Junio C Hamano wrote:
> Thanks.  I do not recall if the previous version with SQUASH??? passed
> the tests or not, but this round seems to be breaking the exact test
> we had trouble with in the previous round:
>
>   https://github.com/git/git/actions/runs/5850998716/job/15861158252#step:4:1822
>
> The symptom looks like that "test_must_fail env" test is not
> failing.  Ring a bell?

That does ring a bell for me, but this is a different failure than
before, IIRC.

This time we're expecting to fail writing a bitmap during a filtered
repack, but we succeed. I was wondering in [1] whether or not we should
be catching this bad combination of options more eagerly than relying on
the pack-bitmap machinery to notice that we're missing a reachability
closure.

I think the reason that this succeeds is that we already have a bitmap,
and it likely reuses all of the existing bitmaps before discovering that
the pack we wrote doesn't contain all objects. So doing this "fixes" the
immediate issue:

--- 8< ---
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 48e92aa6f7..e5134d3451 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -342,6 +342,7 @@ test_expect_success 'repacking with a filter works' '
 '

 test_expect_success '--filter fails with --write-bitmap-index' '
+	rm -f bare.git/objects/pack/*.bitmap &&
 	test_must_fail \
 		env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
 		git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
--- >8 ---

but I wonder if a more complete fix would be something like:

--- 8< ---
diff --git a/builtin/repack.c b/builtin/repack.c
index c396029ec9..f021349c4e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -48,6 +48,11 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writeBitmaps configuration."
 );

+static const char filtered_bitmap_conflict_error[] = N_(
+"Filtered repacks are incompatible with bitmap indexes.  Use\n"
+"--no-write-bitmap-index or disable the pack.writeBitmaps configuration."
+);
+
 struct pack_objects_args {
 	const char *window;
 	const char *window_memory;
@@ -953,7 +958,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)

 	if (write_bitmaps < 0) {
 		if (!write_midx &&
-		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
+		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()) &&
+		    !po_args.filter_options.choice)
 			write_bitmaps = 0;
 	} else if (write_bitmaps &&
 		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
@@ -966,6 +972,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));

+	if (write_bitmaps && po_args.filter_options.choice)
+		die(_(filtered_bitmap_conflict_error));
+
 	if (write_bitmaps && po_args.local && has_alt_odb(the_repository)) {
 		/*
 		 * When asked to do a local repack, but we have
--- >8 ---

would be preferable.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/ZNQH6EMKqbuUzEhs@nand.local/


* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-15 21:43           ` Taylor Blau
@ 2023-08-15 22:32             ` Junio C Hamano
  2023-08-15 23:09               ` Taylor Blau
  0 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-08-15 22:32 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> I think the reason that this succeeds is that we already have a bitmap,
> and it likely reuses all of the existing bitmaps before discovering that
> the pack we wrote doesn't contain all objects.

Now I am confused.

We were asked to write bitmap index when we are going to create an
incomplete pack, and the packfile we generate with the filter will
not have full set of objects, and generating a bitmap with such an
incomplete knowledge of what objects are reachable from what would
be a disaster, so we should turn it off.  But the posted patch
lacked such a "we should abort when bitmap is asked to be written
while filtering" logic.

Then what were we expecting for the test to fail for?

> but I wonder if a more complete fix would be something like:
> ...
> @@ -966,6 +972,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
>  		die(_(incremental_bitmap_conflict_error));
>
> +	if (write_bitmaps && po_args.filter_options.choice)
> +		die(_(filtered_bitmap_conflict_error));
> +

It sounds like the most direct fix.

* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-15 22:32             ` Junio C Hamano
@ 2023-08-15 23:09               ` Taylor Blau
  2023-08-15 23:18                 ` Junio C Hamano
  2023-09-11 15:20                 ` Christian Couder
  0 siblings, 2 replies; 161+ messages in thread
From: Taylor Blau @ 2023-08-15 23:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Tue, Aug 15, 2023 at 03:32:23PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > I think the reason that this succeeds is that we already have a bitmap,
> > and it likely reuses all of the existing bitmaps before discovering that
> > the pack we wrote doesn't contain all objects.
>
> Now I am confused.
>
> We were asked to write bitmap index when we are going to create an
> incomplete pack, and the packfile we generate with the filter will
> not have full set of objects, and generating a bitmap with such an
> incomplete knowledge of what objects are reachable from what would
> be a disaster, so we should turn it off.  But the posted patch
> lacked such a "we should abort when bitmap is asked to be written
> while filtering" logic.

I was similarly confused, and started writing a patch to detect when we
see objects in one bitmap but not the other when remapping. But we
already handle that case, see the call to `rebuild_bitmap()` from
`fill_bitmap_commit()` in pack-bitmap-write.c.

So I don't think we'd ever end up reusing an existing bitmap that refers
to objects that we don't have.

But something is definitely strange here. The bitmap generated by this
test claims to have three commits:

    $ ~/src/git/t/helper/test-tool bitmap list-commits
    95a9e53327b06212dcf98bd44794b0e2b913deab
    3677360288c631b6b2e1f0e1f081b1e518605e9f
    6f105e6234717c52e9b117b08840926910a68314

...but none of them actually appear to exist in the bitmap:

    $ git rev-list --test-bitmap 95a9e53327b06212dcf98bd44794b0e2b913deab
    Bitmap v1 test (3 entries loaded)
    Found bitmap for '95a9e53327b06212dcf98bd44794b0e2b913deab'. 64 bits / 8b3b6ee7 checksum
    fatal: object not in bitmap: 'ac3e272b72bbf89def8657766b855d0656630ed4'

I think what's going on here is that we attempt to create bitmaps for
all three of those commits. We then try and reuse the existing bitmaps,
but fail, because we are missing some objects.

So then we try and generate the bitmap from scratch, and when we get
down to fill_bitmap_tree() we look up the bit position of the tree
itself, and find a non-zero answer, indicating that we have already
marked that tree.

And fill_bitmap_tree() correctly assumes that if we have marked the bit
corresponding to the tree, that everything reachable from that tree has
also been marked. So we never try and locate the bit position for the
blob, since we already think that we have a blob marked in the resulting
bitmap!

But why is that tree marked in the first place? It's because we attempt
to rebuild the bitmap from the existing .bitmap file, but fail part of
the way through (when we look up the first blob object in the reposition
table). But that happens *after* we see the tree object, so its bit
position is marked, even though we didn't rebuild a complete bitmap.

I don't think this matters outside of filtered repacks, but it would be
a serious bug not to catch this earlier, as suggested in the
(quoted) patch below.
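
To make the early exit described above concrete, here is a toy sketch
in C. It is not the code in pack-bitmap-write.c: `toy_tree`,
`toy_fill_tree()` and the single 64-bit word standing in for a bitmap
are all made up for illustration. It only shows that once the tree's
bit is set, the blobs under it are never looked up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy tree: one bit position for the tree, plus its blobs. */
struct toy_tree {
	uint32_t pos;          /* bit position of the tree itself */
	const uint32_t *blobs; /* bit positions of the blobs it contains */
	size_t nr_blobs;
};

/*
 * Mark `tree` and the blobs below it in `bits`. Returns the number of
 * blob lookups performed, so a caller can tell when the walk was skipped.
 */
static size_t toy_fill_tree(uint64_t *bits, const struct toy_tree *tree)
{
	size_t lookups = 0;

	assert(tree->pos < 64);

	/* Already marked: assume the whole subtree is marked too. */
	if (*bits & ((uint64_t)1 << tree->pos))
		return 0;
	*bits |= (uint64_t)1 << tree->pos;

	for (size_t i = 0; i < tree->nr_blobs; i++) {
		assert(tree->blobs[i] < 64);
		*bits |= (uint64_t)1 << tree->blobs[i];
		lookups++;
	}
	return lookups;
}
```

Starting from an empty word, the walk visits every blob. Starting from
a word where only the tree's bit was left set by a failed rebuild, it
returns immediately and the blobs' bits stay clear.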

> > but I wonder if a more complete fix would be something like:
> > ...
> > @@ -966,6 +972,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
> >  		die(_(incremental_bitmap_conflict_error));
> >
> > +	if (write_bitmaps && po_args.filter_options.choice)
> > +		die(_(filtered_bitmap_conflict_error));
> > +
>
> It sounds like the most direct fix.

I agree.

I think that we would be OK to not change the implementation of
rebuild_bitmap(), or its caller in fill_bitmap_commit(), since this only
bites us when bitmapping a filtered pack, and we should catch that case
well before getting this deep into the bitmap code.

But it does seem suspect that we rebuild right into ent->bitmap, so we
may want to consider doing something like:

--- >8 ---
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f6757c3cbf..f4ecdf8b0e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -413,15 +413,19 @@ static int fill_bitmap_commit(struct bb_commit *ent,

 		if (old_bitmap && mapping) {
 			struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c);
+			struct bitmap *remapped = bitmap_new();
 			/*
 			 * If this commit has an old bitmap, then translate that
 			 * bitmap and add its bits to this one. No need to walk
 			 * parents or the tree for this commit.
 			 */
-			if (old && !rebuild_bitmap(mapping, old, ent->bitmap)) {
+			if (old && !rebuild_bitmap(mapping, old, remapped)) {
+				bitmap_or(ent->bitmap, remapped);
+				bitmap_free(remapped);
 				reused_bitmaps_nr++;
 				continue;
 			}
+			bitmap_free(remapped);
 		}

 		/*
--- 8< ---

on top.

Applying that patch and then rerunning the tests with the appropriate
TEST variables causes the 'git repack' to fail as expected, ensuring
that the containing test passes.

Thanks,
Taylor

* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-15 23:09               ` Taylor Blau
@ 2023-08-15 23:18                 ` Junio C Hamano
  2023-08-16  0:38                   ` Taylor Blau
  2023-09-11 15:20                 ` Christian Couder
  1 sibling, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-08-15 23:18 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> But why is that tree marked in the first place? It's because we attempt
> to rebuild the bitmap from the existing .bitmap file, but fail part of
> the way through (when we look up the first blob object in the reposition
> table). But that happens *after* we see the tree object, so its bit
> position is marked, even though we didn't rebuild a complete bitmap.

So, there is another bug lurking, other than the lack of "combining
filtered repack and bitmaps are explicitly forbidden" logic?  We see
the tree object, we immediately mark it as "done" even though we are not,
then we finish in failure and the "done" mark is left behind?  Do we
need two bits, "under review" and "done", or something then?

> But it does seem suspect that we rebuild right into ent->bitmap, so we
> may want to consider doing something like:
>
> --- >8 ---
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index f6757c3cbf..f4ecdf8b0e 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -413,15 +413,19 @@ static int fill_bitmap_commit(struct bb_commit *ent,
>
>  		if (old_bitmap && mapping) {
>  			struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c);
> +			struct bitmap *remapped = bitmap_new();
>  			/*
>  			 * If this commit has an old bitmap, then translate that
>  			 * bitmap and add its bits to this one. No need to walk
>  			 * parents or the tree for this commit.
>  			 */
> -			if (old && !rebuild_bitmap(mapping, old, ent->bitmap)) {
> +			if (old && !rebuild_bitmap(mapping, old, remapped)) {
> +				bitmap_or(ent->bitmap, remapped);
> +				bitmap_free(remapped);
>  				reused_bitmaps_nr++;
>  				continue;
>  			}
> +			bitmap_free(remapped);
>  		}
>
>  		/*
> --- 8< ---
>
> on top.
>
> Applying that patch and then rerunning the tests with the appropriate
> TEST variables causes the 'git repack' to fail as expected, ensuring
> that the containing test passes.

Interesting.

* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-15 23:18                 ` Junio C Hamano
@ 2023-08-16  0:38                   ` Taylor Blau
  2023-08-16 17:16                     ` Junio C Hamano
  0 siblings, 1 reply; 161+ messages in thread
From: Taylor Blau @ 2023-08-16  0:38 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Tue, Aug 15, 2023 at 04:18:02PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > But why is that tree marked in the first place? It's because we attempt
> > to rebuild the bitmap from the existing .bitmap file, but fail part of
> > the way through (when we look up the first blob object in the reposition
> > table). But that happens *after* we see the tree object, so its bit
> > position is marked, even though we didn't rebuild a complete bitmap.
>
> So, there is another bug lurking, other than the lack of "combining
> filtered repack and bitmaps are explicitly forbidden" logic?

I think that there is a bug lurking in the sense of trying to reuse
bitmaps when covering a pack that doesn't have reachability closure in
this particular scenario.

But there are no "blessed" use-cases for doing this. So I think that we
should indeed fix this, but I am not immediately concerned here.

> We see the tree object, we immediately mark it as "done" even though we are
> not, then we finish in failure and the "done" mark is left behind?  Do
> we need two bits, "under review" and "done", or something then?

No; we can either reuse a complete bitmap or not. So it's fine to OR
all of the (permuted) bits into ent->bitmap, but it's not OK to fill in
just part of them.
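
A toy sketch of that all-or-nothing rule, assuming a fixed-size bitmap
and a `TOY_MISSING` marker for objects that are absent from the new
pack. Both are invented here, as are all the `toy_*` names; the real
code uses `struct bitmap`, `rebuild_bitmap()` and `bitmap_or()` as in
the patch quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_WORDS 4            /* 4 * 64 = 256 bit positions */
#define TOY_MISSING UINT32_MAX /* object absent from the new pack */

struct toy_bitmap {
	uint64_t words[TOY_WORDS];
};

static void toy_set(struct toy_bitmap *b, uint32_t pos)
{
	assert(pos < TOY_WORDS * 64);
	b->words[pos / 64] |= (uint64_t)1 << (pos % 64);
}

static int toy_get(const struct toy_bitmap *b, uint32_t pos)
{
	assert(pos < TOY_WORDS * 64);
	return (int)((b->words[pos / 64] >> (pos % 64)) & 1);
}

/*
 * Translate each old bit position through `mapping` and set it in `dst`.
 * Returns -1 as soon as an object is missing, leaving `dst` partially
 * filled, which is why callers should not pass their real bitmap here.
 */
static int toy_rebuild(const uint32_t *mapping, const uint32_t *old_pos,
		       size_t n, struct toy_bitmap *dst)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t new_pos = mapping[old_pos[i]];

		if (new_pos == TOY_MISSING)
			return -1;
		toy_set(dst, new_pos);
	}
	return 0;
}

/* Remap into a scratch bitmap; OR into `result` only on full success. */
static int toy_reuse(const uint32_t *mapping, const uint32_t *old_pos,
		     size_t n, struct toy_bitmap *result)
{
	struct toy_bitmap scratch;

	memset(&scratch, 0, sizeof(scratch));
	if (toy_rebuild(mapping, old_pos, n, &scratch) < 0)
		return -1; /* `result` is left untouched */

	for (size_t i = 0; i < TOY_WORDS; i++)
		result->words[i] |= scratch.words[i];
	return 0;
}
```

Calling `toy_rebuild()` directly on the final bitmap leaves stray bits
behind when a later object turns out to be missing, whereas
`toy_reuse()` either contributes every remapped bit or none of them.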

Thanks,
Taylor

* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-16  0:38                   ` Taylor Blau
@ 2023-08-16 17:16                     ` Junio C Hamano
  0 siblings, 0 replies; 161+ messages in thread
From: Junio C Hamano @ 2023-08-16 17:16 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> I think that there is a bug lurking in the sense of trying to reuse
> bitmaps when covering a pack that doesn't have reachability closure in
> this particular scenario.
>
> But there are no "blessed" use-cases for doing this. So I think that we
> should indeed fix this, but I am not immediately concerned here.

OK.

> No; we can either reuse a complete bitmap or not. So it's fine to OR
> all of the (permuted) bits into ent->bitmap, but it's not OK to fill in
> just part of them.

Sounds sane.


* [PATCH v6 0/9] Repack objects into separate packfiles based on a filter
  2023-08-12  0:00       ` [PATCH v5 " Christian Couder
                           ` (8 preceding siblings ...)
  2023-08-15  0:51         ` [PATCH v5 0/8] Repack objects into separate packfiles based on a filter Junio C Hamano
@ 2023-09-11 15:06         ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
                             ` (9 more replies)
  9 siblings, 10 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` removed the
filtered out objects altogether which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     object can be manually removed after checking that all the objects
     it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also, since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 5

Thanks to Junio who reviewed or commented on versions 1, 2, 3, 4 and
5, and to Taylor who reviewed or commented on versions 1, 3, 4 and 5!
Thanks also to Robert Coup who participated in the discussions related
to version 2 and Peff who participated in the discussions related to
version 4. There is only the following code change since version 5:

- Patch 5/9 (pack-bitmap-write: rebuild using new bitmap when
  remapping) is new. It fixes a bitmap rebuilding issue that wasn't
  triggered previously but got triggered by this series and caused CI
  tests to fail. The patch is taken from a suggestion by Taylor in:

  https://lore.kernel.org/git/ZNwFlcS3SOS9h77N@nand.local/

  I checked that CI tests now pass in:

  https://github.com/chriscool/git/actions/runs/6122146278

  (There is a failure on 'win test (5)' with "failed: t7527.17
  directory changes to a file", but it looks like it's not related to
  the previous issue and also not related to this series at all.)

Another change is that this series has been rebased on top of
94e83dcf5b (The seventh batch, 2023-09-07) to fix a few conflicts
related to changes in the geometry code, as can be seen in the
short range-diff below.

# Commit overview

(No changes in any of the patches compared to version 5, except that
patch 5/9 is new.)

* 1/9 pack-objects: allow `--filter` without `--stdout`

  To be able to later repack with a filter we need `git pack-objects`
  to write packfiles when it's filtering instead of just writing the
  pack without the filtered out objects to stdout.

* 2/9 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object.

* 3/9 repack: refactor finishing pack-objects command

  This is a small refactoring creating a new useful function, so that
  `git repack --filter=...` will be able to reuse it.

* 4/9 repack: refactor finding pack prefix

  This is another small refactoring creating a small function that
  will be reused in the next patch.

* 5/9 pack-bitmap-write: rebuild using new bitmap when remapping

  This patch is new in version 6. It fixes an issue when bitmaps are
  rebuilt that was revealed by this series, and caused a CI test to
  fail.

* 6/9 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with the `--filter` option. And then
  another `git pack-objects` process with the `--stdin-packs`
  option.
  
* 7/9 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate packfile.

* 8/9 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's interesting to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles.

* 9/9 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`.
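
As a rough illustration of the split that patch 6/9 performs, the toy
C sketch below partitions a list of objects the way a `blob:limit=<n>`
filter would. The series itself does this with two `git pack-objects`
processes, not with code like this; `toy_object`, `toy_filtered_out()`
and `toy_split()` are hypothetical names, and only the "omit blobs of
at least <n> bytes" rule is taken from the rev-list documentation.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB };

struct toy_object {
	enum toy_type type;
	unsigned long size; /* object size in bytes */
};

/* Mirrors `blob:limit=<n>`: omit blobs whose size is at least `limit`. */
static int toy_filtered_out(const struct toy_object *obj, unsigned long limit)
{
	return obj->type == TOY_BLOB && obj->size >= limit;
}

/*
 * Set `in_filtered_pack[i]` to 1 when object i would go to the pack
 * holding the filtered out objects, and to 0 when it would stay in the
 * main pack. Returns how many objects were filtered out.
 */
static size_t toy_split(const struct toy_object *objs, size_t n,
			unsigned long limit, int *in_filtered_pack)
{
	size_t nr_filtered = 0;

	assert(objs && in_filtered_pack);

	for (size_t i = 0; i < n; i++) {
		in_filtered_pack[i] = toy_filtered_out(&objs[i], limit);
		if (in_filtered_pack[i])
			nr_filtered++;
	}
	return nr_filtered;
}
```

Every object lands in exactly one of the two sets, so nothing is
deleted: commits, trees and small blobs stay in the main pack, and
only the large blobs move to the second one.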

# Range-diff since v5

 1:  bbcc368876 =  1:  da931b5082 pack-objects: allow `--filter` without `--stdout`
 2:  f1b80e5728 =  2:  10504b3699 t/helper: add 'find-pack' test-tool
 3:  ffecc73960 !  3:  ee12eb8ad7 repack: refactor finishing pack-objects command
    @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *ar
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
        struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
        struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
    -   struct pack_geometry *geometry = NULL;
    +   struct pack_geometry geometry = { 0 };
     -  struct strbuf line = STRBUF_INIT;
        struct tempfile *refs_snapshot = NULL;
        int i, ext, ret;
 4:  6c2f381a88 =  4:  d197e0c370 repack: refactor finding pack prefix
 -:  ---------- >  5:  abeef5fbad pack-bitmap-write: rebuild using new bitmap when remapping
 5:  134700c2ce !  6:  31ca2579d3 repack: add `--filter=<filter-spec>` option
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +          strvec_pushf(&cmd.args, "--filter=%s",
     +                       expand_list_objects_filter_spec(&po_args.filter_options));
     +
    -   if (geometry)
    +   if (geometry.split_factor)
                cmd.in = -1;
        else
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
        string_list_clear(&existing_nonkept_packs, 0);
        string_list_clear(&existing_kept_packs, 0);
    -   clear_pack_geometry(geometry);
    +   free_pack_geometry(&geometry);
     +  list_objects_filter_release(&po_args.filter_options);
      
        return ret;
 6:  d3365c7b48 =  7:  fa70ae85f2 gc: add `gc.repackFilter` config option
 7:  9a09382cd1 !  8:  e01ea3dd70 repack: implement `--filter-to` for storing filtered out objects
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +  else if (filter_to)
     +          die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
      
    -   if (geometry)
    +   if (geometry.split_factor)
                cmd.in = -1;
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
        }
 8:  a52e3a71db =  9:  d6ff314189 gc: add `gc.repackFilterTo` config option


Christian Couder (9):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  pack-bitmap-write: rebuild using new bitmap when remapping
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 ++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 +++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 167 +++++++++++++++------
 pack-bitmap-write.c                    |   6 +-
 t/helper/test-find-pack.c              |  50 +++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t0080-find-pack.sh                   |  82 ++++++++++
 t/t5317-pack-objects-filter-objects.sh |   8 +
 t/t6500-gc.sh                          |  24 +++
 t/t7700-repack.sh                      | 197 +++++++++++++++++++++++++
 15 files changed, 547 insertions(+), 51 deletions(-)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

-- 
2.42.0.167.gd6ff314189


* [PATCH v6 1/9] pack-objects: allow `--filter` without `--stdout`
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 2/9] t/helper: add 'find-pack' test-tool Christian Couder
                             ` (8 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index dea7eacb0f..e32404c6aa 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -296,8 +296,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 72241bdca4..e3e1d11640 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4399,12 +4399,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.42.0.167.gd6ff314189


* [PATCH v6 2/9] t/helper: add 'find-pack' test-tool
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
  2023-09-11 15:06           ` [PATCH v6 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 3/9] repack: refactor finishing pack-objects command Christian Couder
                             ` (7 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects in
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Let's also make it possible to check if a given object is in the
expected number of packfiles with a `--check-count <n>` option.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 50 ++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/t0080-find-pack.sh      | 82 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

diff --git a/Makefile b/Makefile
index 5776309365..742b76998e 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..e8bd793e58
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,50 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ *
+ * If '--check-count <n>' is passed, then error out if the number of
+ * packfiles containing the object is not <n>.
+ */
+
+static const char *find_pack_usage[] = {
+	"test-tool find-pack [--check-count <n>] <object>",
+	NULL
+};
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+	int count = -1, actual_count = 0;
+	const char *prefix = setup_git_directory();
+
+	struct option options[] = {
+		OPT_INTEGER('c', "check-count", &count, "expected number of packs"),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, prefix, options, find_pack_usage, 0);
+	if (argc != 1)
+		usage(find_pack_usage[0]);
+
+	if (repo_get_oid(the_repository, argv[0], &oid))
+		die("cannot parse %s as an object name", argv[0]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		if (find_pack_entry_one(oid.hash, p)) {
+			printf("%s\n", p->pack_name);
+			actual_count++;
+		}
+
+	if (count > -1 && count != actual_count)
+		die("bad packfile count %d instead of %d", actual_count, count);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb..41da40c296 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c..411dbf2db4 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
diff --git a/t/t0080-find-pack.sh b/t/t0080-find-pack.sh
new file mode 100755
index 0000000000..67b11216a3
--- /dev/null
+++ b/t/t0080-find-pack.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='test `test-tool find-pack`'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	test_commit three &&
+	test_commit four &&
+	test_commit five
+'
+
+test_expect_success 'repack everything into a single packfile' '
+	git repack -a -d --no-write-bitmap-index &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+	head_tree_pack=$(test-tool find-pack HEAD^{tree}) &&
+	one_pack=$(test-tool find-pack HEAD:one.t) &&
+	three_pack=$(test-tool find-pack HEAD:three.t) &&
+	old_commit_pack=$(test-tool find-pack HEAD~4) &&
+
+	test-tool find-pack --check-count 1 HEAD &&
+	test-tool find-pack --check-count=1 HEAD^{tree} &&
+	! test-tool find-pack --check-count=0 HEAD:one.t &&
+	! test-tool find-pack -c 2 HEAD:one.t &&
+	test-tool find-pack -c 1 HEAD:three.t &&
+
+	# Packfile exists at the right path
+	case "$head_commit_pack" in
+		".git/objects/pack/pack-"*".pack") true ;;
+		*) false ;;
+	esac &&
+	test -f "$head_commit_pack" &&
+
+	# Everything is in the same pack
+	test "$head_commit_pack" = "$head_tree_pack" &&
+	test "$head_commit_pack" = "$one_pack" &&
+	test "$head_commit_pack" = "$three_pack" &&
+	test "$head_commit_pack" = "$old_commit_pack"
+'
+
+test_expect_success 'add more packfiles' '
+	git rev-parse HEAD^{tree} HEAD:two.t HEAD:four.t >objects &&
+	git pack-objects .git/objects/pack/mypackname1 >packhash1 <objects &&
+
+	git rev-parse HEAD~ HEAD~^{tree} HEAD:five.t >objects &&
+	git pack-objects .git/objects/pack/mypackname2 >packhash2 <objects &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+
+	# HEAD^{tree} is in 2 packfiles
+	test-tool find-pack HEAD^{tree} >head_tree_packs &&
+	grep "$head_commit_pack" head_tree_packs &&
+	grep mypackname1 head_tree_packs &&
+	! grep mypackname2 head_tree_packs &&
+	test-tool find-pack --check-count 2 HEAD^{tree} &&
+	! test-tool find-pack --check-count 1 HEAD^{tree} &&
+
+	# HEAD:five.t is also in 2 packfiles
+	test-tool find-pack HEAD:five.t >five_packs &&
+	grep "$head_commit_pack" five_packs &&
+	! grep mypackname1 five_packs &&
+	grep mypackname2 five_packs &&
+	test-tool find-pack -c 2 HEAD:five.t &&
+	! test-tool find-pack --check-count=0 HEAD:five.t
+'
+
+test_expect_success 'add more commits (as loose objects)' '
+	test_commit six &&
+	test_commit seven &&
+
+	test -z "$(test-tool find-pack HEAD)" &&
+	test -z "$(test-tool find-pack HEAD:six.t)" &&
+	test-tool find-pack --check-count 0 HEAD &&
+	test-tool find-pack -c 0 HEAD:six.t &&
+	! test-tool find-pack -c 1 HEAD:seven.t
+'
+
+test_done
-- 
2.42.0.167.gd6ff314189



* [PATCH v6 3/9] repack: refactor finishing pack-objects command
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
  2023-09-11 15:06           ` [PATCH v6 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 2/9] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 4/9] repack: refactor finding pack prefix Christian Couder
                             ` (6 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting it into a string_list, as well as
calling finish_command().
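
As a rough illustration only, the reading loop that gets factored out
can be sketched in shell. The function name `read_pack_names` and its
two arguments (hex size, "local" flag) are made up for this sketch; the
length check and the error message follow the C code in this patch.

```shell
# Hypothetical shell analogue of finish_pack_objects_cmd(): read one
# pack hash per line from stdin, fail if a line's length differs from
# the hash's hex size, and print the name only when the pack was
# written inside the repository ("local" is 1).
read_pack_names () {
	hexsz=$1 local_pack=$2
	while IFS= read -r line
	do
		if test "${#line}" -ne "$hexsz"
		then
			echo "repack: Expecting full hex object ID lines only from pack-objects." >&2
			return 1
		fi
		if test "$local_pack" = 1
		then
			printf '%s\n' "$line"
		fi
	done
}
```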

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6943c5ba11..4f53b24958 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -695,6 +695,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -704,9 +734,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct string_list *existing_kept_packs)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -749,27 +778,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -780,10 +789,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry geometry = { 0 };
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1013,18 +1020,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.42.0.167.gd6ff314189



* [PATCH v6 4/9] repack: refactor finding pack prefix
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (2 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 3/9] repack: refactor finishing pack-objects command Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 5/9] pack-bitmap-write: rebuild using new bitmap when remapping Christian Couder
                             ` (5 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.
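
For readers less familiar with skip_prefix(), here is a minimal shell
sketch of the same prefix stripping. The shell function below is only
an illustration (it reuses the C function's name for clarity) and is
not part of this patch.

```shell
# Hypothetical shell analogue of find_pack_prefix(): strip "packdir"
# from the front of "packtmp", then drop one leading slash if present.
find_pack_prefix () {
	packdir=$1 packtmp=$2
	case "$packtmp" in
	"$packdir"*)
		;;
	*)
		echo "pack prefix $packtmp does not begin with objdir $packdir" >&2
		return 1
		;;
	esac
	pack_prefix=${packtmp#"$packdir"}
	pack_prefix=${pack_prefix#/}
	printf '%s\n' "$pack_prefix"
}
```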

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 4f53b24958..8de3009b9f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -781,6 +781,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(const char *packdir, const char *packtmp)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1028,12 +1039,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix(packdir, packtmp);
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.42.0.167.gd6ff314189



* [PATCH v6 5/9] pack-bitmap-write: rebuild using new bitmap when remapping
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (3 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 4/9] repack: refactor finding pack prefix Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
                             ` (4 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

`git repack` is about to learn a new `--filter=<filter-spec>` option and
we will want to check that this option is incompatible with
`--write-bitmap-index`.

Unfortunately it appears that a test like:

test_expect_success '--filter fails with --write-bitmap-index' '
       test_must_fail \
               env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
               git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
'

sometimes fails because, when rebuilding bitmaps, we appear to reuse
existing bitmap information. So instead of detecting that some
objects are missing and erroring out as it should, the
`git repack --write-bitmap-index --filter=...` command succeeds.

Let's fix that by making sure we rebuild bitmaps using new bitmaps
instead of existing ones.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap-write.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f6757c3cbf..f4ecdf8b0e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -413,15 +413,19 @@ static int fill_bitmap_commit(struct bb_commit *ent,
 
 		if (old_bitmap && mapping) {
 			struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c);
+			struct bitmap *remapped = bitmap_new();
 			/*
 			 * If this commit has an old bitmap, then translate that
 			 * bitmap and add its bits to this one. No need to walk
 			 * parents or the tree for this commit.
 			 */
-			if (old && !rebuild_bitmap(mapping, old, ent->bitmap)) {
+			if (old && !rebuild_bitmap(mapping, old, remapped)) {
+				bitmap_or(ent->bitmap, remapped);
+				bitmap_free(remapped);
 				reused_bitmaps_nr++;
 				continue;
 			}
+			bitmap_free(remapped);
 		}
 
 		/*
-- 
2.42.0.167.gd6ff314189



* [PATCH v6 6/9] repack: add `--filter=<filter-spec>` option
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (4 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 5/9] pack-bitmap-write: rebuild using new bitmap when remapping Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 7/9] gc: add `gc.repackFilter` config option Christian Couder
                             ` (3 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.

It's possible to find which new packfile contains the filtered out
objects using one of the following:

  - `git verify-pack -v ...`,
  - `test-tool find-pack ...`, which a previous commit added,
  - `--filter-to=<dir>`, which a following commit will add to specify
    where the pack containing the filtered out objects will be.

This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass all the previously existing packs on its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass on its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs (unless `--pack-kept-objects` is used), all prefixed with '^', so
that the objects in these packs will be omitted from the resulting pack.
The result is that only the objects filtered out by the first
`git pack-objects` command are in the pack resulting from the second
`git pack-objects` command.
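
As a rough illustration of the second pass described above, the
following shell sketch prints what would be fed to
`git pack-objects --stdin-packs`. The function name `stdin_packs_input`
and its argument order are made up for this example; only the line
formats are taken from write_filtered_pack().

```shell
# Sketch only: mimic the stdin that write_filtered_pack() builds.
# Arguments (all hypothetical): pack prefix, 1 if --pack-kept-objects
# was given (else 0), then three space-separated lists of names.
stdin_packs_input () {
	pack_prefix=$1 pack_kept_objects=$2 new=$3 existing=$4 kept=$5

	# Packs just written by the filtered pass: always excluded.
	for hash in $new
	do
		printf '^%s-%s.pack\n' "$pack_prefix" "$hash"
	done

	# Previously existing non-kept packs: included.
	for name in $existing
	do
		printf '%s.pack\n' "$name"
	done

	# Kept packs: excluded unless --pack-kept-objects was given.
	caret='^'
	if test "$pack_kept_objects" = 1
	then
		caret=
	fi
	for name in $kept
	do
		printf '%s%s.pack\n' "$caret" "$name"
	done
}
```

For example, `stdin_packs_input .tmp-1-pack 0 "$new_hash" "$old_packs"
"$kept_packs"` lists the old packs as-is and everything else with a
leading '^'.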

As the interactions with kept packs are a bit tricky, a few related
tests are added.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  12 ++++
 builtin/repack.c             |  73 +++++++++++++++++++
 t/t7700-repack.sh            | 135 +++++++++++++++++++++++++++++++++++
 3 files changed, 220 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..6d5bec7716 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,18 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  Also `--no-write-bitmap-index` (or the
+	`repack.writebitmaps` config option set to `false`) should be
+	used otherwise writing bitmap index will fail, as it supposes
+	a single packfile containing all the objects. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 8de3009b9f..ac70698a41 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -21,6 +21,7 @@
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "refs.h"
+#include "list-objects-filter-options.h"
 
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
@@ -57,6 +58,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	struct list_objects_filter_options filter_options;
 };
 
 static int repack_config(const char *var, const char *value,
@@ -725,6 +727,57 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct string_list *keep_pack_list,
+			       struct string_list *names,
+			       struct string_list *existing_packs,
+			       struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret, i;
+	const char *caret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	prepare_pack_objects(&cmd, args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	if (!pack_kept_objects)
+		strvec_push(&cmd.args, "--honor-pack-keep");
+	for (i = 0; i < keep_pack_list->nr; i++)
+		strvec_pushf(&cmd.args, "--keep-pack=%s",
+			     keep_pack_list->items[i].string);
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * Here 'names' contains only the pack(s) that were just
+	 * written, which is exactly the packs we want to keep. Also
+	 * 'existing_kept_packs' already contains the packs in
+	 * 'keep_pack_list'.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	caret = pack_kept_objects ? "" : "^";
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s%s.pack\n", caret, item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -855,6 +908,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_PARSE_LIST_OBJECTS_FILTER(&po_args.filter_options),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -868,6 +922,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
+	list_objects_filter_init(&po_args.filter_options);
+
 	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
@@ -1008,6 +1064,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (po_args.filter_options.choice)
+		strvec_pushf(&cmd.args, "--filter=%s",
+			     expand_list_objects_filter_spec(&po_args.filter_options));
+
 	if (geometry.split_factor)
 		cmd.in = -1;
 	else
@@ -1096,6 +1156,18 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter_options.choice) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(packdir, packtmp),
+					  &keep_pack_list,
+					  &names,
+					  &existing_nonkept_packs,
+					  &existing_kept_packs);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
@@ -1230,6 +1302,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&existing_nonkept_packs, 0);
 	string_list_clear(&existing_kept_packs, 0);
 	free_pack_geometry(&geometry);
+	list_objects_filter_release(&po_args.filter_options);
 
 	return ret;
 }
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 27b66807cd..39e89445fd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -327,6 +327,141 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack -c 1 HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack -c 1 HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
+test_expect_success '--filter fails with --write-bitmap-index' '
+	test_must_fail \
+		env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
+'
+
+test_expect_success 'repacking with two filters works' '
+	git init two-filters &&
+	(
+		cd two-filters &&
+		mkdir subdir &&
+		test_commit foo &&
+		test_commit subdir_bar subdir/bar &&
+		test_commit subdir_baz subdir/baz
+	) &&
+	git clone --no-local --bare two-filters two-filters.git &&
+	(
+		cd two-filters.git &&
+		test_stdout_line_count = 1 ls objects/pack/*.pack &&
+		git -c repack.writebitmaps=false repack -a -d \
+			--filter=blob:none --filter=tree:1 &&
+		test_stdout_line_count = 2 ls objects/pack/*.pack &&
+		commit_pack=$(test-tool find-pack -c 1 HEAD) &&
+		blob_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		root_tree_pack=$(test-tool find-pack -c 1 HEAD^{tree}) &&
+		subdir_tree_hash=$(git ls-tree --object-only HEAD -- subdir) &&
+		subdir_tree_pack=$(test-tool find-pack -c 1 "$subdir_tree_hash") &&
+
+		# Root tree and subdir tree are not in the same packfiles
+		test "$commit_pack" != "$blob_pack" &&
+		test "$commit_pack" = "$root_tree_pack" &&
+		test "$blob_pack" = "$subdir_tree_pack"
+	)
+'
+
+prepare_for_keep_packs () {
+	git init keep-packs &&
+	(
+		cd keep-packs &&
+		test_commit foo &&
+		test_commit bar
+	) &&
+	git clone --no-local --bare keep-packs keep-packs.git &&
+	(
+		cd keep-packs.git &&
+
+		# Create two packs
+		# The first pack will contain all of the objects except one blob
+		git rev-list --objects --all >objs &&
+		grep -v "bar.t" objs | git pack-objects pack &&
+		# The second pack will contain the excluded object and be kept
+		packid=$(grep "bar.t" objs | git pack-objects pack) &&
+		>pack-$packid.keep &&
+
+		# Replace the existing pack with the 2 new ones
+		rm -f objects/pack/pack* &&
+		mv pack-* objects/pack/
+	)
+}
+
+test_expect_success '--filter works with .keep packs' '
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack_1=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		# Object bar is still only in the old .keep pack
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$bar_pack_1" = "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		test "$foo_pack_1" != "$bar_pack_1" &&
+		test "$foo_pack_1" != "$head_pack_1" &&
+		test "$bar_pack_1" != "$head_pack_1"
+	)
+'
+
+test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
+	rm -rf keep-packs keep-packs.git &&
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--pack-kept-objects &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		test-tool find-pack -c 2 HEAD:bar.t >bar_pack_1 &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$foo_pack_1" != "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		# Object bar is in both the old .keep pack and the new
+		# pack that contained the filtered out objects
+		grep "$bar_pack" bar_pack_1 &&
+		grep "$foo_pack_1" bar_pack_1 &&
+		test "$foo_pack_1" != "$head_pack_1"
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.167.gd6ff314189



* [PATCH v6 7/9] gc: add `gc.repackFilter` config option
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (5 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                             ` (2 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

Users might want to perform such a cleanup regularly at the same time as
they perform other repacks and cleanups, so as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack and a non-empty <filter-spec> is
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.
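
The rule above can be sketched in shell as follows. The helper name
`repack_filter_arg` is hypothetical and only mirrors the check added to
add_repack_all_option(); the invocation in the trailing comment is the
one used by the new test.

```shell
# Hypothetical helper: print the extra `git repack` argument for a
# given gc.repackFilter value, or nothing when the value is empty.
repack_filter_arg () {
	repack_filter=$1
	if test -n "$repack_filter"
	then
		printf '%s\n' "--filter=$repack_filter"
	fi
}

# Invocation from the new test, shown here as a comment only:
#   git -C bare.git -c gc.repackFilter=blob:none \
#       -c repack.writeBitmaps=false -c gc.cruftPacks=false gc
```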

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 13 +++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index ca47eb2008..2153bde7ac 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -145,6 +145,11 @@ Multiple hooks are supported, but all must exit successfully, else the
 operation (either generating a cruft pack or unpacking unreachable
 objects) will be halted.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 369bd43fb2..607c0ac23e 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -61,6 +61,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -170,6 +171,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -355,6 +358,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..232e403b66 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,19 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.167.gd6ff314189



* [PATCH v6 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (6 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 7/9] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-11 15:06           ` [PATCH v6 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

It would be nice if this new different pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage.

Even in a different directory, this pack can be accessible if, for
example, the Git alternates mechanism is used to point to it. In fact
not using the Git alternates mechanism can corrupt a repo as the
generated pack containing the filtered objects might not be accessible
from the repo any more. So setting up the Git alternates mechanism
should be done before using this feature if the user wants the repo to
be fully usable while this feature is used.

In some cases, like when a repo has just been cloned or when there is no
other activity in the repo, it's OK to set up the Git alternates
mechanism afterwards though. It's also OK to just inspect the generated
packfile containing the filtered objects and then just move it into the
'.git/objects/pack/' directory manually. That's why it's not necessary
for this command to check that the Git alternates mechanism has
already been set up.
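
A minimal sketch of setting up the alternates mechanism beforehand
could look like this. The function name `add_alternate` and the
directory names in the comments are assumptions made for this example;
the `objects/info/alternates` file, with one object directory per line,
is the layout described in gitrepository-layout[5].

```shell
# Sketch only: register a second object directory as an alternate of a
# repository, creating the needed directories if they are missing.
# $1 is the repository's object directory, $2 the alternate one.
add_alternate () {
	repo_objects=$1 alt_objects=$2
	mkdir -p "$repo_objects/info" "$alt_objects/pack" &&
	printf '%s\n' "$alt_objects" >>"$repo_objects/info/alternates"
}

# Possible usage, with hypothetical paths:
#   add_alternate bare.git/objects "$(pwd)/filtered.git/objects"
# followed by a repack using --filter and --filter-to that writes the
# filtered out pack under filtered.git/objects/pack/.
```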

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt | 11 +++++++
 builtin/repack.c             | 10 +++++-
 t/t7700-repack.sh            | 62 ++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 6d5bec7716..8545a32667 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,17 @@ depth is 4095.
 	a single packfile containing all the objects. See
 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. Only useful with `--filter`. This can be
+	used for putting the pack on a separate object directory that
+	is accessed through the Git alternates mechanism. **WARNING:**
+	If the packfile containing the filtered out objects is not
+	accessible, the repo can become corrupt as it might not be
+	possible to access the objects in that packfile. See the
+	`objects` and `objects/info/alternates` sections of
+	linkgit:gitrepository-layout[5].
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index ac70698a41..e0e1b52cf0 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -867,6 +867,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -919,6 +920,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1067,6 +1070,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (po_args.filter_options.choice)
 		strvec_pushf(&cmd.args, "--filter=%s",
 			     expand_list_objects_filter_spec(&po_args.filter_options));
+	else if (filter_to)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
 
 	if (geometry.split_factor)
 		cmd.in = -1;
@@ -1157,8 +1162,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter_options.choice) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(packdir, packtmp),
 					  &keep_pack_list,
 					  &names,
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 39e89445fd..48e92aa6f7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -462,6 +462,68 @@ test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
 	)
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 0 HEAD:file1) &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack -c 1 $blob_hash) &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar) &&
+		base_pack=$(test-tool find-pack -c 1 HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.167.gd6ff314189


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v6 9/9] gc: add `gc.repackFilterTo` config option
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (7 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-09-11 15:06           ` Christian Couder
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:06 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit has implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack with a non-empty <dir> configured
through this option, the repack process is passed a corresponding
`--filter-to=<dir>` argument.
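
For illustration only, the two options could be set together as
sketched below. The helper name `configure_filtered_gc` is hypothetical,
and an absolute pack prefix is assumed so that the location does not
depend on the directory `git gc` is run from:

```shell
# Hypothetical helper, not part of Git: configure <repo> so that a
# later `git gc` repacks with <filter> and writes the pack containing
# the filtered out objects using <pack_prefix>.
configure_filtered_gc () {
	repo=$1 filter=$2 pack_prefix=$3
	git -C "$repo" config gc.repackFilter "$filter" &&
	git -C "$repo" config gc.repackFilterTo "$pack_prefix"
}

# Possible use:
#   configure_filtered_gc bare.git blob:none \
#	"$(pwd)/filtered.git/objects/pack/pack" &&
#   git -C bare.git gc
```

With such a configuration, `git gc` would be expected to run
`git repack` with both `--filter=blob:none` and the corresponding
`--filter-to=<dir>` argument. The test added below additionally
disables bitmaps and cruft packs.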

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt | 11 +++++++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 2153bde7ac..466466d6cc 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -150,6 +150,17 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects. **WARNING:** The
+	specified location should be accessible, using for example the
+	Git alternates mechanism, otherwise the repo could be
+	considered corrupt by Git as it might not be able to access the
+	objects in that packfile. See the `--filter-to=<dir>` option
+	of linkgit:git-repack[1] and the `objects/info/alternates`
+	section of linkgit:gitrepository-layout[5].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 607c0ac23e..8aad103b45 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -172,6 +173,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -361,6 +363,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 232e403b66..e412cf8daf 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -215,6 +214,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo stores filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.167.gd6ff314189


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [PATCH v5 0/8] Repack objects into separate packfiles based on a filter
  2023-08-15 23:09               ` Taylor Blau
  2023-08-15 23:18                 ` Junio C Hamano
@ 2023-09-11 15:20                 ` Christian Couder
  1 sibling, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-11 15:20 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Wed, Aug 16, 2023 at 1:09 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Tue, Aug 15, 2023 at 03:32:23PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:

> > > but I wonder if a more complete fix would be something like:
> > > ...
> > > @@ -966,6 +972,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> > >     if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
> > >             die(_(incremental_bitmap_conflict_error));
> > >
> > > +   if (write_bitmaps && po_args.filter_options.choice)
> > > +           die(_(filtered_bitmap_conflict_error));
> > > +
> >
> > It sounds like the most direct fix.

I would be Ok with such a fix, if we think that we don't want to fix
the underlying issue, or if we think that fixing the underlying issue
is not enough...

> I agree.
>
> I think that we would be OK to not change the implementation of
> rebuild_bitmap(), or its caller in fill_bitmap_commit(), since this only
> bites us when bitmapping a filtered pack, and we should catch that case
> well before getting this deep into the bitmap code.
>
> But it does seem suspect that we rebuild right into ent->bitmap, so we
> may want to consider doing something like:
>
> --- >8 ---
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index f6757c3cbf..f4ecdf8b0e 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -413,15 +413,19 @@ static int fill_bitmap_commit(struct bb_commit *ent,
>
>                 if (old_bitmap && mapping) {
>                         struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c);
> +                       struct bitmap *remapped = bitmap_new();
>                         /*
>                          * If this commit has an old bitmap, then translate that
>                          * bitmap and add its bits to this one. No need to walk
>                          * parents or the tree for this commit.
>                          */
> -                       if (old && !rebuild_bitmap(mapping, old, ent->bitmap)) {
> +                       if (old && !rebuild_bitmap(mapping, old, remapped)) {
> +                               bitmap_or(ent->bitmap, remapped);
> +                               bitmap_free(remapped);
>                                 reused_bitmaps_nr++;
>                                 continue;
>                         }
> +                       bitmap_free(remapped);
>                 }
>
>                 /*
> --- 8< ---
>
> on top.
>
> Applying that patch and then rerunning the tests with the appropriate
> TEST variables causes the 'git repack' to fail as expected, ensuring
> that the containing test passes.

...however I think that fixing this underlying issue is important, as
it might cause other tricky issues in the future, for example if other
bitmap code is copying or reusing this code.

So I just sent a version 6 of this series with this change in a new
patch. I hope my explanations in the commit message are good enough.

Thanks for finding the cause of the CI test failures and suggesting this fix,
Christian.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v7 0/9] Repack objects into separate packfiles based on a filter
  2023-09-11 15:06         ` [PATCH v6 0/9] " Christian Couder
                             ` (8 preceding siblings ...)
  2023-09-11 15:06           ` [PATCH v6 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-09-25 15:25           ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
                               ` (10 more replies)
  9 siblings, 11 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` option removed the
filtered out objects altogether, which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     objects can be manually removed after checking that all the objects
     it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also, since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 6

Thanks to Junio who reviewed or commented on versions 1, 2, 3, 4 and
5, and to Taylor who reviewed or commented on versions 1, 3, 4, 5 and
6!  Thanks also to Robert Coup who participated in the discussions
related to version 2 and Peff who participated in the discussions
related to version 4. There are only the following changes since
version 6:

- This series has been rebased on top of bcb6cae296 (The twelfth
  batch, 2023-09-22) to fix conflicts with a `builtin/repack.c`
  refactoring patch series called tb/repack-existing-packs-cleanup by
  Taylor Blau that recently graduated to 'master':

	https://lore.kernel.org/git/cover.1694632644.git.me@ttaylorr.com/
	https://lore.kernel.org/git/xmqqil81wqkx.fsf@gitster.g/

- Patch 6/9 (repack: add `--filter=<filter-spec>` option) has been
  reworked to apply on top of the above mentioned patch series.
  Taylor even posted the fixup patch to apply to this series so that
  it works well on top of his series:
  
    https://lore.kernel.org/git/ZQNKkn0YYLUyN5Ih@nand.local/

I checked that the CI tests pass in:

https://github.com/chriscool/git/actions/runs/6300816764

All jobs seem to have succeeded.

# Commit overview

(No changes in any of the patches compared to version 6, except on
patch 6/9.)

* 1/9 pack-objects: allow `--filter` without `--stdout`

  To be able to later repack with a filter we need `git pack-objects`
  to write packfiles when it's filtering instead of just writing the
  pack without the filtered out objects to stdout.

* 2/9 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object.

* 3/9 repack: refactor finishing pack-objects command

  This is a small refactoring creating a new useful function, so that
  `git repack --filter=...` will be able to reuse it.

* 4/9 repack: refactor finding pack prefix

  This is another small refactoring creating a small function that
  will be reused in the next patch.

* 5/9 pack-bitmap-write: rebuild using new bitmap when remapping

  This patch is new in version 6. It fixes an issue when bitmaps are
  rebuilt that was revealed by this series, and caused a CI test to
  fail.

* 6/9 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with the `--filter` option. And then
  another `git pack-objects` process with the `--stdin-packs`
  option. This is the only patch changed in version 7.
  
* 7/9 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate packfile.

* 8/9 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's interesting to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles.

* 9/9 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`.

# Range-diff since v6

 1:  da931b5082 =  1:  eec0c09731 pack-objects: allow `--filter` without `--stdout`
 2:  10504b3699 =  2:  19c8b8a4b9 t/helper: add 'find-pack' test-tool
 3:  ee12eb8ad7 !  3:  aaaf40bd5d repack: refactor finishing pack-objects command
    @@ builtin/repack.c: static void remove_redundant_bitmaps(struct string_list *inclu
                            const char *destination,
                            const char *pack_prefix,
     @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *args,
    -                       struct string_list *existing_kept_packs)
    +                       struct existing_packs *existing)
      {
        struct child_process cmd = CHILD_PROCESS_INIT;
     -  struct strbuf line = STRBUF_INIT;
    @@ builtin/repack.c: static int write_cruft_pack(const struct pack_objects_args *ar
      
      int cmd_repack(int argc, const char **argv, const char *prefix)
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    -   struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
    -   struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
    +   struct string_list names = STRING_LIST_INIT_DUP;
    +   struct existing_packs existing = EXISTING_PACKS_INIT;
        struct pack_geometry geometry = { 0 };
     -  struct strbuf line = STRBUF_INIT;
        struct tempfile *refs_snapshot = NULL;
 4:  d197e0c370 =  4:  1eb6bc3f7e repack: refactor finding pack prefix
 5:  abeef5fbad =  5:  b9159e1803 pack-bitmap-write: rebuild using new bitmap when remapping
 6:  31ca2579d3 !  6:  f2f5bb54d3 repack: add `--filter=<filter-spec>` option
    @@ Commit message
         As the interactions with kept packs are a bit tricky, a few related
         tests are added.
     
    +    Helped-by: Taylor Blau <me@ttaylorr.com>
         Signed-off-by: John Cai <johncai86@gmail.com>
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    @@ builtin/repack.c: static int finish_pack_objects_cmd(struct child_process *cmd,
     +static int write_filtered_pack(const struct pack_objects_args *args,
     +                         const char *destination,
     +                         const char *pack_prefix,
    -+                         struct string_list *keep_pack_list,
    -+                         struct string_list *names,
    -+                         struct string_list *existing_packs,
    -+                         struct string_list *existing_kept_packs)
    ++                         struct existing_packs *existing,
    ++                         struct string_list *names)
     +{
     +  struct child_process cmd = CHILD_PROCESS_INIT;
     +  struct string_list_item *item;
     +  FILE *in;
    -+  int ret, i;
    ++  int ret;
     +  const char *caret;
     +  const char *scratch;
     +  int local = skip_prefix(destination, packdir, &scratch);
    @@ builtin/repack.c: static int finish_pack_objects_cmd(struct child_process *cmd,
     +
     +  if (!pack_kept_objects)
     +          strvec_push(&cmd.args, "--honor-pack-keep");
    -+  for (i = 0; i < keep_pack_list->nr; i++)
    -+          strvec_pushf(&cmd.args, "--keep-pack=%s",
    -+                       keep_pack_list->items[i].string);
    ++  for_each_string_list_item(item, &existing->kept_packs)
    ++          strvec_pushf(&cmd.args, "--keep-pack=%s", item->string);
     +
     +  cmd.in = -1;
     +
    @@ builtin/repack.c: static int finish_pack_objects_cmd(struct child_process *cmd,
     +  in = xfdopen(cmd.in, "w");
     +  for_each_string_list_item(item, names)
     +          fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
    -+  for_each_string_list_item(item, existing_packs)
    ++  for_each_string_list_item(item, &existing->non_kept_packs)
    ++          fprintf(in, "%s.pack\n", item->string);
    ++  for_each_string_list_item(item, &existing->cruft_packs)
     +          fprintf(in, "%s.pack\n", item->string);
     +  caret = pack_kept_objects ? "" : "^";
    -+  for_each_string_list_item(item, existing_kept_packs)
    ++  for_each_string_list_item(item, &existing->kept_packs)
     +          fprintf(in, "%s%s.pack\n", caret, item->string);
     +  fclose(in);
     +
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +          ret = write_filtered_pack(&po_args,
     +                                    packtmp,
     +                                    find_pack_prefix(packdir, packtmp),
    -+                                    &keep_pack_list,
    -+                                    &names,
    -+                                    &existing_nonkept_packs,
    -+                                    &existing_kept_packs);
    ++                                    &existing,
    ++                                    &names);
     +          if (ret)
     +                  goto cleanup;
     +  }
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      
        close_object_store(the_repository->objects);
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    -   string_list_clear(&existing_nonkept_packs, 0);
    -   string_list_clear(&existing_kept_packs, 0);
    +   string_list_clear(&names, 1);
    +   existing_packs_release(&existing);
        free_pack_geometry(&geometry);
     +  list_objects_filter_release(&po_args.filter_options);
      
 7:  fa70ae85f2 =  7:  7ea0307628 gc: add `gc.repackFilter` config option
 8:  e01ea3dd70 !  8:  698647815b repack: implement `--filter-to` for storing filtered out objects
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     -                                    packtmp,
     +                                    filter_to,
                                          find_pack_prefix(packdir, packtmp),
    -                                     &keep_pack_list,
    -                                     &names,
    +                                     &existing,
    +                                     &names);
     
      ## t/t7700-repack.sh ##
     @@ t/t7700-repack.sh: test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
 9:  d6ff314189 =  9:  57b2ba444c gc: add `gc.repackFilterTo` config option


Christian Couder (9):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  pack-bitmap-write: rebuild using new bitmap when remapping
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 ++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 +++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 164 ++++++++++++++------
 pack-bitmap-write.c                    |   6 +-
 t/helper/test-find-pack.c              |  50 +++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t0080-find-pack.sh                   |  82 ++++++++++
 t/t5317-pack-objects-filter-objects.sh |   8 +
 t/t6500-gc.sh                          |  24 +++
 t/t7700-repack.sh                      | 197 +++++++++++++++++++++++++
 15 files changed, 544 insertions(+), 51 deletions(-)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh

-- 
2.42.0.279.g57b2ba444c


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v7 1/9] pack-objects: allow `--filter` without `--stdout`
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 2/9] t/helper: add 'find-pack' test-tool Christian Couder
                               ` (9 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index dea7eacb0f..e32404c6aa 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -296,8 +296,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6eb9756836..89a8b5a976 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4402,12 +4402,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 2/9] t/helper: add 'find-pack' test-tool
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
  2023-09-25 15:25             ` [PATCH v7 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 3/9] repack: refactor finishing pack-objects command Christian Couder
                               ` (8 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects in
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Let's also make it possible to check if a given object is in the
expected number of packfiles with a `--check-count <n>` option.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 50 ++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/t0080-find-pack.sh      | 82 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0080-find-pack.sh
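
For readers without a Git build tree, here is a minimal sketch of what the
new helper reports, written with ordinary plumbing. It is not the code
added by this patch: the `find_pack` function name is hypothetical, and it
assumes a SHA-1 repository, since `git show-index` is called without
`--object-format`.

```shell
# Hypothetical stand-in for `test-tool find-pack <object>`: print the path
# of every packfile whose index lists the given object, one per line.
find_pack () {
	# Resolve the argument (e.g. HEAD, HEAD^{tree}) to a full object ID.
	oid=$(git rev-parse --verify "$1") || return 1
	for idx in "$(git rev-parse --git-dir)"/objects/pack/*.idx
	do
		# Skip the unexpanded glob when there are no packs at all.
		test -f "$idx" || continue
		# `git show-index` lists "<offset> <oid> ..." for each entry.
		if git show-index <"$idx" | grep -q " $oid"
		then
			echo "${idx%.idx}.pack"
		fi
	done
}
```

Counting the output lines (for example with `wc -l`) gives roughly what
`--check-count <n>` verifies. A loose object produces no output at all.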

diff --git a/Makefile b/Makefile
index 003e63b792..f267034d23 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..e8bd793e58
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,50 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ *
+ * If '--check-count <n>' is passed, then error out if the number of
+ * packfiles containing the object is not <n>.
+ */
+
+static const char *find_pack_usage[] = {
+	"test-tool find-pack [--check-count <n>] <object>",
+	NULL
+};
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+	int count = -1, actual_count = 0;
+	const char *prefix = setup_git_directory();
+
+	struct option options[] = {
+		OPT_INTEGER('c', "check-count", &count, "expected number of packs"),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, prefix, options, find_pack_usage, 0);
+	if (argc != 1)
+		usage(find_pack_usage[0]);
+
+	if (repo_get_oid(the_repository, argv[0], &oid))
+		die("cannot parse %s as an object name", argv[0]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		if (find_pack_entry_one(oid.hash, p)) {
+			printf("%s\n", p->pack_name);
+			actual_count++;
+		}
+
+	if (count > -1 && count != actual_count)
+		die("bad packfile count %d instead of %d", actual_count, count);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 621ac3dd10..9010ac6de7 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index a641c3a81d..f134f96b97 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
diff --git a/t/t0080-find-pack.sh b/t/t0080-find-pack.sh
new file mode 100755
index 0000000000..67b11216a3
--- /dev/null
+++ b/t/t0080-find-pack.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='test `test-tool find-pack`'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	test_commit three &&
+	test_commit four &&
+	test_commit five
+'
+
+test_expect_success 'repack everything into a single packfile' '
+	git repack -a -d --no-write-bitmap-index &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+	head_tree_pack=$(test-tool find-pack HEAD^{tree}) &&
+	one_pack=$(test-tool find-pack HEAD:one.t) &&
+	three_pack=$(test-tool find-pack HEAD:three.t) &&
+	old_commit_pack=$(test-tool find-pack HEAD~4) &&
+
+	test-tool find-pack --check-count 1 HEAD &&
+	test-tool find-pack --check-count=1 HEAD^{tree} &&
+	! test-tool find-pack --check-count=0 HEAD:one.t &&
+	! test-tool find-pack -c 2 HEAD:one.t &&
+	test-tool find-pack -c 1 HEAD:three.t &&
+
+	# Packfile exists at the right path
+	case "$head_commit_pack" in
+		".git/objects/pack/pack-"*".pack") true ;;
+		*) false ;;
+	esac &&
+	test -f "$head_commit_pack" &&
+
+	# Everything is in the same pack
+	test "$head_commit_pack" = "$head_tree_pack" &&
+	test "$head_commit_pack" = "$one_pack" &&
+	test "$head_commit_pack" = "$three_pack" &&
+	test "$head_commit_pack" = "$old_commit_pack"
+'
+
+test_expect_success 'add more packfiles' '
+	git rev-parse HEAD^{tree} HEAD:two.t HEAD:four.t >objects &&
+	git pack-objects .git/objects/pack/mypackname1 >packhash1 <objects &&
+
+	git rev-parse HEAD~ HEAD~^{tree} HEAD:five.t >objects &&
+	git pack-objects .git/objects/pack/mypackname2 >packhash2 <objects &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+
+	# HEAD^{tree} is in 2 packfiles
+	test-tool find-pack HEAD^{tree} >head_tree_packs &&
+	grep "$head_commit_pack" head_tree_packs &&
+	grep mypackname1 head_tree_packs &&
+	! grep mypackname2 head_tree_packs &&
+	test-tool find-pack --check-count 2 HEAD^{tree} &&
+	! test-tool find-pack --check-count 1 HEAD^{tree} &&
+
+	# HEAD:five.t is also in 2 packfiles
+	test-tool find-pack HEAD:five.t >five_packs &&
+	grep "$head_commit_pack" five_packs &&
+	! grep mypackname1 five_packs &&
+	grep mypackname2 five_packs &&
+	test-tool find-pack -c 2 HEAD:five.t &&
+	! test-tool find-pack --check-count=0 HEAD:five.t
+'
+
+test_expect_success 'add more commits (as loose objects)' '
+	test_commit six &&
+	test_commit seven &&
+
+	test -z "$(test-tool find-pack HEAD)" &&
+	test -z "$(test-tool find-pack HEAD:six.t)" &&
+	test-tool find-pack --check-count 0 HEAD &&
+	test-tool find-pack -c 0 HEAD:six.t &&
+	! test-tool find-pack -c 1 HEAD:seven.t
+'
+
+test_done
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 3/9] repack: refactor finishing pack-objects command
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
  2023-09-25 15:25             ` [PATCH v7 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 2/9] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 4/9] repack: refactor finding pack prefix Christian Couder
                               ` (7 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting them into a string_list, as well
as calling finish_command().

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)
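
The loop that this patch factors out can be pictured with a small shell
sketch. This only illustrates the logic and is not code from the patch:
`collect_pack_names` is a hypothetical name, and the default length of 40
assumes SHA-1 (the C code uses `the_hash_algo->hexsz`).

```shell
# Hypothetical sketch: read one pack hash per line from stdin, as
# `git pack-objects` prints them, and echo the ones to remember.
# $1: 1 if the pack was written inside the repository, 0 otherwise
# $2: expected length of a hex object ID (assumed 40, i.e. SHA-1)
collect_pack_names () {
	local_flag=$1
	hexsz=${2:-40}
	while IFS= read -r line
	do
		# Like the C code, only the line length is checked.
		if test "${#line}" -ne "$hexsz"
		then
			echo "repack: Expecting full hex object ID lines only from pack-objects." >&2
			return 1
		fi
		# Packs written outside of the repository are not recorded.
		if test "$local_flag" = 1
		then
			printf '%s\n' "$line"
		fi
	done
	return 0
}
```

Both former call sites reduce to this shape. They differ only in whether
the `local` flag is computed (write_cruft_pack()) or always 1
(cmd_repack()).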

diff --git a/builtin/repack.c b/builtin/repack.c
index 529e13120d..d0ab55c0d9 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -806,6 +806,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -814,9 +844,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct existing_packs *existing)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -861,27 +890,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -891,10 +900,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct existing_packs existing = EXISTING_PACKS_INIT;
 	struct pack_geometry geometry = { 0 };
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1124,18 +1131,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 4/9] repack: refactor finding pack prefix
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (2 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 3/9] repack: refactor finishing pack-objects command Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 5/9] pack-bitmap-write: rebuild using new bitmap when remapping Christian Couder
                               ` (6 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)
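
As an illustration of what the new function computes, here is a shell
rendering of the same prefix stripping. It is a sketch, not code from the
patch, and the example paths in the note below are assumed values chosen
for illustration.

```shell
# Hypothetical shell rendering of find_pack_prefix(packdir, packtmp):
# strip the packdir prefix from packtmp, then at most one leading slash.
find_pack_prefix () {
	packdir=$1
	packtmp=$2
	case "$packtmp" in
	"$packdir"*)
		;;
	*)
		echo "pack prefix $packtmp does not begin with objdir $packdir" >&2
		return 1
		;;
	esac
	pack_prefix=${packtmp#"$packdir"}
	printf '%s\n' "${pack_prefix#/}"
}
```

For example, `find_pack_prefix .git/objects/pack .git/objects/pack/.tmp-123-pack`
prints `.tmp-123-pack`. The function fails when the second argument does
not start with the first.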

diff --git a/builtin/repack.c b/builtin/repack.c
index d0ab55c0d9..9ef0044384 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -893,6 +893,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(const char *packdir, const char *packtmp)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1139,12 +1150,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix(packdir, packtmp);
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 5/9] pack-bitmap-write: rebuild using new bitmap when remapping
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (3 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 4/9] repack: refactor finding pack prefix Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
                               ` (5 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

`git repack` is about to learn a new `--filter=<filter-spec>` option and
we will want to check that this option is incompatible with
`--write-bitmap-index`.

Unfortunately it appears that a test like:

test_expect_success '--filter fails with --write-bitmap-index' '
       test_must_fail \
               env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
               git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
'

sometimes fails because, when rebuilding bitmaps, it appears that we are
reusing existing bitmap information. So instead of detecting that some
objects are missing and erroring out as it should, the
`git repack --write-bitmap-index --filter=...` command succeeds.

Let's fix that by making sure we rebuild bitmaps using new bitmaps
instead of existing ones.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap-write.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f6757c3cbf..f4ecdf8b0e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -413,15 +413,19 @@ static int fill_bitmap_commit(struct bb_commit *ent,
 
 		if (old_bitmap && mapping) {
 			struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c);
+			struct bitmap *remapped = bitmap_new();
 			/*
 			 * If this commit has an old bitmap, then translate that
 			 * bitmap and add its bits to this one. No need to walk
 			 * parents or the tree for this commit.
 			 */
-			if (old && !rebuild_bitmap(mapping, old, ent->bitmap)) {
+			if (old && !rebuild_bitmap(mapping, old, remapped)) {
+				bitmap_or(ent->bitmap, remapped);
+				bitmap_free(remapped);
 				reused_bitmaps_nr++;
 				continue;
 			}
+			bitmap_free(remapped);
 		}
 
 		/*
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 6/9] repack: add `--filter=<filter-spec>` option
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (4 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 5/9] pack-bitmap-write: rebuild using new bitmap when remapping Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 7/9] gc: add `gc.repackFilter` config option Christian Couder
                               ` (4 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them onto separate storage that is cheaper, though
slower.

It's possible to find which new packfile contains the filtered out
objects using one of the following:

  - `git verify-pack -v ...`,
  - `test-tool find-pack ...`, which a previous commit added,
  - `--filter-to=<dir>`, which a following commit will add to specify
    where the pack containing the filtered out objects will be.

This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. On its stdin, we pass it all the previously existing
packs, so that it will pack all the objects they contain. But we also
pass it, still on its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs (unless `--pack-kept-objects` is used), all prefixed with '^', so
that the objects in these packs will be omitted from the resulting pack.
The result is that only the objects filtered out by the first
`git pack-objects` command are in the pack resulting from the second
`git pack-objects` command.

As the interactions with kept packs are a bit tricky, a few related
tests are added.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  12 ++++
 builtin/repack.c             |  70 ++++++++++++++++++
 t/t7700-repack.sh            | 135 +++++++++++++++++++++++++++++++++++
 3 files changed, 217 insertions(+)
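
The stdin fed to the second `git pack-objects --stdin-packs` command, as
described in the commit message above, can be sketched as follows. This
is a simplified illustration and not the code of write_filtered_pack():
the function name and its argument convention are hypothetical, pack names
are passed as whole basenames without the `.pack` suffix, and cruft packs
are folded into the "existing" list.

```shell
# Hypothetical sketch of the input written to `--stdin-packs`.
# $1: pack(s) just written by the `--filter` run (excluded with '^')
# $2: previously existing non-kept packs (included)
# $3: kept packs
# $4: 1 if --pack-kept-objects was given, 0 otherwise
stdin_packs_input () {
	new_packs=$1
	existing_packs=$2
	kept_packs=$3
	pack_kept_objects=$4

	for p in $new_packs
	do
		printf '^%s.pack\n' "$p"
	done
	for p in $existing_packs
	do
		printf '%s.pack\n' "$p"
	done
	# Kept packs are excluded unless their objects should be repacked.
	if test "$pack_kept_objects" = 1
	then
		caret=
	else
		caret='^'
	fi
	for p in $kept_packs
	do
		printf '%s%s.pack\n' "$caret" "$p"
	done
}
```

Calling `stdin_packs_input pack-new "pack-old1 pack-old2" pack-kept 0`
prints `^pack-new.pack`, `pack-old1.pack`, `pack-old2.pack` and
`^pack-kept.pack`, one per line. With `1` as the last argument, the final
line loses its caret.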

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..6d5bec7716 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,18 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  Also `--no-write-bitmap-index` (or the
+	`repack.writebitmaps` config option set to `false`) should be
+	used otherwise writing bitmap index will fail, as it supposes
+	a single packfile containing all the objects. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 9ef0044384..c7b564192f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -21,6 +21,7 @@
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "refs.h"
+#include "list-objects-filter-options.h"
 
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
@@ -56,6 +57,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	struct list_objects_filter_options filter_options;
 };
 
 static int repack_config(const char *var, const char *value,
@@ -836,6 +838,56 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct existing_packs *existing,
+			       struct string_list *names)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+	const char *caret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	prepare_pack_objects(&cmd, args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	if (!pack_kept_objects)
+		strvec_push(&cmd.args, "--honor-pack-keep");
+	for_each_string_list_item(item, &existing->kept_packs)
+		strvec_pushf(&cmd.args, "--keep-pack=%s", item->string);
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * Here 'names' contains only the pack(s) that were just
+	 * written, which is exactly the packs we want to keep. Also
+	 * 'existing_kept_packs' already contains the packs in
+	 * 'keep_pack_list'.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, &existing->non_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	for_each_string_list_item(item, &existing->cruft_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	caret = pack_kept_objects ? "" : "^";
+	for_each_string_list_item(item, &existing->kept_packs)
+		fprintf(in, "%s%s.pack\n", caret, item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -966,6 +1018,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_PARSE_LIST_OBJECTS_FILTER(&po_args.filter_options),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -979,6 +1032,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
+	list_objects_filter_init(&po_args.filter_options);
+
 	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
@@ -1119,6 +1174,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (po_args.filter_options.choice)
+		strvec_pushf(&cmd.args, "--filter=%s",
+			     expand_list_objects_filter_spec(&po_args.filter_options));
+
 	if (geometry.split_factor)
 		cmd.in = -1;
 	else
@@ -1205,6 +1264,16 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter_options.choice) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(packdir, packtmp),
+					  &existing,
+					  &names);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
@@ -1297,6 +1366,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 1);
 	existing_packs_release(&existing);
 	free_pack_geometry(&geometry);
+	list_objects_filter_release(&po_args.filter_options);
 
 	return ret;
 }
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 27b66807cd..39e89445fd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -327,6 +327,141 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack -c 1 HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack -c 1 HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
+test_expect_success '--filter fails with --write-bitmap-index' '
+	test_must_fail \
+		env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
+'
+
+test_expect_success 'repacking with two filters works' '
+	git init two-filters &&
+	(
+		cd two-filters &&
+		mkdir subdir &&
+		test_commit foo &&
+		test_commit subdir_bar subdir/bar &&
+		test_commit subdir_baz subdir/baz
+	) &&
+	git clone --no-local --bare two-filters two-filters.git &&
+	(
+		cd two-filters.git &&
+		test_stdout_line_count = 1 ls objects/pack/*.pack &&
+		git -c repack.writebitmaps=false repack -a -d \
+			--filter=blob:none --filter=tree:1 &&
+		test_stdout_line_count = 2 ls objects/pack/*.pack &&
+		commit_pack=$(test-tool find-pack -c 1 HEAD) &&
+		blob_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		root_tree_pack=$(test-tool find-pack -c 1 HEAD^{tree}) &&
+		subdir_tree_hash=$(git ls-tree --object-only HEAD -- subdir) &&
+		subdir_tree_pack=$(test-tool find-pack -c 1 "$subdir_tree_hash") &&
+
+		# Root tree and subdir tree are not in the same packfiles
+		test "$commit_pack" != "$blob_pack" &&
+		test "$commit_pack" = "$root_tree_pack" &&
+		test "$blob_pack" = "$subdir_tree_pack"
+	)
+'
+
+prepare_for_keep_packs () {
+	git init keep-packs &&
+	(
+		cd keep-packs &&
+		test_commit foo &&
+		test_commit bar
+	) &&
+	git clone --no-local --bare keep-packs keep-packs.git &&
+	(
+		cd keep-packs.git &&
+
+		# Create two packs
+		# The first pack will contain all of the objects except one blob
+		git rev-list --objects --all >objs &&
+		grep -v "bar.t" objs | git pack-objects pack &&
+		# The second pack will contain the excluded object and be kept
+		packid=$(grep "bar.t" objs | git pack-objects pack) &&
+		>pack-$packid.keep &&
+
+		# Replace the existing pack with the 2 new ones
+		rm -f objects/pack/pack* &&
+		mv pack-* objects/pack/
+	)
+}
+
+test_expect_success '--filter works with .keep packs' '
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack_1=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		# Object bar is still only in the old .keep pack
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$bar_pack_1" = "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		test "$foo_pack_1" != "$bar_pack_1" &&
+		test "$foo_pack_1" != "$head_pack_1" &&
+		test "$bar_pack_1" != "$head_pack_1"
+	)
+'
+
+test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
+	rm -rf keep-packs keep-packs.git &&
+	prepare_for_keep_packs &&
+	(
+		cd keep-packs.git &&
+
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar.t) &&
+		head_pack=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" = "$head_pack" &&
+
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--pack-kept-objects &&
+
+		foo_pack_1=$(test-tool find-pack -c 1 HEAD:foo.t) &&
+		test-tool find-pack -c 2 HEAD:bar.t >bar_pack_1 &&
+		head_pack_1=$(test-tool find-pack -c 1 HEAD) &&
+
+		test "$foo_pack_1" != "$foo_pack" &&
+		test "$foo_pack_1" != "$bar_pack" &&
+		test "$head_pack_1" != "$head_pack" &&
+
+		# Object bar is in both the old .keep pack and the new
+		# pack that contained the filtered out objects
+		grep "$bar_pack" bar_pack_1 &&
+		grep "$foo_pack_1" bar_pack_1 &&
+		test "$foo_pack_1" != "$head_pack_1"
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 7/9] gc: add `gc.repackFilter` config option
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (5 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
                               ` (3 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new, separate pack.

Users might want to perform such a cleanup regularly at the same time as
they perform other repacks and cleanups, so as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now when `git gc` performs a repack and a non-empty <filter-spec> is
configured through this option, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.
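
For illustration, the configuration side of this change can be sketched in
shell as follows. This is a minimal sketch, not part of the patch: the
scratch repository and the `configured_filter` and `repack_arg` variables
are made up for the example, and the `test -n` check only mimics the
`repack_filter && *repack_filter` condition added to builtin/gc.c.

```shell
# Hypothetical scratch repository, used only for this illustration.
repo=$(mktemp -d) &&
git init -q "$repo" &&

# Store the filter in the repository config. A Git without this series
# accepts the key too, but `git gc` would simply ignore it.
git -C "$repo" config gc.repackFilter blob:none &&

# Read the value back and build the argument that gc would pass to
# repack, but only when the value is set and not empty.
configured_filter=$(git -C "$repo" config --get gc.repackFilter) &&
repack_arg= &&
if test -n "$configured_filter"
then
	repack_arg="--filter=$configured_filter"
fi &&
echo "$repack_arg"
```

With a Git that has this series applied, running `git gc` in that
repository should then launch `git repack` with `--filter=blob:none`,
which is what the new test looks for in the trace output.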

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  5 +++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 13 +++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index ca47eb2008..2153bde7ac 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -145,6 +145,11 @@ Multiple hooks are supported, but all must exit successfully, else the
 operation (either generating a cruft pack or unpacking unreachable
 objects) will be halted.
 
+gc.repackFilter::
+	When repacking, use the specified filter to move certain
+	objects into a separate packfile.  See the
+	`--filter=<filter-spec>` option of linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 00192ae5d3..98148e98fe 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -61,6 +61,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -170,6 +171,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -355,6 +358,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 69509d0c11..232e403b66 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -202,6 +202,19 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf bare.git" &&
+	git clone --no-local --bare . bare.git &&
+
+	git -C bare.git -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C bare.git -c gc.repackFilter=blob:none \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 8/9] repack: implement `--filter-to` for storing filtered out objects
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (6 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 7/9] gc: add `gc.repackFilter` config option Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 15:25             ` [PATCH v7 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
                               ` (2 subsequent siblings)
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

It would be nice if this new different pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage.

Even in a different directory, this pack can be accessible if, for
example, the Git alternates mechanism is used to point to it. In fact
not using the Git alternates mechanism can corrupt a repo as the
generated pack containing the filtered objects might not be accessible
from the repo any more. So setting up the Git alternates mechanism
should be done before using this feature if the user wants the repo to
be fully usable while this feature is used.
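
As a minimal sketch of setting up the alternates mechanism beforehand,
assuming two hypothetical bare repos, `main.git` and `filtered.git`, in a
scratch directory (a loose blob written directly into `filtered.git`
stands in for the pack that `--filter-to` would create there):

```shell
work=$(mktemp -d) &&
git init -q --bare "$work/main.git" &&
git init -q --bare "$work/filtered.git" &&

# Point main.git at the object directory of filtered.git.
echo "$work/filtered.git/objects" >"$work/main.git/objects/info/alternates" &&

# Stand-in for a filtered out object: it exists only in filtered.git.
blob=$(echo content1 | git -C "$work/filtered.git" hash-object -w --stdin) &&

# main.git can still read it through the alternate.
blob_type=$(git -C "$work/main.git" cat-file -t "$blob") &&
blob_content=$(git -C "$work/main.git" cat-file -p "$blob") &&
echo "$blob_type: $blob_content"
```

Once the alternate is in place, a repack such as
`git -C main.git repack -a -d --filter=blob:none
--filter-to=../filtered.git/objects/pack/pack` (the form used in the
tests of this patch) should keep the filtered out objects readable from
`main.git`.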

In some cases, like when a repo has just been cloned or when there is no
other activity in the repo, it's OK to set up the Git alternates
mechanism afterwards though. It's also OK to just inspect the generated
packfile containing the filtered objects and then just move it into the
'.git/objects/pack/' directory manually. That's why it's not necessary
for this command to check that the Git alternates mechanism has
already been set up.

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt | 11 +++++++
 builtin/repack.c             | 10 +++++-
 t/t7700-repack.sh            | 62 ++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 6d5bec7716..8545a32667 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,17 @@ depth is 4095.
 	a single packfile containing all the objects. See
 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. Only useful with `--filter`. This can be
+	used for putting the pack on a separate object directory that
+	is accessed through the Git alternates mechanism. **WARNING:**
+	If the packfile containing the filtered out objects is not
+	accessible, the repo can become corrupt as it might not be
+	possible to access the objects in that packfile. See the
+	`objects` and `objects/info/alternates` sections of
+	linkgit:gitrepository-layout[5].
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index c7b564192f..db9277081d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -977,6 +977,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -1029,6 +1030,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1177,6 +1180,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (po_args.filter_options.choice)
 		strvec_pushf(&cmd.args, "--filter=%s",
 			     expand_list_objects_filter_spec(&po_args.filter_options));
+	else if (filter_to)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
 
 	if (geometry.split_factor)
 		cmd.in = -1;
@@ -1265,8 +1270,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter_options.choice) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(packdir, packtmp),
 					  &existing,
 					  &names);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 39e89445fd..48e92aa6f7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -462,6 +462,68 @@ test_expect_success '--filter works with --pack-kept-objects and .keep packs' '
 	)
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack -c 1 HEAD) &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 0 HEAD:file1) &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack -c 1 $blob_hash) &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack -c 1 HEAD:file1) &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack -c 1 HEAD:foo) &&
+		bar_pack=$(test-tool find-pack -c 1 HEAD:bar) &&
+		base_pack=$(test-tool find-pack -c 1 HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.42.0.279.g57b2ba444c



* [PATCH v7 9/9] gc: add `gc.repackFilterTo` config option
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (7 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 8/9] repack: implement `--filter-to` for storing filtered out objects Christian Couder
@ 2023-09-25 15:25             ` Christian Couder
  2023-09-25 19:14             ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Junio C Hamano
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
  10 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-09-25 15:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit implemented the `gc.repackFilter` config option
to specify a filter that should be used by `git gc` when
performing repacks.

Another previous commit has implemented
`git repack --filter-to=<dir>` to specify the location of the
packfile containing filtered out objects when using a filter.

Let's implement the `gc.repackFilterTo` config option to specify
that location in the config when `gc.repackFilter` is used.

Now when `git gc` performs a repack and a non-empty <dir> is
configured through this option, the repack process is passed a
corresponding `--filter-to=<dir>` argument.
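
The mapping from the two config options to repack arguments, including
the "not empty" condition, could be sketched in shell like this. The
`gc_repack_args` helper and the scratch repository are hypothetical and
only mimic the checks added to `add_repack_all_option()`; they are not
how `git gc` is implemented.

```shell
# Hypothetical helper: print the extra arguments that gc would hand to
# repack for the repository given as $1, one argument per line.
gc_repack_args () {
	filter=$(git -C "$1" config --get gc.repackFilter)
	filter_to=$(git -C "$1" config --get gc.repackFilterTo)
	if test -n "$filter"
	then
		echo "--filter=$filter"
	fi
	if test -n "$filter_to"
	then
		echo "--filter-to=$filter_to"
	fi
}

repo=$(mktemp -d) &&
git init -q "$repo" &&
git -C "$repo" config gc.repackFilter blob:none &&

# An empty value is treated like an unset option: no --filter-to.
git -C "$repo" config gc.repackFilterTo "" &&
args_empty=$(gc_repack_args "$repo") &&

git -C "$repo" config gc.repackFilterTo ../filtered.git/objects/pack/pack &&
args_set=$(gc_repack_args "$repo") &&
printf "%s\n" "$args_set"
```

The location used here is the same relative pack prefix as in the test
added by this patch; in a real setup it would point at an object
directory that the repo can reach, for example through the Git
alternates mechanism.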

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt | 11 +++++++++++
 builtin/gc.c                |  4 ++++
 t/t6500-gc.sh               | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 2153bde7ac..466466d6cc 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -150,6 +150,17 @@ gc.repackFilter::
 	objects into a separate packfile.  See the
 	`--filter=<filter-spec>` option of linkgit:git-repack[1].
 
+gc.repackFilterTo::
+	When repacking and using a filter, see `gc.repackFilter`, the
+	specified location will be used to create the packfile
+	containing the filtered out objects. **WARNING:** The
+	specified location should be accessible, using for example the
+	Git alternates mechanism, otherwise the repo could be
+	considered corrupt by Git as it might not be able to access the
+	objects in that packfile. See the `--filter-to=<dir>` option
+	of linkgit:git-repack[1] and the `objects/info/alternates`
+	section of linkgit:gitrepository-layout[5].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 98148e98fe..68ca8d45bf 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -62,6 +62,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static char *repack_filter;
+static char *repack_filter_to;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -172,6 +173,7 @@ static void gc_config(void)
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
 	git_config_get_string("gc.repackfilter", &repack_filter);
+	git_config_get_string("gc.repackfilterto", &repack_filter_to);
 
 	git_config(git_default_config, NULL);
 }
@@ -361,6 +363,8 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (repack_filter && *repack_filter)
 		strvec_pushf(&repack, "--filter=%s", repack_filter);
+	if (repack_filter_to && *repack_filter_to)
+		strvec_pushf(&repack, "--filter-to=%s", repack_filter_to);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 232e403b66..e412cf8daf 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -203,7 +203,6 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 '
 
 test_expect_success 'gc.repackFilter launches repack with a filter' '
-	test_when_finished "rm -rf bare.git" &&
 	git clone --no-local --bare . bare.git &&
 
 	git -C bare.git -c gc.cruftPacks=false gc &&
@@ -215,6 +214,18 @@ test_expect_success 'gc.repackFilter launches repack with a filter' '
 	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out
 '
 
+test_expect_success 'gc.repackFilterTo stores filtered out objects' '
+	test_when_finished "rm -rf bare.git filtered.git" &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c gc.repackFilter=blob:none \
+		-c gc.repackFilterTo=../filtered.git/objects/pack/pack \
+		-c repack.writeBitmaps=false -c gc.cruftPacks=false gc &&
+
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/*.pack
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.42.0.279.g57b2ba444c



* Re: [PATCH v7 0/9] Repack objects into separate packfiles based on a filter
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (8 preceding siblings ...)
  2023-09-25 15:25             ` [PATCH v7 9/9] gc: add `gc.repackFilterTo` config option Christian Couder
@ 2023-09-25 19:14             ` Junio C Hamano
  2023-09-25 22:41               ` Taylor Blau
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
  10 siblings, 1 reply; 161+ messages in thread
From: Junio C Hamano @ 2023-09-25 19:14 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee, Patrick Steinhardt

Christian Couder <christian.couder@gmail.com> writes:

> # Changes since version 6
>
> Thanks to Junio who reviewed or commented on versions 1, 2, 3, 4 and
> 5, and to Taylor who reviewed or commented on version 1, 3, 4, 5 and
> 6!  Thanks also to Robert Coup who participated in the discussions
> related to version 2 and Peff who participated in the discussions
> related to version 4. There are only the following changes since
> version 6:
>
> - This series has been rebased on top of bcb6cae296 (The twelfth
>   batch, 2023-09-22) to fix conflicts with a `builtin/repack.c`
>   refactoring patch series called tb/repack-existing-packs-cleanup by
>   Taylor Blau that recently graduated to 'master':
>
> 	https://lore.kernel.org/git/cover.1694632644.git.me@ttaylorr.com/
> 	https://lore.kernel.org/git/xmqqil81wqkx.fsf@gitster.g/
>
> - Patch 6/9 (repack: add `--filter=<filter-spec>` option) has been
>   reworked to apply on top of the above mentioned patch series.
>   Taylor even posted the fixup patch to apply to this series so that
>   it works well on top of his series:
>   
>     https://lore.kernel.org/git/ZQNKkn0YYLUyN5Ih@nand.local/

Thanks, both, for working well together.

Will replace and merge to 'seen'.  Let's see others supporting the
change to chime in, and get it merged to 'next' soonish.  I gave a
quick cursory look and changes to rebuild on the "existing packs
cleanup" topic all looked sensible.




* Re: [PATCH v7 0/9] Repack objects into separate packfiles based on a filter
  2023-09-25 19:14             ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Junio C Hamano
@ 2023-09-25 22:41               ` Taylor Blau
  0 siblings, 0 replies; 161+ messages in thread
From: Taylor Blau @ 2023-09-25 22:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee, Patrick Steinhardt

On Mon, Sep 25, 2023 at 12:14:00PM -0700, Junio C Hamano wrote:
> Thanks, both, for working well together.

Christian made it easy to do so! ;-)

> Will replace and merge to 'seen'.  Let's see others supporting the
> change to chime in, and get it merged to 'next' soonish.  I gave a
> quick cursory look and changes to rebuild on the "existing packs
> cleanup" topic all looked sensible.

Sounds good. I took a look over the range-diff and the changes were as
expected. Having reviewed earlier rounds of this series in depth, I'm
comfortable merging this down whenever you are.

Thanks,
Taylor


* [PATCH v8 0/9] Repack objects into separate packfiles based on a filter
  2023-09-25 15:25           ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Christian Couder
                               ` (9 preceding siblings ...)
  2023-09-25 19:14             ` [PATCH v7 0/9] Repack objects into separate packfiles based on a filter Junio C Hamano
@ 2023-10-02 16:54             ` Christian Couder
  2023-10-02 16:54               ` [PATCH v8 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
                                 ` (9 more replies)
  10 siblings, 10 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:54 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/

In these patch series, the `--filter=<filter-spec>` removed the
filtered out objects altogether which was considered very dangerous
even though we implemented different safety checks in some of the
later series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a
filter, but moving the filtered out objects into a separate packfile
instead of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     object can be manually removed after checking that all the objects
     it contains are available through the promisor remote.)

  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also, since version 2 of this new series, commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc, so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 7

Thanks to Junio who reviewed or commented on nearly all the versions,
and to Taylor who reviewed or commented on versions 1, 3, 4, 5 and 6!
Thanks also to Robert Coup who participated in the discussions related
to version 2 and Peff who participated in the discussions related to
version 4.

There are only the following changes since version 7:

- This series has been rebased on top of 493f462273 (The thirteenth
  batch, 2023-09-29) to avoid possible conflicts with other series.

- Patch 2/9 (t/helper: add 'find-pack' test-tool) has been reworked to
  use the "t0081" test script number instead of "t0080" as the latter
  is used by js/doc-unit-tests. I asked in:
  
  https://lore.kernel.org/git/CAP8UFD2YbYH5aZEG5NX8HLe9VeEQ+NhBfiZ9Mhy3UXTUrab3ug@mail.gmail.com/

  if someone thought another number was better, but got no answer.

I checked that CI tests pass in:

https://github.com/chriscool/git/actions/runs/6382343338

All jobs seem to have succeeded.

# Commit overview

(No changes in any of the patches compared to version 7, except on
patch 2/9.)

* 1/9 pack-objects: allow `--filter` without `--stdout`

  To be able to later repack with a filter we need `git pack-objects`
  to write packfiles when it's filtering instead of just writing the
  pack without the filtered out objects to stdout.

* 2/9 t/helper: add 'find-pack' test-tool

  For testing `git repack --filter=...` that we are going to
  implement, it's useful to have a test helper that can tell which
  packfiles contain a specific object. The only change compared to v7
  is the change in test script number.

* 3/9 repack: refactor finishing pack-objects command

  This is a small refactoring creating a new useful function, so that
  `git repack --filter=...` will be able to reuse it.

* 4/9 repack: refactor finding pack prefix

  This is another small refactoring creating a small function that
  will be reused in the next patch.

* 5/9 pack-bitmap-write: rebuild using new bitmap when remapping

  It fixes an issue when bitmaps are rebuilt that was revealed by this
  series, and caused a CI test to fail.

* 6/9 repack: add `--filter=<filter-spec>` option

  This actually adds the `--filter=<filter-spec>` option. It uses one
  `git pack-objects` process with the `--filter` option. And then
  another `git pack-objects` process with the `--stdin-packs`
  option.
  
* 7/9 gc: add `gc.repackFilter` config option

  This is a gc config option so that `git gc` can also repack using a
  filter and put the filtered out objects into a separate packfile.

* 8/9 repack: implement `--filter-to` for storing filtered out objects

  For some use cases, it's useful to create the packfile that
  contains the filtered out objects in a separate location. This is
  similar to the `--expire-to` option for cruft packfiles.

* 9/9 gc: add `gc.repackFilterTo` config option

  This allows specifying the location of the packfile that contains
  the filtered out objects when using `gc.repackFilter`.

# Range-diff since v7

 1:  eec0c09731 =  1:  b23d216277 pack-objects: allow `--filter` without `--stdout`
 2:  19c8b8a4b9 !  2:  27e70ccf39 t/helper: add 'find-pack' test-tool
    @@ t/helper/test-tool.h: int cmd__dump_reftable(int argc, const char **argv);
      int cmd__genrandom(int argc, const char **argv);
      int cmd__genzeros(int argc, const char **argv);
     
    - ## t/t0080-find-pack.sh (new) ##
    + ## t/t0081-find-pack.sh (new) ##
     @@
     +#!/bin/sh
     +
 3:  aaaf40bd5d =  3:  7e692c4cfd repack: refactor finishing pack-objects command
 4:  1eb6bc3f7e =  4:  227159ed4e repack: refactor finding pack prefix
 5:  b9159e1803 =  5:  79786eb5e1 pack-bitmap-write: rebuild using new bitmap when remapping
 6:  f2f5bb54d3 =  6:  205d33850e repack: add `--filter=<filter-spec>` option
 7:  7ea0307628 =  7:  16b1621169 gc: add `gc.repackFilter` config option
 8:  698647815b =  8:  92a5ff7cc7 repack: implement `--filter-to` for storing filtered out objects
 9:  57b2ba444c =  9:  5bfd918c90 gc: add `gc.repackFilterTo` config option

Christian Couder (9):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  pack-bitmap-write: rebuild using new bitmap when remapping
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 ++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 +++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 164 ++++++++++++++------
 pack-bitmap-write.c                    |   6 +-
 t/helper/test-find-pack.c              |  50 +++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t0081-find-pack.sh                   |  82 ++++++++++
 t/t5317-pack-objects-filter-objects.sh |   8 +
 t/t6500-gc.sh                          |  24 +++
 t/t7700-repack.sh                      | 197 +++++++++++++++++++++++++
 15 files changed, 544 insertions(+), 51 deletions(-)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0081-find-pack.sh

-- 
2.42.0.305.g5bfd918c90



* [PATCH v8 1/9] pack-objects: allow `--filter` without `--stdout`
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
@ 2023-10-02 16:54               ` Christian Couder
  2023-10-02 16:54               ` [PATCH v8 2/9] t/helper: add 'find-pack' test-tool Christian Couder
                                 ` (8 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:54 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.

Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt     | 4 ++--
 builtin/pack-objects.c                 | 8 ++------
 t/t5317-pack-objects-filter-objects.sh | 8 ++++++++
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index dea7eacb0f..e32404c6aa 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -296,8 +296,8 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	nevertheless.
 
 --filter=<filter-spec>::
-	Requires `--stdout`.  Omits certain objects (usually blobs) from
-	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	Omits certain objects (usually blobs) from the resulting
+	packfile.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
 --no-filter::
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6eb9756836..89a8b5a976 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4402,12 +4402,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
index b26d476c64..2ff3eef9a3 100755
--- a/t/t5317-pack-objects-filter-objects.sh
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -53,6 +53,14 @@ test_expect_success 'verify blob:none packfile has no blobs' '
 	! grep blob verify_result
 '
 
+test_expect_success 'verify blob:none packfile without --stdout' '
+	git -C r1 pack-objects --revs --filter=blob:none mypackname >packhash <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 verify-pack -v "mypackname-$(cat packhash).pack" >verify_result &&
+	! grep blob verify_result
+'
+
 test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
 	git -C r1 verify-pack -v ../all.pack >verify_result &&
 	grep -E "commit|tree" verify_result |
-- 
2.42.0.305.g5bfd918c90



* [PATCH v8 2/9] t/helper: add 'find-pack' test-tool
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
  2023-10-02 16:54               ` [PATCH v8 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
@ 2023-10-02 16:54               ` Christian Couder
  2023-10-02 16:54               ` [PATCH v8 3/9] repack: refactor finishing pack-objects command Christian Couder
                                 ` (7 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:54 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

In a following commit, we will make it possible to separate objects into
different packfiles depending on a filter.

To make sure that the right objects are in the right packs, let's add a
new test-tool that can display which packfile(s) a given object is in.

Let's also make it possible to check if a given object is in the
expected number of packfiles with a `--check-count <n>` option.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Makefile                  |  1 +
 t/helper/test-find-pack.c | 50 ++++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/t0081-find-pack.sh      | 82 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)
 create mode 100644 t/helper/test-find-pack.c
 create mode 100755 t/t0081-find-pack.sh

diff --git a/Makefile b/Makefile
index 003e63b792..f267034d23 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-env-helper.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-fast-rebase.o
+TEST_BUILTINS_OBJS += test-find-pack.o
 TEST_BUILTINS_OBJS += test-fsmonitor-client.o
 TEST_BUILTINS_OBJS += test-genrandom.o
 TEST_BUILTINS_OBJS += test-genzeros.o
diff --git a/t/helper/test-find-pack.c b/t/helper/test-find-pack.c
new file mode 100644
index 0000000000..e8bd793e58
--- /dev/null
+++ b/t/helper/test-find-pack.c
@@ -0,0 +1,50 @@
+#include "test-tool.h"
+#include "object-name.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "setup.h"
+
+/*
+ * Display the path(s), one per line, of the packfile(s) containing
+ * the given object.
+ *
+ * If '--check-count <n>' is passed, then error out if the number of
+ * packfiles containing the object is not <n>.
+ */
+
+static const char *find_pack_usage[] = {
+	"test-tool find-pack [--check-count <n>] <object>",
+	NULL
+};
+
+int cmd__find_pack(int argc, const char **argv)
+{
+	struct object_id oid;
+	struct packed_git *p;
+	int count = -1, actual_count = 0;
+	const char *prefix = setup_git_directory();
+
+	struct option options[] = {
+		OPT_INTEGER('c', "check-count", &count, "expected number of packs"),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, prefix, options, find_pack_usage, 0);
+	if (argc != 1)
+		usage(find_pack_usage[0]);
+
+	if (repo_get_oid(the_repository, argv[0], &oid))
+		die("cannot parse %s as an object name", argv[0]);
+
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		if (find_pack_entry_one(oid.hash, p)) {
+			printf("%s\n", p->pack_name);
+			actual_count++;
+		}
+
+	if (count > -1 && count != actual_count)
+		die("bad packfile count %d instead of %d", actual_count, count);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 621ac3dd10..9010ac6de7 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -31,6 +31,7 @@ static struct test_cmd cmds[] = {
 	{ "env-helper", cmd__env_helper },
 	{ "example-decorate", cmd__example_decorate },
 	{ "fast-rebase", cmd__fast_rebase },
+	{ "find-pack", cmd__find_pack },
 	{ "fsmonitor-client", cmd__fsmonitor_client },
 	{ "genrandom", cmd__genrandom },
 	{ "genzeros", cmd__genzeros },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index a641c3a81d..f134f96b97 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -25,6 +25,7 @@ int cmd__dump_reftable(int argc, const char **argv);
 int cmd__env_helper(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__fast_rebase(int argc, const char **argv);
+int cmd__find_pack(int argc, const char **argv);
 int cmd__fsmonitor_client(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
 int cmd__genzeros(int argc, const char **argv);
diff --git a/t/t0081-find-pack.sh b/t/t0081-find-pack.sh
new file mode 100755
index 0000000000..67b11216a3
--- /dev/null
+++ b/t/t0081-find-pack.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='test `test-tool find-pack`'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	test_commit three &&
+	test_commit four &&
+	test_commit five
+'
+
+test_expect_success 'repack everything into a single packfile' '
+	git repack -a -d --no-write-bitmap-index &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+	head_tree_pack=$(test-tool find-pack HEAD^{tree}) &&
+	one_pack=$(test-tool find-pack HEAD:one.t) &&
+	three_pack=$(test-tool find-pack HEAD:three.t) &&
+	old_commit_pack=$(test-tool find-pack HEAD~4) &&
+
+	test-tool find-pack --check-count 1 HEAD &&
+	test-tool find-pack --check-count=1 HEAD^{tree} &&
+	! test-tool find-pack --check-count=0 HEAD:one.t &&
+	! test-tool find-pack -c 2 HEAD:one.t &&
+	test-tool find-pack -c 1 HEAD:three.t &&
+
+	# Packfile exists at the right path
+	case "$head_commit_pack" in
+		".git/objects/pack/pack-"*".pack") true ;;
+		*) false ;;
+	esac &&
+	test -f "$head_commit_pack" &&
+
+	# Everything is in the same pack
+	test "$head_commit_pack" = "$head_tree_pack" &&
+	test "$head_commit_pack" = "$one_pack" &&
+	test "$head_commit_pack" = "$three_pack" &&
+	test "$head_commit_pack" = "$old_commit_pack"
+'
+
+test_expect_success 'add more packfiles' '
+	git rev-parse HEAD^{tree} HEAD:two.t HEAD:four.t >objects &&
+	git pack-objects .git/objects/pack/mypackname1 >packhash1 <objects &&
+
+	git rev-parse HEAD~ HEAD~^{tree} HEAD:five.t >objects &&
+	git pack-objects .git/objects/pack/mypackname2 >packhash2 <objects &&
+
+	head_commit_pack=$(test-tool find-pack HEAD) &&
+
+	# HEAD^{tree} is in 2 packfiles
+	test-tool find-pack HEAD^{tree} >head_tree_packs &&
+	grep "$head_commit_pack" head_tree_packs &&
+	grep mypackname1 head_tree_packs &&
+	! grep mypackname2 head_tree_packs &&
+	test-tool find-pack --check-count 2 HEAD^{tree} &&
+	! test-tool find-pack --check-count 1 HEAD^{tree} &&
+
+	# HEAD:five.t is also in 2 packfiles
+	test-tool find-pack HEAD:five.t >five_packs &&
+	grep "$head_commit_pack" five_packs &&
+	! grep mypackname1 five_packs &&
+	grep mypackname2 five_packs &&
+	test-tool find-pack -c 2 HEAD:five.t &&
+	! test-tool find-pack --check-count=0 HEAD:five.t
+'
+
+test_expect_success 'add more commits (as loose objects)' '
+	test_commit six &&
+	test_commit seven &&
+
+	test -z "$(test-tool find-pack HEAD)" &&
+	test -z "$(test-tool find-pack HEAD:six.t)" &&
+	test-tool find-pack --check-count 0 HEAD &&
+	test-tool find-pack -c 0 HEAD:six.t &&
+	! test-tool find-pack -c 1 HEAD:seven.t
+'
+
+test_done
-- 
2.42.0.305.g5bfd918c90



* [PATCH v8 3/9] repack: refactor finishing pack-objects command
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
  2023-10-02 16:54               ` [PATCH v8 1/9] pack-objects: allow `--filter` without `--stdout` Christian Couder
  2023-10-02 16:54               ` [PATCH v8 2/9] t/helper: add 'find-pack' test-tool Christian Couder
@ 2023-10-02 16:54               ` Christian Couder
  2023-10-02 16:54               ` [PATCH v8 4/9] repack: refactor finding pack prefix Christian Couder
                                 ` (6 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:54 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new finish_pack_objects_cmd() to refactor duplicated code
that handles reading the packfile names from the output of a
`git pack-objects` command and putting them into a string_list, as well
as calling finish_command().

While at it, beautify a code comment a bit in the new function.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 70 +++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 529e13120d..d0ab55c0d9 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -806,6 +806,36 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int finish_pack_objects_cmd(struct child_process *cmd,
+				   struct string_list *names,
+				   int local)
+{
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+
+	out = xfdopen(cmd->out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		struct string_list_item *item;
+
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		/*
+		 * Avoid putting packs written outside of the repository in the
+		 * list of names.
+		 */
+		if (local) {
+			item = string_list_append(names, line.buf);
+			item->util = populate_pack_exts(line.buf);
+		}
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(cmd);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -814,9 +844,8 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			    struct existing_packs *existing)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
-	struct strbuf line = STRBUF_INIT;
 	struct string_list_item *item;
-	FILE *in, *out;
+	FILE *in;
 	int ret;
 	const char *scratch;
 	int local = skip_prefix(destination, packdir, &scratch);
@@ -861,27 +890,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only "
-			      "from pack-objects."));
-		/*
-		 * avoid putting packs written outside of the repository in the
-		 * list of names
-		 */
-		if (local) {
-			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
-		}
-	}
-	fclose(out);
-
-	strbuf_release(&line);
-
-	return finish_command(&cmd);
+	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
 int cmd_repack(int argc, const char **argv, const char *prefix)
@@ -891,10 +900,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct existing_packs existing = EXISTING_PACKS_INIT;
 	struct pack_geometry geometry = { 0 };
-	struct strbuf line = STRBUF_INIT;
 	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
-	FILE *out;
 	int show_progress;
 
 	/* variables to be filled by option parsing */
@@ -1124,18 +1131,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	out = xfdopen(cmd.out, "r");
-	while (strbuf_getline_lf(&line, out) != EOF) {
-		struct string_list_item *item;
-
-		if (line.len != the_hash_algo->hexsz)
-			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
-		item = string_list_append(&names, line.buf);
-		item->util = populate_pack_exts(item->string);
-	}
-	strbuf_release(&line);
-	fclose(out);
-	ret = finish_command(&cmd);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1);
 	if (ret)
 		goto cleanup;
 
-- 
2.42.0.305.g5bfd918c90



* [PATCH v8 4/9] repack: refactor finding pack prefix
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
                                 ` (2 preceding siblings ...)
  2023-10-02 16:54               ` [PATCH v8 3/9] repack: refactor finishing pack-objects command Christian Couder
@ 2023-10-02 16:54               ` Christian Couder
  2023-10-02 16:55               ` [PATCH v8 5/9] pack-bitmap-write: rebuild using new bitmap when remapping Christian Couder
                                 ` (5 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:54 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder

Create a new find_pack_prefix() to refactor code that handles finding
the pack prefix from the packtmp and packdir global variables, as we are
going to need this feature again in a following commit.
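
As an illustration only, the computation can be sketched outside of
Git as follows. The `*_sketch` names and the example paths are made up
for this message; skip_prefix_sketch() is a simplified stand-in for
Git's skip_prefix(), and the sketch returns NULL where the real code
calls die().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for Git's skip_prefix(): if `str` starts with
 * `prefix`, store a pointer to the remainder in `*out` and return 1;
 * otherwise return 0 and leave `*out` untouched.
 */
static int skip_prefix_sketch(const char *str, const char *prefix,
			      const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Return the part of `packtmp` that follows `packdir` and an optional
 * '/', or NULL if `packtmp` does not begin with `packdir`.
 */
static const char *pack_prefix_sketch(const char *packdir,
				      const char *packtmp)
{
	const char *rest;

	assert(packdir && packtmp);
	if (!skip_prefix_sketch(packtmp, packdir, &rest))
		return NULL;
	if (*rest == '/')
		rest++;
	return rest;
}
```

With a packdir of `objects/pack` and a packtmp of
`objects/pack/.tmp-123-pack`, the sketch yields `.tmp-123-pack`.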

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/repack.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index d0ab55c0d9..9ef0044384 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -893,6 +893,17 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
+static const char *find_pack_prefix(const char *packdir, const char *packtmp)
+{
+	const char *pack_prefix;
+	if (!skip_prefix(packtmp, packdir, &pack_prefix))
+		die(_("pack prefix %s does not begin with objdir %s"),
+		    packtmp, packdir);
+	if (*pack_prefix == '/')
+		pack_prefix++;
+	return pack_prefix;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -1139,12 +1150,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		printf_ln(_("Nothing new to pack."));
 
 	if (pack_everything & PACK_CRUFT) {
-		const char *pack_prefix;
-		if (!skip_prefix(packtmp, packdir, &pack_prefix))
-			die(_("pack prefix %s does not begin with objdir %s"),
-			    packtmp, packdir);
-		if (*pack_prefix == '/')
-			pack_prefix++;
+		const char *pack_prefix = find_pack_prefix(packdir, packtmp);
 
 		if (!cruft_po_args.window)
 			cruft_po_args.window = po_args.window;
-- 
2.42.0.305.g5bfd918c90



* [PATCH v8 5/9] pack-bitmap-write: rebuild using new bitmap when remapping
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
                                 ` (3 preceding siblings ...)
  2023-10-02 16:54               ` [PATCH v8 4/9] repack: refactor finding pack prefix Christian Couder
@ 2023-10-02 16:55               ` Christian Couder
  2023-10-02 16:55               ` [PATCH v8 6/9] repack: add `--filter=<filter-spec>` option Christian Couder
                                 ` (4 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:55 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

`git repack` is about to learn a new `--filter=<filter-spec>` option and
we will want to check that this option is incompatible with
`--write-bitmap-index`.

Unfortunately it appears that a test like:

test_expect_success '--filter fails with --write-bitmap-index' '
       test_must_fail \
               env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
               git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
'

sometimes fails because, when rebuilding bitmaps, we appear to reuse
existing bitmap information. So instead of detecting that some objects
are missing and erroring out as it should, the
`git repack --write-bitmap-index --filter=...` command succeeds.

Let's fix that by making sure we rebuild bitmaps using new bitmaps
instead of existing ones.
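
To illustrate why a scratch bitmap helps, here is a toy model, not the
actual code. It assumes that the remapping step may already have set
some bits in its destination by the time it notices a missing object
and reports failure. All names below are hypothetical, and a bitmap is
reduced to a single 64-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy remapping: mapping[i] is the new 1-based position of old bit i,
 * or 0 when that object is not in the new pack. Returns 0 on success
 * and -1 as soon as a set bit has no new position. Bits translated
 * before the failure stay set in `*dst`.
 */
static int remap_sketch(const uint32_t *mapping, size_t nr,
			uint64_t old, uint64_t *dst)
{
	size_t i;

	assert(nr <= 64);
	for (i = 0; i < nr; i++) {
		if (!(old & (UINT64_C(1) << i)))
			continue;
		if (!mapping[i])
			return -1; /* object missing from the new pack */
		assert(mapping[i] <= 64);
		*dst |= UINT64_C(1) << (mapping[i] - 1);
	}
	return 0;
}

/* Before: remap straight into the commit's bitmap. */
static int fill_direct(const uint32_t *mapping, size_t nr,
		       uint64_t old, uint64_t *commit_bitmap)
{
	return remap_sketch(mapping, nr, old, commit_bitmap);
}

/* After: remap into a scratch word and merge it only on success. */
static int fill_via_scratch(const uint32_t *mapping, size_t nr,
			    uint64_t old, uint64_t *commit_bitmap)
{
	uint64_t scratch = 0;

	if (remap_sketch(mapping, nr, old, &scratch))
		return -1;
	*commit_bitmap |= scratch;
	return 0;
}
```

With a mapping in which the second object is missing, fill_direct()
reports failure but leaves the first object's bit set in the commit's
bitmap, while fill_via_scratch() leaves that bitmap untouched.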

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap-write.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f6757c3cbf..f4ecdf8b0e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -413,15 +413,19 @@ static int fill_bitmap_commit(struct bb_commit *ent,
 
 		if (old_bitmap && mapping) {
 			struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c);
+			struct bitmap *remapped = bitmap_new();
 			/*
 			 * If this commit has an old bitmap, then translate that
 			 * bitmap and add its bits to this one. No need to walk
 			 * parents or the tree for this commit.
 			 */
-			if (old && !rebuild_bitmap(mapping, old, ent->bitmap)) {
+			if (old && !rebuild_bitmap(mapping, old, remapped)) {
+				bitmap_or(ent->bitmap, remapped);
+				bitmap_free(remapped);
 				reused_bitmaps_nr++;
 				continue;
 			}
+			bitmap_free(remapped);
 		}
 
 		/*
-- 
2.42.0.305.g5bfd918c90



* [PATCH v8 6/9] repack: add `--filter=<filter-spec>` option
  2023-10-02 16:54             ` [PATCH v8 " Christian Couder
                                 ` (4 preceding siblings ...)
  2023-10-02 16:55               ` [PATCH v8 5/9] pack-bitmap-write: rebuild using new bitmap when remapping Christian Couder
@ 2023-10-02 16:55               ` Christian Couder
  2023-10-02 16:55               ` [PATCH v8 7/9] gc: add `gc.repackFilter` config option Christian Couder
                                 ` (3 subsequent siblings)
  9 siblings, 0 replies; 161+ messages in thread
From: Christian Couder @ 2023-10-02 16:55 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Patrick Steinhardt, Christian Couder,
	Christian Couder

This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into separate, cheaper (though slower) storage.

It's possible to find which new packfile contains the filtered out
objects using one of the following:

  - `git verify-pack -v ...`,
  - `test-tool find-pack ...`, which a previous commit added,
  - `--filter-to=<dir>`, which a following commit will add to specify
    where the pack containing the filtered out objects will be.

This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We feed all the previously existing packs to its stdin,
so that it will pack all the objects in the previously existing packs.
We also feed to its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
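
The resulting contents of the two packs can be modeled with a bit of
set arithmetic. This is only a simplified sketch: each object is one
bit in a 64-bit word, loose objects are ignored, and the struct and
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the repository state before repacking. */
struct repack_model {
	uint64_t existing;     /* objects in previously existing packs */
	uint64_t kept;         /* objects in kept packs */
	uint64_t filter_match; /* objects the filter omits */
};

/*
 * First `git pack-objects --filter=<filter-spec>` run: pack the
 * objects while omitting those specified by the filter.
 */
static uint64_t first_pass(const struct repack_model *m)
{
	assert(m);
	return m->existing & ~m->filter_match;
}

/*
 * Second `git pack-objects --stdin-packs` run: take everything from
 * the previously existing packs, minus the objects in the '^'-prefixed
 * packs (the first pass's pack and the kept packs).
 */
static uint64_t second_pass(const struct repack_model *m,
			    uint64_t first_pack)
{
	uint64_t excluded;

	assert(m);
	excluded = first_pack | m->kept;
	return m->existing & ~excluded;
}
```

For example, with four objects in the existing packs, two of them
matched by the filter and one of those two also in a kept pack, the
second pack ends up with just the one filtered out object that is not
in a kept pack.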

As the interactions with kept packs are a bit tricky, a few related
tests are added.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  12 ++++
 builtin/repack.c             |  70 ++++++++++++++++++
 t/t7700-repack.sh            | 135 +++++++++++++++++++++++++++++++++++
 3 files changed, 217 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..6d5bec7716 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,18 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  Also `--no-write-bitmap-index` (or the
+	`repack.writebitmaps` config option set to `false`) should be
+	used otherwise writing bitmap index will fail, as it supposes
+	a single packfile containing all the objects. See
+	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This