Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
@ 2025-03-10  1:50 Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
                   ` (14 more replies)
  0 siblings, 15 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee

Here is a full submission of the --path-walk feature for 'git pack-objects'
and 'git repack'. It's been discussed in an RFC [1], as a future application
for the path walk API [2], and is updated now that --name-hash-version=2
exists (as a replacement for the --full-name-hash option from the RFC) [3].

[1]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/

[2]
https://lore.kernel.org/git/pull.1818.git.1730356023.gitgitgadget@gmail.com

[3]
https://lore.kernel.org/git/pull.1813.git.1728396723.gitgitgadget@gmail.com

This patch series does the following:

 1. Add a new '--path-walk' option to 'git pack-objects' that uses the
    path-walk API instead of the revision API to collect objects for delta
    compression.

 2. Add a new '--path-walk' option to 'git repack' to pass this option along
    to 'git pack-objects'.

 3. Add a new 'pack.usePathWalk' config option to opt into this option
    implicitly, such as in 'git push'.

 4. Optimize the '--path-walk' option using threading so it better competes
    with the existing multi-threaded delta compression mechanism.

 5. Update the path-walk API with a new 'edge_aggressive' option that pairs
    close to the --edge-aggressive option in the revision API. This is
    useful when creating thin packs inside shallow clones.

This feature works by using the path-walk API to emit groups of objects that
appear at the same path. These groups are tracked so they can be tested for
delta compression with each other, and then after those groups are tested a
second pass using the name-hash attempts to find better (or first time)
deltas across path boundaries. This second pass is much faster than a fresh
pass since the existing deltas are used as a limit for the size of
potentially new deltas, short-circuiting the checks when the delta size
exceeds the current-best.

The benefits of the --path-walk feature first come into play when the name
hash functions have many collisions, so sorting by name hash value leads to
unhelpful groupings of objects. Many of these benefits are improved by
--name-hash-version=2, but collisions still exist with any hash-based
approach. There are also performance benefits in some cases due to the
isolation of delta compression testing within path groups.

All of the benefits of the --path-walk feature are less dramatic when
compared to --name-hash-version=2, but they can still exist in many cases. I
have also seen some cases where --name-hash-version=2 compresses better than
--path-walk with --name-hash-version=1, but these options can be combined to
get the best of both worlds.

Detailed statistics are provided within patch messages, but a few are
highlighted here:

The microsoft/fluentui is a public Javascript repo that suffers from many of
the name hash collisions as internal repositories I've worked with. Here is
a comparison of the compressed size and end-to-end time of the repack:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      28.16s


Less dramatic, but perhaps more standardly structured is the nodejs/node
repository, with these stats:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk              698.0M      75.10s


Even the Linux kernel repository gains some benefits, even though the number
of hash collisions is relatively low due to a preference for short
filenames:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk                2.2G     559.00s


The drawbacks of the --path-walk feature is that it will be harder to
integrate it with bitmap features, specifically delta islands. This is not
insurmountable, but would require more work, such as a revision walk to
paint objects with reachability information before using that during delta
computations.

However, there should still be significant benefits to Git clients trying to
save space and improve local performance.

This feature was shipped with similar features in microsoft/git as of
v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
that had significant repository growth due to constructing a batch of
beachball [5] CHANGELOG.[md|json] files and pushing them to a release
branch. These pushes were frequently 70-200 MB due to poor delta
compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
in size by 100x while improving performance. Since these CI machines were
working with a shallow clone, the 'edge_aggressive' changes were required to
enable the path-walk option.

[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3

[5] https://github.com/microsoft/beachball

This version incorporates feedback from previous RFCs and reviewed patch
series whenever possible. It also benefits from learned experience, much of
which was already applied in the original path-walk API submission.

Thanks, -Stolee

Derrick Stolee (13):
  pack-objects: extract should_attempt_deltas()
  pack-objects: add --path-walk option
  pack-objects: update usage to match docs
  p5313: add performance tests for --path-walk
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  t5538: add tests to confirm deltas in shallow pushes
  repack: add --path-walk option
  pack-objects: enable --path-walk via config
  scalar: enable path-walk during push via config
  pack-objects: refactor path-walk delta phase
  pack-objects: thread the path-based compression
  path-walk: add new 'edge_aggressive' option
  pack-objects: allow --shallow and --path-walk

 Documentation/config/feature.adoc          |   4 +
 Documentation/config/pack.adoc             |   8 +
 Documentation/git-pack-objects.adoc        |  25 +-
 Documentation/git-repack.adoc              |  14 +-
 Documentation/technical/api-path-walk.adoc |   9 +
 builtin/pack-objects.c                     | 411 +++++++++++++++++++--
 builtin/repack.c                           |   7 +-
 pack-objects.h                             |  12 +
 path-walk.c                                |   6 +-
 path-walk.h                                |   7 +
 repo-settings.c                            |   3 +
 repo-settings.h                            |   1 +
 scalar.c                                   |   1 +
 t/README                                   |   4 +
 t/helper/test-path-walk.c                  |   2 +
 t/perf/p5313-pack-objects.sh               |  37 +-
 t/t0411-clone-from-partial.sh              |   6 +
 t/t0450/adoc-help-mismatches               |   1 -
 t/t5300-pack-object.sh                     |  19 +
 t/t5306-pack-nobase.sh                     |   5 +
 t/t5310-pack-bitmaps.sh                    |  13 +-
 t/t5316-pack-delta-depth.sh                |   9 +-
 t/t5332-multi-pack-reuse.sh                |   7 +
 t/t5538-push-shallow.sh                    |  34 ++
 t/t6601-path-walk.sh                       |  20 +
 t/t7406-submodule-update.sh                |   3 +
 26 files changed, 601 insertions(+), 67 deletions(-)


base-commit: a36e024e989f4d35f35987a60e3af8022cac3420
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1819%2Fderrickstolee%2Fpath-walk-upstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1819/derrickstolee/path-walk-upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1819
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 01/13] pack-objects: extract should_attempt_deltas()
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-12 21:01   ` Taylor Blau
  2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

This will be helpful in a future change, which will reuse this logic.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c | 53 +++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b161262..1d0992a8dac 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3196,6 +3196,33 @@ static int add_ref_tag(const char *tag UNUSED, const char *referent UNUSED, cons
 	return 0;
 }
 
+static int should_attempt_deltas(struct object_entry *entry)
+{
+	if (DELTA(entry))
+		return 0;
+
+	if (!entry->type_valid ||
+	    oe_size_less_than(&to_pack, entry, 50))
+		return 0;
+
+	if (entry->no_try_delta)
+		return 0;
+
+	if (!entry->preferred_base) {
+		if (oe_type(entry) < 0)
+			die(_("unable to get type of object %s"),
+				oid_to_hex(&entry->idx.oid));
+	} else if (oe_type(entry) < 0) {
+		/*
+		 * This object is not found, but we
+		 * don't have to include it anyway.
+		 */
+		return 0;
+	}
+
+	return 1;
+}
+
 static void prepare_pack(int window, int depth)
 {
 	struct object_entry **delta_list;
@@ -3226,33 +3253,11 @@ static void prepare_pack(int window, int depth)
 	for (i = 0; i < to_pack.nr_objects; i++) {
 		struct object_entry *entry = to_pack.objects + i;
 
-		if (DELTA(entry))
-			/* This happens if we decided to reuse existing
-			 * delta from a pack.  "reuse_delta &&" is implied.
-			 */
-			continue;
-
-		if (!entry->type_valid ||
-		    oe_size_less_than(&to_pack, entry, 50))
+		if (!should_attempt_deltas(entry))
 			continue;
 
-		if (entry->no_try_delta)
-			continue;
-
-		if (!entry->preferred_base) {
+		if (!entry->preferred_base)
 			nr_deltas++;
-			if (oe_type(entry) < 0)
-				die(_("unable to get type of object %s"),
-				    oid_to_hex(&entry->idx.oid));
-		} else {
-			if (oe_type(entry) < 0) {
-				/*
-				 * This object is not found, but we
-				 * don't have to include it anyway.
-				 */
-				continue;
-			}
-		}
 
 		delta_list[n++] = entry;
 	}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 02/13] pack-objects: add --path-walk option
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-12 21:14   ` Taylor Blau
  2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-pack-objects.adoc        |  13 +-
 Documentation/technical/api-path-walk.adoc |   1 +
 builtin/pack-objects.c                     | 147 +++++++++++++++++++--
 t/t5300-pack-object.sh                     |  15 +++
 4 files changed, 166 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 7f69ae4855f..7dbbe6d54d2 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -16,7 +16,7 @@ SYNOPSIS
 	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse]
-	[--name-hash-version=<n>] < <object-list>
+	[--name-hash-version=<n>] [--path-walk] < <object-list>
 
 
 DESCRIPTION
@@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed
 when writing reachability bitmap files with `--write-bitmap-index` and it
 will be automatically changed to version `1`.
 
+--path-walk::
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path they
+	appear relative to a commit's root tree. The `--path-walk` option
+	enables a different walking algorithm that organizes trees and
+	blobs by path. This has the potential to improve delta compression
+	especially in the presence of filenames that cause collisions in
+	Git's default name-hash algorithm. Due to changing how the objects
+	are walked, this option is not compatible with `--delta-islands`,
+	`--shallow`, or `--filter`.
+
 
 DELTA ISLANDS
 -------------
diff --git a/Documentation/technical/api-path-walk.adoc b/Documentation/technical/api-path-walk.adoc
index 3e089211fb4..e522695dd9f 100644
--- a/Documentation/technical/api-path-walk.adoc
+++ b/Documentation/technical/api-path-walk.adoc
@@ -69,4 +69,5 @@ Examples
 
 See example usages in:
 	`t/helper/test-path-walk.c`,
+	`builtin/pack-objects.c`,
 	`builtin/backfill.c`
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1d0992a8dac..5596c409927 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -41,6 +41,9 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "blob.h"
+#include "tree.h"
+#include "path-walk.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -217,6 +220,7 @@ static int delta_search_threads;
 static int pack_to_stdout;
 static int sparse;
 static int thin;
+static int path_walk;
 static int num_preferred_base;
 static struct progress *progress_state;
 
@@ -4188,6 +4192,105 @@ static void mark_bitmap_preferred_tips(void)
 	}
 }
 
+static inline int is_oid_interesting(struct repository *repo,
+				     struct object_id *oid)
+{
+	struct object *o = lookup_object(repo, oid);
+	return o && !(o->flags & UNINTERESTING);
+}
+
+static int add_objects_by_path(const char *path,
+			       struct oid_array *oids,
+			       enum object_type type,
+			       void *data)
+{
+	struct object_entry **delta_list;
+	size_t oe_start = to_pack.nr_objects;
+	size_t oe_end;
+	unsigned int sub_list_size;
+	unsigned int *processed = data;
+
+	/*
+	 * First, add all objects to the packing data, including the ones
+	 * marked UNINTERESTING (translated to 'exclude') as they can be
+	 * used as delta bases.
+	 */
+	for (size_t i = 0; i < oids->nr; i++) {
+		int exclude;
+		struct object_info oi = OBJECT_INFO_INIT;
+		struct object_id *oid = &oids->oid[i];
+
+		/* Skip objects that do not exist locally. */
+		if (exclude_promisor_objects &&
+		    oid_object_info_extended(the_repository, oid, &oi,
+					     OBJECT_INFO_FOR_PREFETCH) < 0)
+			continue;
+
+		exclude = !is_oid_interesting(the_repository, oid);
+
+		if (exclude && !thin)
+			continue;
+
+		add_object_entry(oid, type, path, exclude);
+	}
+
+	oe_end = to_pack.nr_objects;
+
+	/* We can skip delta calculations if it is a no-op. */
+	if (oe_end == oe_start || !window)
+		return 0;
+
+	sub_list_size = 0;
+	ALLOC_ARRAY(delta_list, oe_end - oe_start);
+
+	for (size_t i = 0; i < oe_end - oe_start; i++) {
+		struct object_entry *entry = to_pack.objects + oe_start + i;
+
+		if (!should_attempt_deltas(entry))
+			continue;
+
+		delta_list[sub_list_size++] = entry;
+	}
+
+	/*
+	 * Find delta bases among this list of objects that all match the same
+	 * path. This causes the delta compression to be interleaved in the
+	 * object walk, which can lead to confusing progress indicators. This is
+	 * also incompatible with threaded delta calculations. In the future,
+	 * consider creating a list of regions in the full to_pack.objects array
+	 * that could be picked up by the threaded delta computation.
+	 */
+	if (sub_list_size && window) {
+		QSORT(delta_list, sub_list_size, type_size_sort);
+		find_deltas(delta_list, &sub_list_size, window, depth, processed);
+	}
+
+	free(delta_list);
+	return 0;
+}
+
+static void get_object_list_path_walk(struct rev_info *revs)
+{
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	unsigned int processed = 0;
+
+	info.revs = revs;
+	info.path_fn = add_objects_by_path;
+	info.path_fn_data = &processed;
+	revs->tag_objects = 1;
+
+	/*
+	 * Allow the --[no-]sparse option to be interesting here, if only
+	 * for testing purposes. Paths with no interesting objects will not
+	 * contribute to the resulting pack, but only create noisy preferred
+	 * base objects.
+	 */
+	info.prune_all_uninteresting = sparse;
+
+	if (walk_objects_by_path(&info))
+		die(_("failed to pack objects via path-walk"));
+}
+
 static void get_object_list(struct rev_info *revs, int ac, const char **av)
 {
 	struct setup_revision_opt s_r_opt = {
@@ -4234,7 +4337,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 
 	warn_on_object_refname_ambiguity = save_warning;
 
-	if (use_bitmap_index && !get_object_list_from_bitmap(revs))
+	if (use_bitmap_index && !path_walk && !get_object_list_from_bitmap(revs))
 		return;
 
 	if (use_delta_islands)
@@ -4243,15 +4346,19 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	if (write_bitmap_index)
 		mark_bitmap_preferred_tips();
 
-	if (prepare_revision_walk(revs))
-		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(revs, show_edge, sparse);
-
 	if (!fn_show_object)
 		fn_show_object = show_object;
-	traverse_commit_list(revs,
-			     show_commit, fn_show_object,
-			     NULL);
+
+	if (path_walk) {
+		get_object_list_path_walk(revs);
+	} else {
+		if (prepare_revision_walk(revs))
+			die(_("revision walk setup failed"));
+		mark_edges_uninteresting(revs, show_edge, sparse);
+		traverse_commit_list(revs,
+				show_commit, fn_show_object,
+				NULL);
+	}
 
 	if (unpack_unreachable_expiration) {
 		revs->ignore_missing_links = 1;
@@ -4461,6 +4568,8 @@ int cmd_pack_objects(int argc,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
+		OPT_BOOL(0, "path-walk", &path_walk,
+			 N_("use the path-walk API to walk objects when possible")),
 		OPT_BOOL(0, "shallow", &shallow,
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
@@ -4546,7 +4655,27 @@ int cmd_pack_objects(int argc,
 		window = 0;
 
 	strvec_push(&rp, "pack-objects");
-	if (thin) {
+
+	if (path_walk && filter_options.choice) {
+		warning(_("cannot use --filter with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk && use_delta_islands) {
+		warning(_("cannot use delta islands with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk && shallow) {
+		warning(_("cannot use --shallow with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk) {
+		strvec_push(&rp, "--boundary");
+		 /*
+		  * We must disable the bitmaps because we are removing
+		  * the --objects / --objects-edge[-aggressive] options.
+		  */
+		use_bitmap_index = 0;
+	} else if (thin) {
 		use_internal_rev_list = 1;
 		strvec_push(&rp, shallow
 				? "--objects-edge-aggressive"
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 5ac8d39094b..16420d12863 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -723,4 +723,19 @@ test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompat
 	! test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err
 '
 
+test_expect_success '--path-walk pack everything' '
+	git -C server rev-parse HEAD >in &&
+	git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+	git -C server index-pack --stdin <out.pack
+'
+
+test_expect_success '--path-walk thin pack' '
+	cat >in <<-EOF &&
+	$(git -C server rev-parse HEAD)
+	^$(git -C server rev-parse HEAD~2)
+	EOF
+	git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+	git -C server index-pack --fix-thin --stdin <out.pack
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 03/13] pack-objects: update usage to match docs
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-12 21:14   ` Taylor Blau
  2025-03-10  1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-pack-objects.adoc | 14 +++++++-------
 builtin/pack-objects.c              | 10 ++++++++--
 t/t0450/adoc-help-mismatches        |  1 -
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 7dbbe6d54d2..ad765334729 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -10,13 +10,13 @@ SYNOPSIS
 --------
 [verse]
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
-	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
-	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
-	[--cruft] [--cruft-expiration=<time>]
-	[--stdout [--filter=<filter-spec>] | <base-name>]
-	[--shallow] [--keep-true-parents] [--[no-]sparse]
-	[--name-hash-version=<n>] [--path-walk] < <object-list>
+		   [--no-reuse-delta] [--delta-base-offset] [--non-empty]
+		   [--local] [--incremental] [--window=<n>] [--depth=<n>]
+		   [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+		   [--cruft] [--cruft-expiration=<time>]
+		   [--stdout [--filter=<filter-spec>] | <base-name>]
+		   [--shallow] [--keep-true-parents] [--[no-]sparse]
+		   [--name-hash-version=<n>] [--path-walk] < <object-list>
 
 
 DESCRIPTION
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5596c409927..fe23db3c5d8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -187,8 +187,14 @@ static inline void oe_set_delta_size(struct packing_data *pack,
 #define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
 
 static const char *pack_usage[] = {
-	N_("git pack-objects --stdout [<options>] [< <ref-list> | < <object-list>]"),
-	N_("git pack-objects [<options>] <base-name> [< <ref-list> | < <object-list>]"),
+	N_("git pack-objects [-q | --progress | --all-progress] [--all-progress-implied]\n"
+	   "                 [--no-reuse-delta] [--delta-base-offset] [--non-empty]\n"
+	   "                 [--local] [--incremental] [--window=<n>] [--depth=<n>]\n"
+	   "                 [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]\n"
+	   "                 [--cruft] [--cruft-expiration=<time>]\n"
+	   "                 [--stdout [--filter=<filter-spec>] | <base-name>]\n"
+	   "                 [--shallow] [--keep-true-parents] [--[no-]sparse]\n"
+	   "                 [--name-hash-version=<n>] [--path-walk] < <object-list>"),
 	NULL
 };
 
diff --git a/t/t0450/adoc-help-mismatches b/t/t0450/adoc-help-mismatches
index c4a15fd0cb8..06b469bdee2 100644
--- a/t/t0450/adoc-help-mismatches
+++ b/t/t0450/adoc-help-mismatches
@@ -38,7 +38,6 @@ merge-one-file
 multi-pack-index
 name-rev
 notes
-pack-objects
 push
 range-diff
 rebase
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 04/13] p5313: add performance tests for --path-walk
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data (removing
the repack tests for --name-hash-version):

Test                                                     this tree
------------------------------------------------------------------------
5313.2: thin pack with --name-hash-version=1             0.02(0.01+0.01)
5313.3: thin pack size with --name-hash-version=1                   1.6K
5313.4: big pack with --name-hash-version=1              2.55(4.20+0.26)
5313.5: big pack size with --name-hash-version=1                   16.4M
5313.6: shallow fetch pack with --name-hash-version=1    1.24(2.03+0.08)
5313.7: shallow pack size with --name-hash-version=1               12.2M
5313.10: thin pack with --name-hash-version=2            0.03(0.01+0.01)
5313.11: thin pack size with --name-hash-version=2                  1.6K
5313.12: big pack with --name-hash-version=2             1.91(3.23+0.20)
5313.13: big pack size with --name-hash-version=2                  16.4M
5313.14: shallow fetch pack with --name-hash-version=2   1.06(1.57+0.10)
5313.15: shallow pack size with --name-hash-version=2              12.5M
5313.18: thin pack with --path-walk                      0.03(0.01+0.01)
5313.19: thin pack size with --path-walk                            1.6K
5313.20: big pack with --path-walk                       2.05(3.24+0.27)
5313.21: big pack size with --path-walk                            16.3M
5313.22: shallow fetch pack with --path-walk             1.08(1.66+0.07)
5313.23: shallow pack size with --path-walk                        12.4M

This can be reformatted as follows:

Pack Type            Hash v1   Hash v2     Path Walk
---------------------------------------------------
thin pack    (time)    0.02s      0.03s      0.03s
             (size)    1.6K       1.6K       1.6K
big pack     (time)    2.55s      1.91s      2.05s
             (size)   16.4M      16.4M      16.3M
shallow pack (time)    1.24s      1.06s      1.08s
             (size)   12.2M      12.5M      12.4M

Note that the timing is slower because there is no threading in the
--path-walk case (yet). Also, the shallow pack cases are really not
using the --path-walk logic right now because it is disabled until some
additions are made to the path walk API.

The cases where the --path-walk option really shines is when the default
name-hash is overwhelmed with collisions. An open source example can be
found in the microsoft/fluentui repo [1] at a certain commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following comparison table:

Pack Type            Hash v1    Hash v2    Path Walk
---------------------------------------------------
thin pack    (time)    0.36s      0.12s      0.08s
             (size)    1.2M      22.0K      18.4K
big pack     (time)    2.00s      2.90s      2.21s
             (size)   20.4M      25.9M      19.5M
shallow pack (time)    1.41s      1.80s      1.65s
             (size)   34.4M      33.7M      33.6M

Notice in particular that in the small thin pack, the time performance
has improved from 0.36s for --name-hash-version=1 to 0.08s and this is
likely due to the improved size of the resulting pack: 18.4K instead of
1.2M.  The relatively new --name-hash-version=2 is competitive with
--path-walk (0.12s and 22.0K) but not quite as successful.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Pack Type            Hash v1    Hash v2    Path Walk
---------------------------------------------------
thin pack    (time)    0.03s      0.13s      0.03s
             (size)    4.6K       4.6K       4.6K
big pack     (time)   15.29s     12.32s     13.92s
             (size)  201.1M     159.1M     158.5M
shallow pack (time)   10.88s     22.93s     22.74s
             (size)  269.2M     273.8M     267.7M

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/perf/p5313-pack-objects.sh | 37 ++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index be5229a0ecd..cd6dd3abb71 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -25,46 +25,55 @@ test_expect_success 'create rev input' '
 	EOF
 '
 
-for version in 1 2
-do
-	export version
+test_all_with_args () {
+	parameter=$1
+	export parameter
 
-	test_perf "thin pack with version $version" '
+	test_perf "thin pack with $parameter" '
 		git pack-objects --thin --stdout --revs --sparse \
-			--name-hash-version=$version <in-thin >out
+			$parameter <in-thin >out
 	'
 
-	test_size "thin pack size with version $version" '
+	test_size "thin pack size with $parameter" '
 		test_file_size out
 	'
 
-	test_perf "big pack with version $version" '
+	test_perf "big pack with $parameter" '
 		git pack-objects --stdout --revs --sparse \
-			--name-hash-version=$version <in-big >out
+			$parameter <in-big >out
 	'
 
-	test_size "big pack size with version $version" '
+	test_size "big pack size with $parameter" '
 		test_file_size out
 	'
 
-	test_perf "shallow fetch pack with version $version" '
+	test_perf "shallow fetch pack with $parameter" '
 		git pack-objects --stdout --revs --sparse --shallow \
-			--name-hash-version=$version <in-shallow >out
+			$parameter <in-shallow >out
 	'
 
-	test_size "shallow pack size with version $version" '
+	test_size "shallow pack size with $parameter" '
 		test_file_size out
 	'
+}
 
-	test_perf "repack with version $version" '
+for version in 1 2
+do
+	export version
+
+	test_all_with_args --name-hash-version=$version
+
+	test_perf "repack with --name-hash-version=$version" '
 		git repack -adf --name-hash-version=$version
 	'
 
-	test_size "repack size with version $version" '
+	test_size "repack size with --name-hash-version=$version" '
 		gitdir=$(git rev-parse --git-dir) &&
 		pack=$(ls $gitdir/objects/pack/pack-*.pack) &&
 		test_file_size "$pack"
 	'
 done
 
+test_all_with_args --path-walk
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the
linux-TEST-vars CI build as that's already an overloaded build.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c        | 12 ++++++++++--
 t/README                      |  4 ++++
 t/t0411-clone-from-partial.sh |  6 ++++++
 t/t5306-pack-nobase.sh        |  5 +++++
 t/t5310-pack-bitmaps.sh       | 13 +++++++++++--
 t/t5316-pack-delta-depth.sh   |  9 ++++++---
 t/t5332-multi-pack-reuse.sh   |  7 +++++++
 t/t5530-upload-pack-error.sh  |  6 ++++++
 t/t7406-submodule-update.sh   |  3 +++
 9 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index fe23db3c5d8..120202b05e9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -226,7 +226,7 @@ static int delta_search_threads;
 static int pack_to_stdout;
 static int sparse;
 static int thin;
-static int path_walk;
+static int path_walk = -1;
 static int num_preferred_base;
 static struct progress *progress_state;
 
@@ -4227,7 +4227,7 @@ static int add_objects_by_path(const char *path,
 		struct object_id *oid = &oids->oid[i];
 
 		/* Skip objects that do not exist locally. */
-		if (exclude_promisor_objects &&
+		if ((exclude_promisor_objects || arg_missing_action != MA_ERROR) &&
 		    oid_object_info_extended(the_repository, oid, &oi,
 					     OBJECT_INFO_FOR_PREFETCH) < 0)
 			continue;
@@ -4645,6 +4645,14 @@ int cmd_pack_objects(int argc,
 	if (pack_to_stdout != !base_name || argc)
 		usage_with_options(pack_usage, pack_objects_options);
 
+	if (path_walk < 0) {
+		if (use_bitmap_index > 0 ||
+		    !use_internal_rev_list)
+			path_walk = 0;
+		else
+			path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
+	}
+
 	if (depth < 0)
 		depth = 0;
 	if (depth >= (1 << OE_DEPTH_BITS)) {
diff --git a/t/README b/t/README
index 53e5b4a7107..ae06e628815 100644
--- a/t/README
+++ b/t/README
@@ -415,6 +415,10 @@ GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
 builtin to use the non-sparse object walk. This can still be overridden by
 the --sparse command-line argument.
 
+GIT_TEST_PACK_PATH_WALK=<boolean> if enabled will default the pack-objects
+builtin to use the path-walk API for the object walk. This can still be
+overridden by the --no-path-walk command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t0411-clone-from-partial.sh b/t/t0411-clone-from-partial.sh
index 196fc617843..9e6bca56255 100755
--- a/t/t0411-clone-from-partial.sh
+++ b/t/t0411-clone-from-partial.sh
@@ -59,6 +59,12 @@ test_expect_success 'pack-objects should fetch from promisor remote and execute
 
 test_expect_success 'clone from promisor remote does not lazy-fetch by default' '
 	rm -f script-executed &&
+
+	# The --path-walk feature of "git pack-objects" is not
+	# compatible with this kind of fetch from an incomplete repo.
+	GIT_TEST_PACK_PATH_WALK=0 &&
+	export GIT_TEST_PACK_PATH_WALK &&
+
 	test_must_fail git clone evil no-lazy 2>err &&
 	test_grep "lazy fetching disabled" err &&
 	test_path_is_missing script-executed
diff --git a/t/t5306-pack-nobase.sh b/t/t5306-pack-nobase.sh
index 805d60ff317..609399d54fb 100755
--- a/t/t5306-pack-nobase.sh
+++ b/t/t5306-pack-nobase.sh
@@ -59,6 +59,11 @@ test_expect_success 'indirectly clone patch_clone' '
 	 git pull ../.git &&
 	 test $(git rev-parse HEAD) = $B &&
 
+	# The --path-walk feature of "git pack-objects" is not
+	# compatible with this kind of fetch from an incomplete repo.
+	GIT_TEST_PACK_PATH_WALK=0 &&
+	export GIT_TEST_PACK_PATH_WALK &&
+
 	 git pull ../patch_clone/.git &&
 	 test $(git rev-parse HEAD) = $C
 	)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 621bbbdd26e..e01df807a62 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -158,8 +158,9 @@ test_bitmap_cases () {
 		ls .git/objects/pack/ | grep bitmap >output &&
 		test_line_count = 1 output &&
 		# verify equivalent packs are generated with/without using bitmap index
-		packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
-		packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+		# Be careful to not use the path-walk option in either case.
+		packasha1=$(git pack-objects --no-use-bitmap-index --no-path-walk --all packa </dev/null) &&
+		packbsha1=$(git pack-objects --use-bitmap-index --no-path-walk --all packb </dev/null) &&
 		list_packed_objects packa-$packasha1.idx >packa.objects &&
 		list_packed_objects packb-$packbsha1.idx >packb.objects &&
 		test_cmp packa.objects packb.objects
@@ -388,6 +389,14 @@ test_bitmap_cases () {
 		git init --bare client.git &&
 		(
 			cd client.git &&
+
+			# This test relies on reusing a delta, but if the
+			# path-walk machinery is engaged, the base object
+			# is considered too small to use during the
+			# dynamic computation, so is not used.
+			GIT_TEST_PACK_PATH_WALK=0 &&
+			export GIT_TEST_PACK_PATH_WALK &&
+
 			git config transfer.unpackLimit 1 &&
 			git fetch .. delta-reuse-old:delta-reuse-old &&
 			git fetch .. delta-reuse-new:delta-reuse-new &&
diff --git a/t/t5316-pack-delta-depth.sh b/t/t5316-pack-delta-depth.sh
index 32cf4227451..167c3a35234 100755
--- a/t/t5316-pack-delta-depth.sh
+++ b/t/t5316-pack-delta-depth.sh
@@ -89,15 +89,18 @@ max_chain() {
 # adjusted (or scrapped if the heuristics have become too unreliable)
 test_expect_success 'packing produces a long delta' '
 	# Use --window=0 to make sure we are seeing reused deltas,
-	# not computing a new long chain.
-	pack=$(git pack-objects --all --window=0 </dev/null pack) &&
+	# not computing a new long chain. (Also avoid the --path-walk
+	# option as it may break delta chains.)
+	pack=$(git pack-objects --all --window=0 --no-path-walk </dev/null pack) &&
 	echo 9 >expect &&
 	max_chain pack-$pack.pack >actual &&
 	test_cmp expect actual
 '
 
 test_expect_success '--depth limits depth' '
-	pack=$(git pack-objects --all --depth=5 </dev/null pack) &&
+	# Avoid --path-walk to avoid breaking delta chains across path
+	# boundaries.
+	pack=$(git pack-objects --all --depth=5 --no-path-walk </dev/null pack) &&
 	echo 5 >expect &&
 	max_chain pack-$pack.pack >actual &&
 	test_cmp expect actual
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 57cad7708f8..395d09444ce 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -7,6 +7,13 @@ test_description='pack-objects multi-pack reuse'
 
 GIT_TEST_MULTI_PACK_INDEX=0
 GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# The --path-walk option does not consider the preferred pack
+# at all for reusing deltas, so this variable changes the
+# behavior of this test, if enabled.
+GIT_TEST_PACK_PATH_WALK=0
+export GIT_TEST_PACK_PATH_WALK
+
 objdir=.git/objects
 packdir=$objdir/pack
 
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 558eedf25a4..8eb6fea839a 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -34,6 +34,12 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
 	hexsz=$(test_oid hexsz) &&
 	printf "%04xwant %s\n00000009done\n0000" \
 		$(($hexsz + 10)) $head >input &&
+
+	# The current implementation of path-walk causes a different
+	# error message. This will be changed by a future refactoring.
+	GIT_TEST_PACK_PATH_WALK=0 &&
+	export GIT_TEST_PACK_PATH_WALK &&
+
 	test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
 	test_grep "unable to read" output.err &&
 	test_grep "pack-objects died" output.err
diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index c562bad042a..ab76d4b6dc4 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -1095,12 +1095,15 @@ test_expect_success 'submodule update --quiet passes quietness to fetch with a s
 	(cd super5 &&
 	 # This test var can mess with the stderr output checked in this test.
 	 GIT_TEST_NAME_HASH_VERSION=1 \
+	 GIT_TEST_PACK_PATH_WALK=0 \
 		git submodule update --quiet --init --depth=1 submodule3 >out 2>err &&
 	 test_must_be_empty out &&
 	 test_must_be_empty err
 	) &&
 	git clone super4 super6 &&
 	(cd super6 &&
+	 # This test variable will create a "warning" message to stderr
+	 GIT_TEST_PACK_PATH_WALK=0 \
 	 git submodule update --init --depth=1 submodule3 >out 2>err &&
 	 test_file_not_empty out &&
 	 test_file_not_empty err
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/t5538-push-shallow.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/t/t5538-push-shallow.sh b/t/t5538-push-shallow.sh
index e91fcc173e8..11b85cca9e8 100755
--- a/t/t5538-push-shallow.sh
+++ b/t/t5538-push-shallow.sh
@@ -123,4 +123,38 @@ EOF
 	git cat-file blob $(echo 1|git hash-object --stdin) >/dev/null
 	)
 '
+
+test_expect_success 'push new commit from shallow clone has correct object count' '
+	git init origin &&
+	test_commit -C origin a &&
+	test_commit -C origin b &&
+
+	git clone --depth=1 "file://$(pwd)/origin" client &&
+	git -C client checkout -b topic &&
+	git -C client commit --allow-empty -m "empty" &&
+	GIT_PROGRESS_DELAY=0 git -C client push --progress origin topic 2>err &&
+	test_grep "Enumerating objects: 1, done." err
+'
+
+test_expect_success 'push new commit from shallow clone has good deltas' '
+	git init base &&
+	test_seq 1 999 >base/a &&
+	test_commit -C base initial &&
+	git -C base add a &&
+	git -C base commit -m "big a" &&
+
+	git clone --depth=1 "file://$(pwd)/base" deltas &&
+	git -C deltas checkout -b deltas &&
+	test_seq 1 1000 >deltas/a &&
+	git -C deltas commit -a -m "bigger a" &&
+	GIT_TRACE2_PERF="$(pwd)/trace.txt" \
+	GIT_PROGRESS_DELAY=0 git -C deltas push --progress origin deltas 2>err &&
+
+	test_grep "Enumerating objects: 5, done" err &&
+
+	# If the delta base is found, then this message uses "bytes".
+	# If the delta base is not found, then this message uses "KiB".
+	test_grep "Writing objects: .* bytes" err
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 07/13] repack: add --path-walk option
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the --path-walk tests in p5313 look like this:

Test                                                     this tree
-------------------------------------------------------------------------
5313.18: thin pack with --path-walk                      0.08(0.06+0.02)
5313.19: thin pack size with --path-walk                           18.4K
5313.20: big pack with --path-walk                       2.10(7.80+0.26)
5313.21: big pack size with --path-walk                            19.8M
5313.22: shallow fetch pack with --path-walk             1.62(3.38+0.17)
5313.23: shallow pack size with --path-walk                        33.6M
5313.24: repack with --path-walk                         81.29(96.08+0.71)
5313.25: repack size with --path-walk                             142.5M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Along with the earlier tests in p5313, I'll instead reformat the
comparison as follows:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      81.29s

There are a few things to notice here:

 1. The benefits of --name-hash-version=2 over --name-hash-version=1 are
    significant, but --path-walk still compresses better than that
    option.

 2. The --path-walk command is still using --name-hash-version=1 for the
    second pass of delta computation, using the increased name hash
    collisions as a potential method for opportunistic compression on
    top of the path-focused compression.

 3. The --path-walk algorithm is currently sequential and does not use
    multiple threads for delta compression. Threading will be
    implemented in a future change so the computation time will improve
    to better compete in this metric.

There are small benefits in size for my copy of the Git repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             248.8M      30.44s
Hash v2             249.0M      30.15s
Path Walk           213.2M     142.50s

As well as in the nodejs/node repository [3]:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             739.9M      71.18s
Hash v2             764.6M      67.82s
Path Walk           698.1M     208.10s

[3] https://github.com/nodejs/node

This benefit also repeats in my copy of the Linux kernel repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1               2.5G     554.41s
Hash v2               2.5G     549.62s
Path Walk             2.2G    1562.36s

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method.

As this repacking strategy was released in Git for Windows 2.47.0, some
users have reported cases where the --path-walk compression is slightly
worse than the --name-hash-version=2 option. In those cases, it may be
beneficial to combine the two options. However, there has not been a
released version of Git that has both options and I don't have access to
these repos for testing.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-repack.adoc | 14 +++++++++++++-
 builtin/repack.c              |  7 ++++++-
 t/perf/p5313-pack-objects.sh  | 18 ++++++++----------
 3 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/Documentation/git-repack.adoc b/Documentation/git-repack.adoc
index 5852a5c9736..2199eb3b94f 100644
--- a/Documentation/git-repack.adoc
+++ b/Documentation/git-repack.adoc
@@ -11,7 +11,7 @@ SYNOPSIS
 [verse]
 'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
 	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
-	[--write-midx] [--name-hash-version=<n>]
+	[--write-midx] [--name-hash-version=<n>] [--path-walk]
 
 DESCRIPTION
 -----------
@@ -255,6 +255,18 @@ linkgit:git-multi-pack-index[1]).
 	Provide this argument to the underlying `git pack-objects` process.
 	See linkgit:git-pack-objects[1] for full details.
 
+--path-walk::
+	This option passes the `--path-walk` option to the underlying
+	`git pack-options` process (see linkgit:git-pack-objects[1]).
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path they
+	appear relative to a commit's root tree. The `--path-walk` option
+	enables a different walking algorithm that organizes trees and
+	blobs by path. This has the potential to improve delta compression
+	especially in the presence of filenames that cause collisions in
+	Git's default name-hash algorithm. Due to changing how the objects
+	are walked, this option is not compatible with `--delta-islands`
+	or `--filter`.
 
 CONFIGURATION
 -------------
diff --git a/builtin/repack.c b/builtin/repack.c
index 75e3752353a..d7f798280c0 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -43,7 +43,7 @@ static char *packdir, *packtmp_name, *packtmp;
 static const char *const git_repack_usage[] = {
 	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
 	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
-	   "[--write-midx] [--name-hash-version=<n>]"),
+	   "[--write-midx] [--name-hash-version=<n>] [--path-walk]"),
 	NULL
 };
 
@@ -63,6 +63,7 @@ struct pack_objects_args {
 	int quiet;
 	int local;
 	int name_hash_version;
+	int path_walk;
 	struct list_objects_filter_options filter_options;
 };
 
@@ -313,6 +314,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--no-reuse-object");
 	if (args->name_hash_version)
 		strvec_pushf(&cmd->args, "--name-hash-version=%d", args->name_hash_version);
+	if (args->path_walk)
+		strvec_pushf(&cmd->args, "--path-walk");
 	if (args->local)
 		strvec_push(&cmd->args,  "--local");
 	if (args->quiet)
@@ -1212,6 +1215,8 @@ int cmd_repack(int argc,
 				N_("pass --no-reuse-object to git-pack-objects")),
 		OPT_INTEGER(0, "name-hash-version", &po_args.name_hash_version,
 				N_("specify the name hash version to use for grouping similar objects by path")),
+		OPT_BOOL(0, "path-walk", &po_args.path_walk,
+				N_("pass --path-walk to git-pack-objects")),
 		OPT_NEGBIT('n', NULL, &run_update_server_info,
 				N_("do not run git-update-server-info"), 1),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index cd6dd3abb71..98748b0e203 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -55,23 +55,21 @@ test_all_with_args () {
 	test_size "shallow pack size with $parameter" '
 		test_file_size out
 	'
-}
-
-for version in 1 2
-do
-	export version
-
-	test_all_with_args --name-hash-version=$version
 
-	test_perf "repack with --name-hash-version=$version" '
-		git repack -adf --name-hash-version=$version
+	test_perf "repack with $parameter" '
+		git repack -adf $parameter
 	'
 
-	test_size "repack size with --name-hash-version=$version" '
+	test_size "repack size with $parameter" '
 		gitdir=$(git rev-parse --git-dir) &&
 		pack=$(ls $gitdir/objects/pack/pack-*.pack) &&
 		test_file_size "$pack"
 	'
+}
+
+for version in 1 2
+do
+	test_all_with_args --name-hash-version=$version
 done
 
 test_all_with_args --path-walk
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 08/13] pack-objects: enable --path-walk via config
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/config/feature.adoc | 4 ++++
 Documentation/config/pack.adoc    | 8 ++++++++
 builtin/pack-objects.c            | 3 +++
 repo-settings.c                   | 3 +++
 repo-settings.h                   | 1 +
 5 files changed, 19 insertions(+)

diff --git a/Documentation/config/feature.adoc b/Documentation/config/feature.adoc
index f061b64b748..cb49ff2604a 100644
--- a/Documentation/config/feature.adoc
+++ b/Documentation/config/feature.adoc
@@ -20,6 +20,10 @@ walking fewer objects.
 +
 * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
 reusing objects from multiple packs instead of just one.
++
+* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
+significantly smaller in the presence of certain filename collisions with Git's
+default name-hash.
 
 feature.manyFiles::
 	Enable config options that optimize for repos with many files in the
diff --git a/Documentation/config/pack.adoc b/Documentation/config/pack.adoc
index da527377faf..08d06271177 100644
--- a/Documentation/config/pack.adoc
+++ b/Documentation/config/pack.adoc
@@ -155,6 +155,14 @@ pack.useSparse::
 	commits contain certain types of direct renames. Default is
 	`true`.
 
+pack.usePathWalk::
+	When true, git will default to using the '--path-walk' option in
+	'git pack-objects' when the '--revs' option is present. This
+	algorithm groups objects by path to maximize the ability to
+	compute delta chains across historical versions of the same
+	object. This may disable other options, such as using bitmaps to
+	enumerate objects.
+
 pack.preferBitmapTips::
 	When selecting which commits will receive bitmaps, prefer a
 	commit at the tip of any reference that is a suffix of any value
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 120202b05e9..c756ce50dd7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4649,6 +4649,9 @@ int cmd_pack_objects(int argc,
 		if (use_bitmap_index > 0 ||
 		    !use_internal_rev_list)
 			path_walk = 0;
+		else if (the_repository->gitdir &&
+			 the_repository->settings.pack_use_path_walk)
+			path_walk = 1;
 		else
 			path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
 	}
diff --git a/repo-settings.c b/repo-settings.c
index 67e9cfd2e63..9b5595c708e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,11 +47,13 @@ void prepare_repo_settings(struct repository *r)
 		r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
 		r->settings.pack_use_bitmap_boundary_traversal = 1;
 		r->settings.pack_use_multi_pack_reuse = 1;
+		r->settings.pack_use_path_walk = 1;
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
 		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
+		r->settings.pack_use_path_walk = 1;
 	}
 
 	/* Commit graph config or default, does not cascade (simple) */
@@ -66,6 +68,7 @@ void prepare_repo_settings(struct repository *r)
 
 	/* Boolean config or default, does not cascade (simple)  */
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
+	repo_cfg_bool(r, "pack.usepathwalk", &r->settings.pack_use_path_walk, 0);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
diff --git a/repo-settings.h b/repo-settings.h
index ddc11967e01..a31decad221 100644
--- a/repo-settings.h
+++ b/repo-settings.h
@@ -56,6 +56,7 @@ struct repo_settings {
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
+	int pack_use_path_walk;
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 09/13] scalar: enable path-walk during push via config
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 scalar.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scalar.c b/scalar.c
index da42b4be0cc..bf638fa34b8 100644
--- a/scalar.c
+++ b/scalar.c
@@ -170,6 +170,7 @@ static int set_recommended_config(int reconfigure)
 		{ "core.autoCRLF", "false" },
 		{ "core.safeCRLF", "false" },
 		{ "fetch.showForcedUpdates", "false" },
+		{ "pack.usePathWalk", "true" },
 		{ NULL, NULL },
 	};
 	int i;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 10/13] pack-objects: refactor path-walk delta phase
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-12 21:21   ` Taylor Blau
  2025-03-10  1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c       | 82 +++++++++++++++++++++++++-----------
 pack-objects.h               | 12 ++++++
 t/t5300-pack-object.sh       |  8 +++-
 t/t5530-upload-pack-error.sh |  6 ---
 4 files changed, 75 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c756ce50dd7..c5a3129c88e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3233,6 +3233,51 @@ static int should_attempt_deltas(struct object_entry *entry)
 	return 1;
 }
 
+static void find_deltas_for_region(struct object_entry *list UNUSED,
+				   struct packing_region *region,
+				   unsigned int *processed)
+{
+	struct object_entry **delta_list;
+	uint32_t delta_list_nr = 0;
+
+	ALLOC_ARRAY(delta_list, region->nr);
+	for (uint32_t i = 0; i < region->nr; i++) {
+		struct object_entry *entry = to_pack.objects + region->start + i;
+		if (should_attempt_deltas(entry))
+			delta_list[delta_list_nr++] = entry;
+	}
+
+	QSORT(delta_list, delta_list_nr, type_size_sort);
+	find_deltas(delta_list, &delta_list_nr, window, depth, processed);
+	free(delta_list);
+}
+
+static void find_deltas_by_region(struct object_entry *list,
+				  struct packing_region *regions,
+				  uint32_t start, uint32_t nr)
+{
+	unsigned int processed = 0;
+	uint32_t progress_nr;
+
+	if (!nr)
+		return;
+
+	progress_nr = regions[nr - 1].start + regions[nr - 1].nr;
+
+	if (progress)
+		progress_state = start_progress(the_repository,
+						_("Compressing objects by path"),
+						progress_nr);
+
+	while (nr--)
+		find_deltas_for_region(list,
+				       &regions[start++],
+				       &processed);
+
+	display_progress(progress_state, progress_nr);
+	stop_progress(&progress_state);
+}
+
 static void prepare_pack(int window, int depth)
 {
 	struct object_entry **delta_list;
@@ -3257,6 +3302,10 @@ static void prepare_pack(int window, int depth)
 	if (!to_pack.nr_objects || !window || !depth)
 		return;
 
+	if (path_walk)
+		find_deltas_by_region(to_pack.objects, to_pack.regions,
+				      0, to_pack.nr_regions);
+
 	ALLOC_ARRAY(delta_list, to_pack.nr_objects);
 	nr_deltas = n = 0;
 
@@ -4210,10 +4259,8 @@ static int add_objects_by_path(const char *path,
 			       enum object_type type,
 			       void *data)
 {
-	struct object_entry **delta_list;
 	size_t oe_start = to_pack.nr_objects;
 	size_t oe_end;
-	unsigned int sub_list_size;
 	unsigned int *processed = data;
 
 	/*
@@ -4246,32 +4293,17 @@ static int add_objects_by_path(const char *path,
 	if (oe_end == oe_start || !window)
 		return 0;
 
-	sub_list_size = 0;
-	ALLOC_ARRAY(delta_list, oe_end - oe_start);
-
-	for (size_t i = 0; i < oe_end - oe_start; i++) {
-		struct object_entry *entry = to_pack.objects + oe_start + i;
+	ALLOC_GROW(to_pack.regions,
+		   to_pack.nr_regions + 1,
+		   to_pack.nr_regions_alloc);
 
-		if (!should_attempt_deltas(entry))
-			continue;
-
-		delta_list[sub_list_size++] = entry;
-	}
+	to_pack.regions[to_pack.nr_regions].start = oe_start;
+	to_pack.regions[to_pack.nr_regions].nr = oe_end - oe_start;
+	to_pack.nr_regions++;
 
-	/*
-	 * Find delta bases among this list of objects that all match the same
-	 * path. This causes the delta compression to be interleaved in the
-	 * object walk, which can lead to confusing progress indicators. This is
-	 * also incompatible with threaded delta calculations. In the future,
-	 * consider creating a list of regions in the full to_pack.objects array
-	 * that could be picked up by the threaded delta computation.
-	 */
-	if (sub_list_size && window) {
-		QSORT(delta_list, sub_list_size, type_size_sort);
-		find_deltas(delta_list, &sub_list_size, window, depth, processed);
-	}
+	*processed += oids->nr;
+	display_progress(progress_state, *processed);
 
-	free(delta_list);
 	return 0;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index d73e3843c92..1b01304f9c4 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -119,11 +119,23 @@ struct object_entry {
 	unsigned ext_base:1; /* delta_idx points outside packlist */
 };
 
+/**
+ * A packing region is a section of the packing_data.objects array
+ * as given by a starting index and a number of elements.
+ */
+struct packing_region {
+	uint32_t start;
+	uint32_t nr;
+};
+
 struct packing_data {
 	struct repository *repo;
 	struct object_entry *objects;
 	uint32_t nr_objects, nr_alloc;
 
+	struct packing_region *regions;
+	uint32_t nr_regions, nr_regions_alloc;
+
 	int32_t *index;
 	uint32_t index_size;
 
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 16420d12863..c8df6afd784 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -725,7 +725,9 @@ test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompat
 
 test_expect_success '--path-walk pack everything' '
 	git -C server rev-parse HEAD >in &&
-	git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+	GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+		--stdout --revs --path-walk --progress <in >out.pack 2>err &&
+	grep "Compressing objects by path" err &&
 	git -C server index-pack --stdin <out.pack
 '
 
@@ -734,7 +736,9 @@ test_expect_success '--path-walk thin pack' '
 	$(git -C server rev-parse HEAD)
 	^$(git -C server rev-parse HEAD~2)
 	EOF
-	git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+	GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+		--thin --stdout --revs --path-walk --progress <in >out.pack 2>err &&
+	grep "Compressing objects by path" err &&
 	git -C server index-pack --fix-thin --stdin <out.pack
 '
 
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 8eb6fea839a..558eedf25a4 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -34,12 +34,6 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
 	hexsz=$(test_oid hexsz) &&
 	printf "%04xwant %s\n00000009done\n0000" \
 		$(($hexsz + 10)) $head >input &&
-
-	# The current implementation of path-walk causes a different
-	# error message. This will be changed by a future refactoring.
-	GIT_TEST_PACK_PATH_WALK=0 &&
-	export GIT_TEST_PACK_PATH_WALK &&
-
 	test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
 	test_grep "unable to read" output.err &&
 	test_grep "pack-objects died" output.err
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 11/13] pack-objects: thread the path-based compression
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                        HEAD~1      HEAD
---------------------------------------------------
5313.20: big pack             2.38      1.99 -16.4%
5313.21: big pack size       16.1M     16.0M  -0.2%
5313.24: repack             107.32     45.41 -57.7%
5313.25: repack size        213.3M    213.2M  -0.0%

(Test output is formatted to better fit in message.)

This ~60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the --name-hash-version=1
case. These time improvements correlate with repositories with data
shapes that significantly improve their data size as well. The
--path-walk feature frequently takes longer than --name-hash-verison=2,
trading some extrac computation for some additional compression. The
natural place where this additional computation comes from is the two
compression passes that --path-walk takes, though the first pass is
naturally faster due to the path boundaries avoiding a number of delta
compression attempts.

For example, the microsoft/fluentui repo has significant size reduction
from --name-hash-version=1 to --name-hash-version=2 followed by further
improvements with --path-walk. The threaded computation makes
--path-walk more competitive in time compared to --name-hash-version=2,
though still ~31% more expensive in that metric.

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                439.4M      87.24s
Hash v2                161.7M      21.51s
Path Walk (Before)     142.5M      81.29s
Path Walk (After)      142.5M      28.16s

Similar results hold for the Git repository:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                248.8M      30.44s
Hash v2                249.0M      30.15s
Path Walk (Before)     213.2M     142.50s
Path Walk (After)      213.3M      45.41s

...as well as the nodejs/node repository:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk (Before)     698.1M     208.10s
Path Walk (After)      698.0M      75.10s

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more subtle:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk (before)       2.2G    1562.36s
Path Walk (before)       2.2G     559.00s

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c | 163 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 161 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c5a3129c88e..e8b1b057ec3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2964,6 +2964,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
 struct thread_params {
 	pthread_t thread;
 	struct object_entry **list;
+	struct packing_region *regions;
 	unsigned list_size;
 	unsigned remaining;
 	int window;
@@ -3278,6 +3279,164 @@ static void find_deltas_by_region(struct object_entry *list,
 	stop_progress(&progress_state);
 }
 
+static void *threaded_find_deltas_by_path(void *arg)
+{
+	struct thread_params *me = arg;
+
+	progress_lock();
+	while (me->remaining) {
+		while (me->remaining) {
+			progress_unlock();
+			find_deltas_for_region(to_pack.objects,
+					       me->regions,
+					       me->processed);
+			progress_lock();
+			me->remaining--;
+			me->regions++;
+		}
+
+		me->working = 0;
+		pthread_cond_signal(&progress_cond);
+		progress_unlock();
+
+		/*
+		 * We must not set ->data_ready before we wait on the
+		 * condition because the main thread may have set it to 1
+		 * before we get here. In order to be sure that new
+		 * work is available if we see 1 in ->data_ready, it
+		 * was initialized to 0 before this thread was spawned
+		 * and we reset it to 0 right away.
+		 */
+		pthread_mutex_lock(&me->mutex);
+		while (!me->data_ready)
+			pthread_cond_wait(&me->cond, &me->mutex);
+		me->data_ready = 0;
+		pthread_mutex_unlock(&me->mutex);
+
+		progress_lock();
+	}
+	progress_unlock();
+	/* leave ->working 1 so that this doesn't get more work assigned */
+	return NULL;
+}
+
+static void ll_find_deltas_by_region(struct object_entry *list,
+				     struct packing_region *regions,
+				     uint32_t start, uint32_t nr)
+{
+	struct thread_params *p;
+	int i, ret, active_threads = 0;
+	unsigned int processed = 0;
+	uint32_t progress_nr;
+	init_threaded_search();
+
+	if (!nr)
+		return;
+
+	progress_nr =  regions[nr - 1].start + regions[nr - 1].nr;
+	if (delta_search_threads <= 1) {
+		find_deltas_by_region(list, regions, start, nr);
+		cleanup_threaded_search();
+		return;
+	}
+
+	if (progress > pack_to_stdout)
+		fprintf_ln(stderr, _("Path-based delta compression using up to %d threads"),
+			   delta_search_threads);
+	CALLOC_ARRAY(p, delta_search_threads);
+
+	if (progress)
+		progress_state = start_progress(the_repository,
+						_("Compressing objects by path"),
+						progress_nr);
+	/* Partition the work amongst work threads. */
+	for (i = 0; i < delta_search_threads; i++) {
+		unsigned sub_size = nr / (delta_search_threads - i);
+
+		p[i].window = window;
+		p[i].depth = depth;
+		p[i].processed = &processed;
+		p[i].working = 1;
+		p[i].data_ready = 0;
+
+		p[i].regions = regions;
+		p[i].list_size = sub_size;
+		p[i].remaining = sub_size;
+
+		regions += sub_size;
+		nr -= sub_size;
+	}
+
+	/* Start work threads. */
+	for (i = 0; i < delta_search_threads; i++) {
+		if (!p[i].list_size)
+			continue;
+		pthread_mutex_init(&p[i].mutex, NULL);
+		pthread_cond_init(&p[i].cond, NULL);
+		ret = pthread_create(&p[i].thread, NULL,
+				     threaded_find_deltas_by_path, &p[i]);
+		if (ret)
+			die(_("unable to create thread: %s"), strerror(ret));
+		active_threads++;
+	}
+
+	/*
+	 * Now let's wait for work completion.  Each time a thread is done
+	 * with its work, we steal half of the remaining work from the
+	 * thread with the largest number of unprocessed objects and give
+	 * it to that newly idle thread.  This ensure good load balancing
+	 * until the remaining object list segments are simply too short
+	 * to be worth splitting anymore.
+	 */
+	while (active_threads) {
+		struct thread_params *target = NULL;
+		struct thread_params *victim = NULL;
+		unsigned sub_size = 0;
+
+		progress_lock();
+		for (;;) {
+			for (i = 0; !target && i < delta_search_threads; i++)
+				if (!p[i].working)
+					target = &p[i];
+			if (target)
+				break;
+			pthread_cond_wait(&progress_cond, &progress_mutex);
+		}
+
+		for (i = 0; i < delta_search_threads; i++)
+			if (p[i].remaining > 2*window &&
+			    (!victim || victim->remaining < p[i].remaining))
+				victim = &p[i];
+		if (victim) {
+			sub_size = victim->remaining / 2;
+			target->regions = victim->regions + victim->remaining - sub_size;
+			victim->list_size -= sub_size;
+			victim->remaining -= sub_size;
+		}
+		target->list_size = sub_size;
+		target->remaining = sub_size;
+		target->working = 1;
+		progress_unlock();
+
+		pthread_mutex_lock(&target->mutex);
+		target->data_ready = 1;
+		pthread_cond_signal(&target->cond);
+		pthread_mutex_unlock(&target->mutex);
+
+		if (!sub_size) {
+			pthread_join(target->thread, NULL);
+			pthread_cond_destroy(&target->cond);
+			pthread_mutex_destroy(&target->mutex);
+			active_threads--;
+		}
+	}
+	cleanup_threaded_search();
+	free(p);
+
+	display_progress(progress_state, progress_nr);
+	stop_progress(&progress_state);
+}
+
 static void prepare_pack(int window, int depth)
 {
 	struct object_entry **delta_list;
@@ -3303,8 +3462,8 @@ static void prepare_pack(int window, int depth)
 		return;
 
 	if (path_walk)
-		find_deltas_by_region(to_pack.objects, to_pack.regions,
-				      0, to_pack.nr_regions);
+		ll_find_deltas_by_region(to_pack.objects, to_pack.regions,
+					 0, to_pack.nr_regions);
 
 	ALLOC_ARRAY(delta_list, to_pack.nr_objects);
 	nr_deltas = n = 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 12/13] path-walk: add new 'edge_aggressive' option
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10  1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In preparation for allowing both the --shallow and --path-walk options
in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
in the path-walk API. This option will help walk the boundary more
thoroughly and help avoid sending extra objects during fetches and
pushes.

The only use of the 'edge_hint_aggressive' option in the revision API is
within mark_edges_uninteresting(), which is usually called before
between prepare_revision_walk() and before visiting commits with
get_revision(). In prepare_revision_walk(), the UNINTERESTING commits
are walked until a boundary is found.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.adoc |  8 ++++++++
 path-walk.c                                |  6 +++++-
 path-walk.h                                |  7 +++++++
 t/helper/test-path-walk.c                  |  2 ++
 t/t6601-path-walk.sh                       | 20 ++++++++++++++++++++
 5 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/api-path-walk.adoc b/Documentation/technical/api-path-walk.adoc
index e522695dd9f..34c905eb9c3 100644
--- a/Documentation/technical/api-path-walk.adoc
+++ b/Documentation/technical/api-path-walk.adoc
@@ -56,6 +56,14 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.
 
+`edge_aggressive`::
+	For performance reasons, usually only the boundary commits are
+	explored to find UNINTERESTING objects. However, in the case of
+	shallow clones it can be helpful to mark all trees and blobs
+	reachable from UNINTERESTING tip commits as UNINTERESTING. This
+	matches the behavior of `--objects-edge-aggressive` in the
+	revision API.
+
 `pl`::
 	This pattern list pointer allows focusing the path-walk search to
 	a set of patterns, only emitting paths that match the given
diff --git a/path-walk.c b/path-walk.c
index 341bdd2ba4e..2d4ddbadd50 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -503,7 +503,11 @@ int walk_objects_by_path(struct path_walk_info *info)
 	if (prepare_revision_walk(info->revs))
 		die(_("failed to setup revision walk"));
 
-	/* Walk trees to mark them as UNINTERESTING. */
+	/*
+	 * Walk trees to mark them as UNINTERESTING.
+	 * This is particularly important when 'edge_aggressive' is set.
+	 */
+	info->revs->edge_hint_aggressive = info->edge_aggressive;
 	edge_repo = info->revs->repo;
 	edge_tree_list = root_tree_list;
 	mark_edges_uninteresting(info->revs, show_edge,
diff --git a/path-walk.h b/path-walk.h
index 473ee9d361c..5ef5a8440e6 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -50,6 +50,13 @@ struct path_walk_info {
 	 */
 	int prune_all_uninteresting;
 
+	/**
+	 * When 'edge_aggressive' is set, then the revision walk will use
+	 * the '--object-edge-aggressive' option to mark even more objects
+	 * as uninteresting.
+	 */
+	int edge_aggressive;
+
 	/**
 	 * Specify a sparse-checkout definition to match our paths to. Do not
 	 * walk outside of this sparse definition. If the patterns are in
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 61e845e5ec2..fe63002c2be 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -82,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tree objects")),
 		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
 			 N_("toggle pruning of uninteresting paths")),
+		OPT_BOOL(0, "edge-aggressive", &info.edge_aggressive,
+			 N_("toggle aggressive edge walk")),
 		OPT_BOOL(0, "stdin-pl", &stdin_pl,
 			 N_("read a pattern list over stdin")),
 		OPT_END(),
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index c89b0f1e19d..785c2f22373 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -378,6 +378,26 @@ test_expect_success 'topic, not base, boundary with pruning' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'topic, not base, --edge-aggressive with pruning' '
+	test-tool path-walk --prune --edge-aggressive -- topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base^{tree}):UNINTERESTING
+	2:tree:right/:$(git rev-parse topic:right)
+	2:tree:right/:$(git rev-parse base:right):UNINTERESTING
+	3:blob:right/c:$(git rev-parse base:right/c):UNINTERESTING
+	3:blob:right/c:$(git rev-parse topic:right/c)
+	blobs:2
+	commits:1
+	tags:0
+	trees:4
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'trees are reported exactly once' '
 	test_when_finished "rm -rf unique-trees" &&
 	test_create_repo unique-trees &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 13/13] pack-objects: allow --shallow and --path-walk
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
@ 2025-03-10  1:50 ` Derrick Stolee via GitGitGadget
  2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  14 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-10  1:50 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.

However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.

The recently-added tests in t5538-push-shallow.sh help to confirm this
behavior is working with the --path-walk option if
GIT_TEST_PACK_PATH_WALK is enabled. These tests passed previously due to
the --path-walk feature being disabled in the presence of a shallow
clone.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e8b1b057ec3..4bd943728b4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -209,6 +209,7 @@ static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
 static int cruft;
+static int shallow = 0;
 static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
@@ -4483,6 +4484,7 @@ static void get_object_list_path_walk(struct rev_info *revs)
 	 * base objects.
 	 */
 	info.prune_all_uninteresting = sparse;
+	info.edge_aggressive = shallow;
 
 	if (walk_objects_by_path(&info))
 		die(_("failed to pack objects via path-walk"));
@@ -4684,7 +4686,6 @@ int cmd_pack_objects(int argc,
 		     struct repository *repo UNUSED)
 {
 	int use_internal_rev_list = 0;
-	int shallow = 0;
 	int all_progress_implied = 0;
 	struct strvec rp = STRVEC_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
@@ -4872,10 +4873,6 @@ int cmd_pack_objects(int argc,
 		warning(_("cannot use delta islands with --path-walk"));
 		path_walk = 0;
 	}
-	if (path_walk && shallow) {
-		warning(_("cannot use --shallow with --path-walk"));
-		path_walk = 0;
-	}
 	if (path_walk) {
 		strvec_push(&rp, "--boundary");
 		 /*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2025-03-10  1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
@ 2025-03-10 17:28 ` Junio C Hamano
  2025-03-12 20:47   ` Taylor Blau
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  14 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2025-03-10 17:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> ... deltas across path boundaries. This second pass is much faster than a fresh
> pass since the existing deltas are used as a limit for the size of
> potentially new deltas, short-circuiting the checks when the delta size
> exceeds the current-best.

Very nice.

> The microsoft/fluentui is a public Javascript repo that suffers from many of
> the name hash collisions as internal repositories I've worked with. Here is
> a comparison of the compressed size and end-to-end time of the repack:
>
> Repack Method    Pack Size       Time
> ---------------------------------------
> Hash v1             439.4M      87.24s
> Hash v2             161.7M      21.51s
> Path Walk           142.5M      28.16s
>
>
> Less dramatic, but perhaps more standardly structured is the nodejs/node
> repository, with these stats:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                739.9M      71.18s
> Hash v2                764.6M      67.82s
> Path Walk              698.0M      75.10s
>
>
> Even the Linux kernel repository gains some benefits, even though the number
> of hash collisions is relatively low due to a preference for short
> filenames:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                  2.5G     554.41s
> Hash v2                  2.5G     549.62s
> Path Walk                2.2G     559.00s

This third one, v2 not performing much better than v1, is quite
surprising.

> The drawbacks of the --path-walk feature is that it will be harder to
> integrate it with bitmap features, specifically delta islands. This is not
> insurmountable, but would require more work, such as a revision walk to
> paint objects with reachability information before using that during delta
> computations.
>
> However, there should still be significant benefits to Git clients trying to
> save space and improve local performance.

Sure.  More experiments and more approaches will eventually give us
overall improvement.  I am hoping that we will be able to condense
the result of these different approaches and their combinations into
easy-to-choose-from canned choices (as opposed to a myriad of little
knobs the users need to futz with without really understanding what
they are tweaking).

> This feature was shipped with similar features in microsoft/git as of
> v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
> that had significant repository growth due to constructing a batch of
> beachball [5] CHANGELOG.[md|json] files and pushing them to a release
> branch. These pushes were frequently 70-200 MB due to poor delta
> compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
> in size by 100x while improving performance. Since these CI machines were
> working with a shallow clone, the 'edge_aggressive' changes were required to
> enable the path-walk option.

Nice, thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
  2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
@ 2025-03-12 20:47   ` Taylor Blau
  2025-03-20 20:18     ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Taylor Blau @ 2025-03-12 20:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, christian.couder,
	johannes.schindelin, johncai86, jonathantanmy, karthik.188,
	kristofferhaugsbakk, newren, peff, ps, Derrick Stolee

On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > ... deltas across path boundaries. This second pass is much faster than a fresh
> > pass since the existing deltas are used as a limit for the size of
> > potentially new deltas, short-circuiting the checks when the delta size
> > exceeds the current-best.
>
> Very nice.
>
> > The microsoft/fluentui is a public Javascript repo that suffers from many of
> > the name hash collisions as internal repositories I've worked with. Here is
> > a comparison of the compressed size and end-to-end time of the repack:
> >
> > Repack Method    Pack Size       Time
> > ---------------------------------------
> > Hash v1             439.4M      87.24s
> > Hash v2             161.7M      21.51s
> > Path Walk           142.5M      28.16s

OK, so microsoft/fluentui benefits from the path-walk approach in the
size of the resulting pack, but at the cost of additional time to
generate it.

> > Less dramatic, but perhaps more standardly structured is the nodejs/node
> > repository, with these stats:
> >
> > Repack Method       Pack Size       Time
> > ------------------------------------------
> > Hash v1                739.9M      71.18s
> > Hash v2                764.6M      67.82s
> > Path Walk              698.0M      75.10s

Same here.

> > Even the Linux kernel repository gains some benefits, even though the number
> > of hash collisions is relatively low due to a preference for short
> > filenames:
> >
> > Repack Method       Pack Size       Time
> > ------------------------------------------
> > Hash v1                  2.5G     554.41s
> > Hash v2                  2.5G     549.62s
> > Path Walk                2.2G     559.00s

OK, so here the savings are a little more substantial, and the
performance hit isn't too bad.

> This third one, v2 not performing much better than v1, is quite
> surprising.

I'm not sure... I think Stolee's "the number of hash collisions is
relatively low due to preference for short filenames" is why v2 behaves
so similarly to v1 here.

> > The drawbacks of the --path-walk feature is that it will be harder to
> > integrate it with bitmap features, specifically delta islands. This is not
> > insurmountable, but would require more work, such as a revision walk to
> > paint objects with reachability information before using that during delta
> > computations.
> >
> > However, there should still be significant benefits to Git clients trying to
> > save space and improve local performance.
>
> Sure.  More experiments and more approaches will eventually give us
> overall improvement.  I am hoping that we will be able to condense
> the result of these different approaches and their combinations into
> easy-to-choose-from canned choices (as opposed to a myriad of little
> knobs the users need to futz with without really understanding what
> they are tweaking).

In the above three examples we see some trade-offs between pack size and
the time it took to generate it. I think it's worth discussing whether
or not the potential benefit of such a trade-off is worth the
significant complexity and code that this feature will introduce. (To be
clear, I don't have a strong opinion here one way or the other, but I do
think that it's at least worth discussing).

I wonder how much of the benefits of path-walk over the hash v2 approach
could be had by simply widening the pack.window during delta selection?

I tried to run a similar experiment as you did above on the
microsoft/fluentui repository and got the following:

    Repack Method       Pack Size       Time
    ------------------------------------------
    Hash v1              447.2MiB      932.41s
    Hash v2              154.1MiB      404.35s
    Hash v2 (window=20)  146.7MiB      472.66s
    Hash v2 (window=50)  138.3MiB      622.13s
    Path Walk            140.8MiB      168.86s

In your experiment above on the same repository, the path walk feature
represents an 11.873% reduction in pack size, but at the cost of a 30.9%
regression in runtime.

When I set pack.window to "50" (over the default value of "10"), I get a
~10.3% reduction in pack size at the cost of a 54% increase in runtime
(relative to just --name-hash-version=2 with the default pack.window
settings).

But when I set the pack.window to "20", the relative values (again
comparing against --name-hash-version=2 with the default pack.window)
are 4.8% reduction in pack size and a 16.9% increase in runtime.

But these numbers are pretty confusing to me, TBH. The reduction in pack
sizes makes sense, and here I see numbers that are on-par with what you
noted above for the same repository. But the runtimes are wildly
different (e.g., hash v1 takes you just 87s while mine takes 932s).

There must be something in our environment that is different. I'm
starting with a bare clone of microsoft/fluentui from GitHub, and made
several 'cp -al' copies of it for the different experiments. In the
penultimate one, I ran:

    $ time git.compile -c pack.window=50 repack --name-hash-version=2 \
        -adF --no-write-bitmap-index

, and similarly for the other experiments with appropriate values for
pack.window, --name-hash-version, and --path-walk, when applicable. All
of this was done on a -O2 build of Git with your patches on top.

So I'm not sure what to make of these results. Clearly on my machine
something is different that makes path-walk much faster than hash v2.
But on your machine it's slower, so I don't know how much I trust the
timing results from either machine.

In any event, it seems like at least in this example we can get
performance that is on-par with path-walk by simply widening the
pack.window when using hash v2. On my machine that seems to cost more
time than it does for you to the point where it's slower than my
path-walk. But I think I need to understand what the differences are
here before we can draw any conclusions on the size or timing.

If the overwhelming majority of cases where the --path-walk feature
presents a significant benefit over hash v2 at various pack.window sizes
(where we could get approximately the same reduction in pack size with
approximately the same end-to-end runtime of 'git repack'), then I feel
we might want to reconsider whether or not the complexity of this feature
is worthwhile.

But if the --path-walk feature either gives us a significant size
benefit that we can't get with hash v2 and a wider pack.window without
paying a significant runtime cost (or vice-versa), then this feature
would indeed be worthwhile.

I also have no idea how representative the above is of your intended
use-case, which seems much more oriented around pushes than from-scratch
repacks, which would also affect our conclusions here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/13] pack-objects: extract should_attempt_deltas()
  2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
@ 2025-03-12 21:01   ` Taylor Blau
  2025-03-20 19:48     ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Taylor Blau @ 2025-03-12 21:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps,
	Derrick Stolee

On Mon, Mar 10, 2025 at 01:50:43AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> This will be helpful in a future change, which will reuse this logic.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  builtin/pack-objects.c | 53 +++++++++++++++++++++++-------------------
>  1 file changed, 29 insertions(+), 24 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 58a9b161262..1d0992a8dac 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3196,6 +3196,33 @@ static int add_ref_tag(const char *tag UNUSED, const char *referent UNUSED, cons
>  	return 0;
>  }
>
> +static int should_attempt_deltas(struct object_entry *entry)
> +{
> +	if (DELTA(entry))
> +		return 0;
> +
> +	if (!entry->type_valid ||
> +	    oe_size_less_than(&to_pack, entry, 50))
> +		return 0;
> +
> +	if (entry->no_try_delta)
> +		return 0;
> +
> +	if (!entry->preferred_base) {
> +		if (oe_type(entry) < 0)
> +			die(_("unable to get type of object %s"),
> +				oid_to_hex(&entry->idx.oid));
> +	} else if (oe_type(entry) < 0) {
> +		/*
> +		 * This object is not found, but we
> +		 * don't have to include it anyway.
> +		 */
> +		return 0;
> +	}
> +
> +	return 1;
> +}
> +
>  static void prepare_pack(int window, int depth)
>  {
>  	struct object_entry **delta_list;
> @@ -3226,33 +3253,11 @@ static void prepare_pack(int window, int depth)
>  	for (i = 0; i < to_pack.nr_objects; i++) {
>  		struct object_entry *entry = to_pack.objects + i;
>
> -		if (DELTA(entry))
> -			/* This happens if we decided to reuse existing
> -			 * delta from a pack.  "reuse_delta &&" is implied.
> -			 */

It looks like this comment went away when this part of prepare_pack()
was extracted into should_attempt_deltas().

> -			continue;
> -
> -		if (!entry->type_valid ||
> -		    oe_size_less_than(&to_pack, entry, 50))
> +		if (!should_attempt_deltas(entry))
>  			continue;
>
> -		if (entry->no_try_delta)
> -			continue;
> -
> -		if (!entry->preferred_base) {
> +		if (!entry->preferred_base)
>  			nr_deltas++;

Makes sense; should_attempt_deltas() doesn't itself change nr_deltas, so
we want to do it ourselves here. Looking good!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 02/13] pack-objects: add --path-walk option
  2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
@ 2025-03-12 21:14   ` Taylor Blau
  2025-03-20 19:46     ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Taylor Blau @ 2025-03-12 21:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps,
	Derrick Stolee

On Mon, Mar 10, 2025 at 01:50:44AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> In order to more easily compute delta bases among objects that appear at
> the exact same path, add a --path-walk option to 'git pack-objects'.
>
> This option will use the path-walk API instead of the object walk given
> by the revision machinery. Since objects will be provided in batches
> representing a common path, those objects can be tested for delta bases
> immediately instead of waiting for a sort of the full object list by
> name-hash. This has multiple benefits, including avoiding collisions by
> name-hash.
>
> The objects marked as UNINTERESTING are included in these batches, so we
> are guaranteeing some locality to find good delta bases.
>
> After the individual passes are done on a per-path basis, the default
> name-hash is used to find other opportunistic delta bases that did not
> match exactly by the full path name.
>
> The current implementation performs delta calculations while walking
> objects, which is not ideal for a few reasons. First, this will cause
> the "Enumerating objects" phase to be much longer than usual. Second, it
> does not take advantage of threading during the path-scoped delta
> calculations. Even with this lack of threading, the path-walk option is
> sometimes faster than the usual approach. Future changes will refactor
> this code to allow for threading, but that complexity is deferred until
> later to keep this patch as simple as possible.
>
> This new walk is incompatible with some features and is ignored by
> others:
>
>  * Object filters are not currently integrated with the path-walk API,
>    such as sparse-checkout or tree depth. A blobless packfile could be
>    integrated easily, but that is deferred for later.
>
>  * Server-focused features such as delta islands, shallow packs, and
>    using a bitmap index are incompatible with the path-walk API.
>
>  * The path walk API is only compatible with the --revs option, not
>    taking object lists or pack lists over stdin. These alternative ways
>    to specify the objects currently ignores the --path-walk option
>    without even a warning.
>
> Future changes will create performance tests that demonstrate the power
> of this approach.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  Documentation/git-pack-objects.adoc        |  13 +-
>  Documentation/technical/api-path-walk.adoc |   1 +
>  builtin/pack-objects.c                     | 147 +++++++++++++++++++--
>  t/t5300-pack-object.sh                     |  15 +++
>  4 files changed, 166 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
> index 7f69ae4855f..7dbbe6d54d2 100644
> --- a/Documentation/git-pack-objects.adoc
> +++ b/Documentation/git-pack-objects.adoc
> @@ -16,7 +16,7 @@ SYNOPSIS
>  	[--cruft] [--cruft-expiration=<time>]
>  	[--stdout [--filter=<filter-spec>] | <base-name>]
>  	[--shallow] [--keep-true-parents] [--[no-]sparse]
> -	[--name-hash-version=<n>] < <object-list>
> +	[--name-hash-version=<n>] [--path-walk] < <object-list>
>
>
>  DESCRIPTION
> @@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed
>  when writing reachability bitmap files with `--write-bitmap-index` and it
>  will be automatically changed to version `1`.
>
> +--path-walk::
> +	By default, `git pack-objects` walks objects in an order that
> +	presents trees and blobs in an order unrelated to the path they
> +	appear relative to a commit's root tree. The `--path-walk` option
> +	enables a different walking algorithm that organizes trees and
> +	blobs by path. This has the potential to improve delta compression
> +	especially in the presence of filenames that cause collisions in
> +	Git's default name-hash algorithm. Due to changing how the objects
> +	are walked, this option is not compatible with `--delta-islands`,
> +	`--shallow`, or `--filter`.

I think from reading further below that this feature is somewhat
incompatible with --use-bitmap-index, at least in the sense that we
implicitly disable the latter whenever we see the former. Would that be
worth mentioning here?

> +static int add_objects_by_path(const char *path,
> +			       struct oid_array *oids,
> +			       enum object_type type,
> +			       void *data)
> +{
> +	struct object_entry **delta_list;
> +	size_t oe_start = to_pack.nr_objects;
> +	size_t oe_end;
> +	unsigned int sub_list_size;
> +	unsigned int *processed = data;
> +
> +	/*
> +	 * First, add all objects to the packing data, including the ones
> +	 * marked UNINTERESTING (translated to 'exclude') as they can be
> +	 * used as delta bases.
> +	 */
> +	for (size_t i = 0; i < oids->nr; i++) {
> +		int exclude;
> +		struct object_info oi = OBJECT_INFO_INIT;
> +		struct object_id *oid = &oids->oid[i];
> +
> +		/* Skip objects that do not exist locally. */
> +		if (exclude_promisor_objects &&
> +		    oid_object_info_extended(the_repository, oid, &oi,
> +					     OBJECT_INFO_FOR_PREFETCH) < 0)
> +			continue;
> +
> +		exclude = !is_oid_interesting(the_repository, oid);
> +
> +		if (exclude && !thin)
> +			continue;
> +
> +		add_object_entry(oid, type, path, exclude);
> +	}
> +
> +	oe_end = to_pack.nr_objects;
> +
> +	/* We can skip delta calculations if it is a no-op. */
> +	if (oe_end == oe_start || !window)
> +		return 0;
> +
> +	sub_list_size = 0;
> +	ALLOC_ARRAY(delta_list, oe_end - oe_start);

Makes sense, and seems all reasonable.

> +	for (size_t i = 0; i < oe_end - oe_start; i++) {
> +		struct object_entry *entry = to_pack.objects + oe_start + i;
> +
> +		if (!should_attempt_deltas(entry))
> +			continue;
> +
> +		delta_list[sub_list_size++] = entry;
> +	}
> +
> +	/*
> +	 * Find delta bases among this list of objects that all match the same
> +	 * path. This causes the delta compression to be interleaved in the
> +	 * object walk, which can lead to confusing progress indicators. This is
> +	 * also incompatible with threaded delta calculations. In the future,
> +	 * consider creating a list of regions in the full to_pack.objects array
> +	 * that could be picked up by the threaded delta computation.
> +	 */
> +	if (sub_list_size && window) {
> +		QSORT(delta_list, sub_list_size, type_size_sort);
> +		find_deltas(delta_list, &sub_list_size, window, depth, processed);
> +	}

Interesting. I like the "regions in to_pack.objects" idea as a way of
threading the delta selection process in the future.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 03/13] pack-objects: update usage to match docs
  2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
@ 2025-03-12 21:14   ` Taylor Blau
  0 siblings, 0 replies; 38+ messages in thread
From: Taylor Blau @ 2025-03-12 21:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps,
	Derrick Stolee

On Mon, Mar 10, 2025 at 01:50:45AM +0000, Derrick Stolee via GitGitGadget wrote:
> ---
>  Documentation/git-pack-objects.adoc | 14 +++++++-------
>  builtin/pack-objects.c              | 10 ++++++++--
>  t/t0450/adoc-help-mismatches        |  1 -
>  3 files changed, 15 insertions(+), 10 deletions(-)

Thanks for cleaning these up.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] pack-objects: refactor path-walk delta phase
  2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
@ 2025-03-12 21:21   ` Taylor Blau
  2025-03-20 19:57     ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Taylor Blau @ 2025-03-12 21:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps,
	Derrick Stolee

On Mon, Mar 10, 2025 at 01:50:52AM +0000, Derrick Stolee via GitGitGadget wrote:
> ---
>  builtin/pack-objects.c       | 82 +++++++++++++++++++++++++-----------
>  pack-objects.h               | 12 ++++++
>  t/t5300-pack-object.sh       |  8 +++-
>  t/t5530-upload-pack-error.sh |  6 ---
>  4 files changed, 75 insertions(+), 33 deletions(-)

Ah, nice :-). This is where you implement the idea that you were
mentioning back in

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index c756ce50dd7..c5a3129c88e 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3233,6 +3233,51 @@ static int should_attempt_deltas(struct object_entry *entry)
>  	return 1;
>  }
>
> +static void find_deltas_for_region(struct object_entry *list UNUSED,

Interesting, it looks like "list" here is UNUSED in this patch. On first
read I figured that you were going to make use of it in later patches,
but it looks like it remains UNUSED at the tip of my local copy of this
series.

Is this a stray change from when you were writing this? Something else?

> +				   struct packing_region *region,
> +				   unsigned int *processed)
> +{
> +	struct object_entry **delta_list;
> +	uint32_t delta_list_nr = 0;

I know that we have a lot of "_nr" and "_alloc" variables in the
pack-objects code that use uint32_t as their type, but we should prefer
size_t for these in the future, even if it breaks the existing pattern.

As much as I can grok of the implementation through the rest of the
patch makes sense to me and looks good.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 02/13] pack-objects: add --path-walk option
  2025-03-12 21:14   ` Taylor Blau
@ 2025-03-20 19:46     ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2025-03-20 19:46 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps

On 3/12/2025 5:14 PM, Taylor Blau wrote:
> On Mon, Mar 10, 2025 at 01:50:44AM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <stolee@gmail.com>

>> +--path-walk::
>> +	By default, `git pack-objects` walks objects in an order that
>> +	presents trees and blobs in an order unrelated to the path they
>> +	appear relative to a commit's root tree. The `--path-walk` option
>> +	enables a different walking algorithm that organizes trees and
>> +	blobs by path. This has the potential to improve delta compression
>> +	especially in the presence of filenames that cause collisions in
>> +	Git's default name-hash algorithm. Due to changing how the objects
>> +	are walked, this option is not compatible with `--delta-islands`,
>> +	`--shallow`, or `--filter`.
> 
> I think from reading further below that this feature is somewhat
> incompatible with --use-bitmap-index, at least in the sense that we
> implicitly disable the latter whenever we see the former. Would that be
> worth mentioning here?

While it is not incompatible and does not even include a warning, the
--use-bitmap-index option gets silently ignored. This matches how
--shallow does similar things. I'll still add to this doc as I agree
it would be helpful.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/13] pack-objects: extract should_attempt_deltas()
  2025-03-12 21:01   ` Taylor Blau
@ 2025-03-20 19:48     ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2025-03-20 19:48 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps


On 3/12/2025 5:01 PM, Taylor Blau wrote:
> On Mon, Mar 10, 2025 at 01:50:43AM +0000, Derrick Stolee via GitGitGadget wrote:

>> +static int should_attempt_deltas(struct object_entry *entry)
>> +{
>> +	if (DELTA(entry))
>> +		return 0;
>> +
...>> -		if (DELTA(entry))
>> -			/* This happens if we decided to reuse existing
>> -			 * delta from a pack.  "reuse_delta &&" is implied.
>> -			 */
> 
> It looks like this comment went away when this part of prepare_pack()
> was extracted into should_attempt_deltas().
> 
>> -			continue;

Good catch. 

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] pack-objects: refactor path-walk delta phase
  2025-03-12 21:21   ` Taylor Blau
@ 2025-03-20 19:57     ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2025-03-20 19:57 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, newren, peff, ps

On 3/12/2025 5:21 PM, Taylor Blau wrote:
> On Mon, Mar 10, 2025 at 01:50:52AM +0000, Derrick Stolee via GitGitGadget wrote:

>> +static void find_deltas_for_region(struct object_entry *list UNUSED,
> 
> Interesting, it looks like "list" here is UNUSED in this patch. On first
> read I figured that you were going to make use of it in later patches,
> but it looks like it remains UNUSED at the tip of my local copy of this
> series.
> 
> Is this a stray change from when you were writing this? Something else?

Actually, the intention was to make this method less focused on globals,
but it in fact uses 'to_pack.objects' when it should use 'list'. Thanks
for the careful eye here. 
>> +				   struct packing_region *region,
>> +				   unsigned int *processed)
>> +{
>> +	struct object_entry **delta_list;
>> +	uint32_t delta_list_nr = 0;
> 
> I know that we have a lot of "_nr" and "_alloc" variables in the
> pack-objects code that use uint32_t as their type, but we should prefer
> size_t for these in the future, even if it breaks the existing pattern.
Unfortunately, the fact that these are used in the find_deltas() method
as an 'unsigned *' reference, these types need to be as they are. I'm
hesitant to use 64-bit integers that will eventually cast down to 32-bit
in these instances.

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
  2025-03-12 20:47   ` Taylor Blau
@ 2025-03-20 20:18     ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2025-03-20 20:18 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, christian.couder,
	johannes.schindelin, johncai86, jonathantanmy, karthik.188,
	kristofferhaugsbakk, newren, peff, ps

On 3/12/2025 4:47 PM, Taylor Blau wrote:
> On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> In the above three examples we see some trade-offs between pack size and
> the time it took to generate it. I think it's worth discussing whether
> or not the potential benefit of such a trade-off is worth the
> significant complexity and code that this feature will introduce. (To be
> clear, I don't have a strong opinion here one way or the other, but I do
> think that it's at least worth discussing).
> 
> I wonder how much of the benefits of path-walk over the hash v2 approach
> could be had by simply widening the pack.window during delta selection?
> 
> I tried to run a similar experiment as you did above on the
> microsoft/fluentui repository and got the following:
> 
>     Repack Method       Pack Size       Time
>     ------------------------------------------
>     Hash v1              447.2MiB      932.41s
>     Hash v2              154.1MiB      404.35s
>     Hash v2 (window=20)  146.7MiB      472.66s
>     Hash v2 (window=50)  138.3MiB      622.13s
>     Path Walk            140.8MiB      168.86s
> 
> In your experiment above on the same repository, the path walk feature
> represents an 11.873% reduction in pack size, but at the cost of a 30.9%
> regression in runtime.
> 
> When I set pack.window to "50" (over the default value of "10"), I get a
> ~10.3% reduction in pack size at the cost of a 54% increase in runtime
> (relative to just --name-hash-version=2 with the default pack.window
> settings).
> 
> But when I set the pack.window to "20", the relative values (again
> comparing against --name-hash-version=2 with the default pack.window)
> are 4.8% reduction in pack size and a 16.9% increase in runtime.

You're right that I wasn't including data around the --window option in
my analysis. This option presents folks with the opportunity to add CPU
time in order to improve the possibility of better compression due to
considering more object pairs.

But it's also important to note that that option still works with
--path-walk, except that the --path-walk option is focused on improving
the quality of objects being considered within a window. There's also the
aspect that there are two passes (one path-based and one name-hash-based)
so increasing the --window size has a larger impact on the --path-walk
option.

With regards to the microsoft/fluentui repo, I had previously been using
an old clone using --bare. The size changes if I use --mirror as well,
since it will get the fork hint refs corresponding to objects in public
forks that are not actually in the core repo. This changes the clone size
as well as the repacked size.

(To save time, I didn't repeat the --window option tests for name hash
v1 as name hash v2 is clearly superior to that option in this repo.)

Cloned with --bare:

| Type         | Window: 10     | Window: 20     | Window: 50     |
|--------------|----------------|----------------|----------------|
| name hash v1 | 451 M | 1m 42s |       |        |       |        |
| name hash v2 | 160 M | 35.4 s | 151 M | 25.4 s | 141 M | 31.0 s |
| --path-walk  | 141 M | 31.0 s | 136 M | 35.7 s | 129 M | 49.3 s |

Cloned with --mirror:

| Type         | Window: 10     | Window: 20     | Window: 50      |
|--------------|----------------|----------------|-----------------|
| name hash v1 | 882 M | 3m 27s |       |        |       |         |
| name hash v2 | 584 M | 70.4 s | 554 M | 54.6 s | 530 M | 69.4 s  |
| --path-walk  | 548 M | 79.8 s | 523 M | 93.9 s | 507 M | 126.2 s |

Running on a slightly-larger Javascript repo with the same CHANGLOG
filename issue, I get these results:

| Type         | Window: 10     | Window: 20     | Window: 50     |
|--------------|----------------|----------------|----------------|
| name hash v1 | 6.4 G | 36m 9s |       |        |       |        |
| name hash v2 | 920 M | 7m 39s | 767 M | 5m 49s | 665 M | 6m 12s |
| --path-walk  | 834 M | 4m 48s | 697 M | 7m 39s | 615 M | 8m 42s |

> But these numbers are pretty confusing to me, TBH. The reduction in pack
> sizes makes sense, and here I see numbers that are on-par with what you
> noted above for the same repository. But the runtimes are wildly
> different (e.g., hash v1 takes you just 87s while mine takes 932s).

I wonder if it's related to threading? I'm using as many cores as I can.

> There must be something in our environment that is different. I'm
> starting with a bare clone of microsoft/fluentui from GitHub, and made
> several 'cp -al' copies of it for the different experiments. In the
> penultimate one, I ran:
> 
>     $ time git.compile -c pack.window=50 repack --name-hash-version=2 \
>         -adF --no-write-bitmap-index

There's also some strange things with my numbers because I'm not copying
the same data into multiple places but instead running the test on the
same repo. Thus, the "input size" is changing with each run and this is
probably a big factor in the larger tests.

So the order in my tables is left-to-right, top-to-bottom, like reading
a page in English. Thus, the short time for --path-walk --window=10 in
the last example is maybe a bit faster because it is starting from the
665 M from the --name-hash-version=2 --window=50 example. 
> In any event, it seems like at least in this example we can get
> performance that is on-par with path-walk by simply widening the
> pack.window when using hash v2. On my machine that seems to cost more
> time than it does for you to the point where it's slower than my
> path-walk. But I think I need to understand what the differences are
> here before we can draw any conclusions on the size or timing.

I'd be very curious to see if more folks have bandwidth to do similar
testing. My default mode is that I like giving users more options to
explore which may work better for them. 
> If the overwhelming majority of cases where the --path-walk feature
> presents a significant benefit over hash v2 at various pack.window sizes
> (where we could get approximately the same reduction in pack size with
> approximately the same end-to-end runtime of 'git repack'), then I feel
> we might want to reconsider whether or not the complexity of this feature
> is worthwhile.
> 
> But if the --path-walk feature either gives us a significant size
> benefit that we can't get with hash v2 and a wider pack.window without
> paying a significant runtime cost (or vice-versa), then this feature
> would indeed be worthwhile.
> 
> I also have no idea how representative the above is of your intended
> use-case, which seems much more oriented around pushes than from-scratch
> repacks, which would also affect our conclusions here.The push story is valuable, but I'm also interested in helping users shrink their local repositories in whatever means they are
willing to wait for.

---

Meta-response to your patch review: I have made adjustments to my local
branch in response to the points you brought up. I'll hold off on v2
for a few more days to give more opportunity for review.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
  2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
@ 2025-03-24 15:22 ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
                     ` (12 more replies)
  14 siblings, 13 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee

Here is a full submission of the --path-walk feature for 'git pack-objects'
and 'git repack'. It's been discussed in an RFC [1], as a future application
for the path walk API [2], and is updated now that --name-hash-version=2
exists (as a replacement for the --full-name-hash option from the RFC) [3].

[1]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/

[2]
https://lore.kernel.org/git/pull.1818.git.1730356023.gitgitgadget@gmail.com

[3]
https://lore.kernel.org/git/pull.1813.git.1728396723.gitgitgadget@gmail.com

This patch series does the following:

 1. Add a new '--path-walk' option to 'git pack-objects' that uses the
    path-walk API instead of the revision API to collect objects for delta
    compression.

 2. Add a new '--path-walk' option to 'git repack' to pass this option along
    to 'git pack-objects'.

 3. Add a new 'pack.usePathWalk' config option to opt into this option
    implicitly, such as in 'git push'.

 4. Optimize the '--path-walk' option using threading so it better competes
    with the existing multi-threaded delta compression mechanism.

 5. Update the path-walk API with a new 'edge_aggressive' option that pairs
    close to the --edge-aggressive option in the revision API. This is
    useful when creating thin packs inside shallow clones.

This feature works by using the path-walk API to emit groups of objects that
appear at the same path. These groups are tracked so they can be tested for
delta compression with each other, and then after those groups are tested a
second pass using the name-hash attempts to find better (or first time)
deltas across path boundaries. This second pass is much faster than a fresh
pass since the existing deltas are used as a limit for the size of
potentially new deltas, short-circuiting the checks when the delta size
exceeds the current-best.

The benefits of the --path-walk feature first come into play when the name
hash functions have many collisions, so sorting by name hash value leads to
unhelpful groupings of objects. Many of these benefits are improved by
--name-hash-version=2, but collisions still exist with any hash-based
approach. There are also performance benefits in some cases due to the
isolation of delta compression testing within path groups.

All of the benefits of the --path-walk feature are less dramatic when
compared to --name-hash-version=2, but they can still exist in many cases. I
have also seen some cases where --name-hash-version=2 compresses better than
--path-walk with --name-hash-version=1, but these options can be combined to
get the best of both worlds.

Detailed statistics are provided within patch messages, but a few are
highlighted here:

The microsoft/fluentui is a public Javascript repo that suffers from many of
the name hash collisions as internal repositories I've worked with. Here is
a comparison of the compressed size and end-to-end time of the repack:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      28.16s


Less dramatic, but perhaps more standardly structured is the nodejs/node
repository, with these stats:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk              698.0M      75.10s


Even the Linux kernel repository gains some benefits, even though the number
of hash collisions is relatively low due to a preference for short
filenames:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk                2.2G     559.00s


The drawbacks of the --path-walk feature is that it will be harder to
integrate it with bitmap features, specifically delta islands. This is not
insurmountable, but would require more work, such as a revision walk to
paint objects with reachability information before using that during delta
computations.

However, there should still be significant benefits to Git clients trying to
save space and improve local performance.

This feature was shipped with similar features in microsoft/git as of
v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
that had significant repository growth due to constructing a batch of
beachball [5] CHANGELOG.[md|json] files and pushing them to a release
branch. These pushes were frequently 70-200 MB due to poor delta
compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
in size by 100x while improving performance. Since these CI machines were
working with a shallow clone, the 'edge_aggressive' changes were required to
enable the path-walk option.

[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3

[5] https://github.com/microsoft/beachball


Updates in v2
=============

 * Re-added a dropped comment when moving code in patch 1.
 * Updated documentation to include interaction with --use-bitmap-index.
 * An UNUSED parameter is now used, reducing the use of global variables
   slightly.

Thanks, -Stolee

Derrick Stolee (13):
  pack-objects: extract should_attempt_deltas()
  pack-objects: add --path-walk option
  pack-objects: update usage to match docs
  p5313: add performance tests for --path-walk
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  t5538: add tests to confirm deltas in shallow pushes
  repack: add --path-walk option
  pack-objects: enable --path-walk via config
  scalar: enable path-walk during push via config
  pack-objects: refactor path-walk delta phase
  pack-objects: thread the path-based compression
  path-walk: add new 'edge_aggressive' option
  pack-objects: allow --shallow and --path-walk

 Documentation/config/feature.adoc          |   4 +
 Documentation/config/pack.adoc             |   8 +
 Documentation/git-pack-objects.adoc        |  26 +-
 Documentation/git-repack.adoc              |  14 +-
 Documentation/technical/api-path-walk.adoc |   9 +
 builtin/pack-objects.c                     | 414 +++++++++++++++++++--
 builtin/repack.c                           |   7 +-
 pack-objects.h                             |  12 +
 path-walk.c                                |   6 +-
 path-walk.h                                |   7 +
 repo-settings.c                            |   3 +
 repo-settings.h                            |   1 +
 scalar.c                                   |   1 +
 t/README                                   |   4 +
 t/helper/test-path-walk.c                  |   2 +
 t/perf/p5313-pack-objects.sh               |  37 +-
 t/t0411-clone-from-partial.sh              |   6 +
 t/t0450/adoc-help-mismatches               |   1 -
 t/t5300-pack-object.sh                     |  19 +
 t/t5306-pack-nobase.sh                     |   5 +
 t/t5310-pack-bitmaps.sh                    |  13 +-
 t/t5316-pack-delta-depth.sh                |   9 +-
 t/t5332-multi-pack-reuse.sh                |   7 +
 t/t5538-push-shallow.sh                    |  34 ++
 t/t6601-path-walk.sh                       |  20 +
 t/t7406-submodule-update.sh                |   3 +
 26 files changed, 605 insertions(+), 67 deletions(-)


base-commit: a36e024e989f4d35f35987a60e3af8022cac3420
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1819%2Fderrickstolee%2Fpath-walk-upstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1819/derrickstolee/path-walk-upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1819

Range-diff vs v1:

  1:  a2ed1f2d4e3 !  1:  57c1cc20de0 pack-objects: extract should_attempt_deltas()
     @@ builtin/pack-objects.c: static int add_ref_tag(const char *tag UNUSED, const cha
      +static int should_attempt_deltas(struct object_entry *entry)
      +{
      +	if (DELTA(entry))
     ++		/* This happens if we decided to reuse existing
     ++		 * delta from a pack. "reuse_delta &&" is implied.
     ++		 */
      +		return 0;
      +
      +	if (!entry->type_valid ||
  2:  9b31dc87bb6 !  2:  a271d6245c2 pack-objects: add --path-walk option
     @@ Documentation/git-pack-objects.adoc: many different directories. At the moment,
      +	especially in the presence of filenames that cause collisions in
      +	Git's default name-hash algorithm. Due to changing how the objects
      +	are walked, this option is not compatible with `--delta-islands`,
     -+	`--shallow`, or `--filter`.
     ++	`--shallow`, or `--filter`. The `--use-bitmap-index` option will
     ++	be ignored in the presence of `--path-walk.`
      +
       
       DELTA ISLANDS
  3:  bc678acb109 =  3:  dcff01392ff pack-objects: update usage to match docs
  4:  de848ebff74 =  4:  97a0b52ccee p5313: add performance tests for --path-walk
  5:  4bfd438b5fa =  5:  0d49bb3d30a pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  6:  cfee2136e92 =  6:  ddf804e606a t5538: add tests to confirm deltas in shallow pushes
  7:  1e75f068281 =  7:  11767e7653e repack: add --path-walk option
  8:  073cb44a47c =  8:  dd66a5b46f2 pack-objects: enable --path-walk via config
  9:  fd9a335329e =  9:  e5624c379d5 scalar: enable path-walk during push via config
 10:  c047fbd7f27 ! 10:  622439d7855 pack-objects: refactor path-walk delta phase
     @@ builtin/pack-objects.c: static int should_attempt_deltas(struct object_entry *en
       	return 1;
       }
       
     -+static void find_deltas_for_region(struct object_entry *list UNUSED,
     ++static void find_deltas_for_region(struct object_entry *list,
      +				   struct packing_region *region,
      +				   unsigned int *processed)
      +{
     @@ builtin/pack-objects.c: static int should_attempt_deltas(struct object_entry *en
      +
      +	ALLOC_ARRAY(delta_list, region->nr);
      +	for (uint32_t i = 0; i < region->nr; i++) {
     -+		struct object_entry *entry = to_pack.objects + region->start + i;
     ++		struct object_entry *entry = list + region->start + i;
      +		if (should_attempt_deltas(entry))
      +			delta_list[delta_list_nr++] = entry;
      +	}
     @@ pack-objects.h: struct object_entry {
       	uint32_t nr_objects, nr_alloc;
       
      +	struct packing_region *regions;
     -+	uint32_t nr_regions, nr_regions_alloc;
     ++	uint64_t nr_regions, nr_regions_alloc;
      +
       	int32_t *index;
       	uint32_t index_size;
 11:  cf990b90b34 = 11:  ae73d26319a pack-objects: thread the path-based compression
 12:  bc9383ef11c = 12:  c45f35c204e path-walk: add new 'edge_aggressive' option
 13:  2eb92507213 = 13:  d5484ebd942 pack-objects: allow --shallow and --path-walk

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 01/13] pack-objects: extract should_attempt_deltas()
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

This will be helpful in a future change, which will reuse this logic.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c | 56 ++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b161262..7805429f5d1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3196,6 +3196,36 @@ static int add_ref_tag(const char *tag UNUSED, const char *referent UNUSED, cons
 	return 0;
 }
 
+static int should_attempt_deltas(struct object_entry *entry)
+{
+	if (DELTA(entry))
+		/* This happens if we decided to reuse existing
+		 * delta from a pack. "reuse_delta &&" is implied.
+		 */
+		return 0;
+
+	if (!entry->type_valid ||
+	    oe_size_less_than(&to_pack, entry, 50))
+		return 0;
+
+	if (entry->no_try_delta)
+		return 0;
+
+	if (!entry->preferred_base) {
+		if (oe_type(entry) < 0)
+			die(_("unable to get type of object %s"),
+				oid_to_hex(&entry->idx.oid));
+	} else if (oe_type(entry) < 0) {
+		/*
+		 * This object is not found, but we
+		 * don't have to include it anyway.
+		 */
+		return 0;
+	}
+
+	return 1;
+}
+
 static void prepare_pack(int window, int depth)
 {
 	struct object_entry **delta_list;
@@ -3226,33 +3256,11 @@ static void prepare_pack(int window, int depth)
 	for (i = 0; i < to_pack.nr_objects; i++) {
 		struct object_entry *entry = to_pack.objects + i;
 
-		if (DELTA(entry))
-			/* This happens if we decided to reuse existing
-			 * delta from a pack.  "reuse_delta &&" is implied.
-			 */
-			continue;
-
-		if (!entry->type_valid ||
-		    oe_size_less_than(&to_pack, entry, 50))
+		if (!should_attempt_deltas(entry))
 			continue;
 
-		if (entry->no_try_delta)
-			continue;
-
-		if (!entry->preferred_base) {
+		if (!entry->preferred_base)
 			nr_deltas++;
-			if (oe_type(entry) < 0)
-				die(_("unable to get type of object %s"),
-				    oid_to_hex(&entry->idx.oid));
-		} else {
-			if (oe_type(entry) < 0) {
-				/*
-				 * This object is not found, but we
-				 * don't have to include it anyway.
-				 */
-				continue;
-			}
-		}
 
 		delta_list[n++] = entry;
 	}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 02/13] pack-objects: add --path-walk option
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-pack-objects.adoc        |  14 +-
 Documentation/technical/api-path-walk.adoc |   1 +
 builtin/pack-objects.c                     | 147 +++++++++++++++++++--
 t/t5300-pack-object.sh                     |  15 +++
 4 files changed, 167 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 7f69ae4855f..7065758eddf 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -16,7 +16,7 @@ SYNOPSIS
 	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse]
-	[--name-hash-version=<n>] < <object-list>
+	[--name-hash-version=<n>] [--path-walk] < <object-list>
 
 
 DESCRIPTION
@@ -375,6 +375,18 @@ many different directories. At the moment, this version is not allowed
 when writing reachability bitmap files with `--write-bitmap-index` and it
 will be automatically changed to version `1`.
 
+--path-walk::
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path they
+	appear relative to a commit's root tree. The `--path-walk` option
+	enables a different walking algorithm that organizes trees and
+	blobs by path. This has the potential to improve delta compression
+	especially in the presence of filenames that cause collisions in
+	Git's default name-hash algorithm. Due to changing how the objects
+	are walked, this option is not compatible with `--delta-islands`,
+	`--shallow`, or `--filter`. The `--use-bitmap-index` option will
+	be ignored in the presence of `--path-walk.`
+
 
 DELTA ISLANDS
 -------------
diff --git a/Documentation/technical/api-path-walk.adoc b/Documentation/technical/api-path-walk.adoc
index 3e089211fb4..e522695dd9f 100644
--- a/Documentation/technical/api-path-walk.adoc
+++ b/Documentation/technical/api-path-walk.adoc
@@ -69,4 +69,5 @@ Examples
 
 See example usages in:
 	`t/helper/test-path-walk.c`,
+	`builtin/pack-objects.c`,
 	`builtin/backfill.c`
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7805429f5d1..4f934847558 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -41,6 +41,9 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "blob.h"
+#include "tree.h"
+#include "path-walk.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -217,6 +220,7 @@ static int delta_search_threads;
 static int pack_to_stdout;
 static int sparse;
 static int thin;
+static int path_walk;
 static int num_preferred_base;
 static struct progress *progress_state;
 
@@ -4191,6 +4195,105 @@ static void mark_bitmap_preferred_tips(void)
 	}
 }
 
+static inline int is_oid_interesting(struct repository *repo,
+				     struct object_id *oid)
+{
+	struct object *o = lookup_object(repo, oid);
+	return o && !(o->flags & UNINTERESTING);
+}
+
+static int add_objects_by_path(const char *path,
+			       struct oid_array *oids,
+			       enum object_type type,
+			       void *data)
+{
+	struct object_entry **delta_list;
+	size_t oe_start = to_pack.nr_objects;
+	size_t oe_end;
+	unsigned int sub_list_size;
+	unsigned int *processed = data;
+
+	/*
+	 * First, add all objects to the packing data, including the ones
+	 * marked UNINTERESTING (translated to 'exclude') as they can be
+	 * used as delta bases.
+	 */
+	for (size_t i = 0; i < oids->nr; i++) {
+		int exclude;
+		struct object_info oi = OBJECT_INFO_INIT;
+		struct object_id *oid = &oids->oid[i];
+
+		/* Skip objects that do not exist locally. */
+		if (exclude_promisor_objects &&
+		    oid_object_info_extended(the_repository, oid, &oi,
+					     OBJECT_INFO_FOR_PREFETCH) < 0)
+			continue;
+
+		exclude = !is_oid_interesting(the_repository, oid);
+
+		if (exclude && !thin)
+			continue;
+
+		add_object_entry(oid, type, path, exclude);
+	}
+
+	oe_end = to_pack.nr_objects;
+
+	/* We can skip delta calculations if it is a no-op. */
+	if (oe_end == oe_start || !window)
+		return 0;
+
+	sub_list_size = 0;
+	ALLOC_ARRAY(delta_list, oe_end - oe_start);
+
+	for (size_t i = 0; i < oe_end - oe_start; i++) {
+		struct object_entry *entry = to_pack.objects + oe_start + i;
+
+		if (!should_attempt_deltas(entry))
+			continue;
+
+		delta_list[sub_list_size++] = entry;
+	}
+
+	/*
+	 * Find delta bases among this list of objects that all match the same
+	 * path. This causes the delta compression to be interleaved in the
+	 * object walk, which can lead to confusing progress indicators. This is
+	 * also incompatible with threaded delta calculations. In the future,
+	 * consider creating a list of regions in the full to_pack.objects array
+	 * that could be picked up by the threaded delta computation.
+	 */
+	if (sub_list_size && window) {
+		QSORT(delta_list, sub_list_size, type_size_sort);
+		find_deltas(delta_list, &sub_list_size, window, depth, processed);
+	}
+
+	free(delta_list);
+	return 0;
+}
+
+static void get_object_list_path_walk(struct rev_info *revs)
+{
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	unsigned int processed = 0;
+
+	info.revs = revs;
+	info.path_fn = add_objects_by_path;
+	info.path_fn_data = &processed;
+	revs->tag_objects = 1;
+
+	/*
+	 * Allow the --[no-]sparse option to be interesting here, if only
+	 * for testing purposes. Paths with no interesting objects will not
+	 * contribute to the resulting pack, but only create noisy preferred
+	 * base objects.
+	 */
+	info.prune_all_uninteresting = sparse;
+
+	if (walk_objects_by_path(&info))
+		die(_("failed to pack objects via path-walk"));
+}
+
 static void get_object_list(struct rev_info *revs, int ac, const char **av)
 {
 	struct setup_revision_opt s_r_opt = {
@@ -4237,7 +4340,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 
 	warn_on_object_refname_ambiguity = save_warning;
 
-	if (use_bitmap_index && !get_object_list_from_bitmap(revs))
+	if (use_bitmap_index && !path_walk && !get_object_list_from_bitmap(revs))
 		return;
 
 	if (use_delta_islands)
@@ -4246,15 +4349,19 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	if (write_bitmap_index)
 		mark_bitmap_preferred_tips();
 
-	if (prepare_revision_walk(revs))
-		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(revs, show_edge, sparse);
-
 	if (!fn_show_object)
 		fn_show_object = show_object;
-	traverse_commit_list(revs,
-			     show_commit, fn_show_object,
-			     NULL);
+
+	if (path_walk) {
+		get_object_list_path_walk(revs);
+	} else {
+		if (prepare_revision_walk(revs))
+			die(_("revision walk setup failed"));
+		mark_edges_uninteresting(revs, show_edge, sparse);
+		traverse_commit_list(revs,
+				show_commit, fn_show_object,
+				NULL);
+	}
 
 	if (unpack_unreachable_expiration) {
 		revs->ignore_missing_links = 1;
@@ -4464,6 +4571,8 @@ int cmd_pack_objects(int argc,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
+		OPT_BOOL(0, "path-walk", &path_walk,
+			 N_("use the path-walk API to walk objects when possible")),
 		OPT_BOOL(0, "shallow", &shallow,
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
@@ -4549,7 +4658,27 @@ int cmd_pack_objects(int argc,
 		window = 0;
 
 	strvec_push(&rp, "pack-objects");
-	if (thin) {
+
+	if (path_walk && filter_options.choice) {
+		warning(_("cannot use --filter with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk && use_delta_islands) {
+		warning(_("cannot use delta islands with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk && shallow) {
+		warning(_("cannot use --shallow with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk) {
+		strvec_push(&rp, "--boundary");
+		 /*
+		  * We must disable the bitmaps because we are removing
+		  * the --objects / --objects-edge[-aggressive] options.
+		  */
+		use_bitmap_index = 0;
+	} else if (thin) {
 		use_internal_rev_list = 1;
 		strvec_push(&rp, shallow
 				? "--objects-edge-aggressive"
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 5ac8d39094b..16420d12863 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -723,4 +723,19 @@ test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompat
 	! test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err
 '
 
+test_expect_success '--path-walk pack everything' '
+	git -C server rev-parse HEAD >in &&
+	git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+	git -C server index-pack --stdin <out.pack
+'
+
+test_expect_success '--path-walk thin pack' '
+	cat >in <<-EOF &&
+	$(git -C server rev-parse HEAD)
+	^$(git -C server rev-parse HEAD~2)
+	EOF
+	git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+	git -C server index-pack --fix-thin --stdin <out.pack
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 03/13] pack-objects: update usage to match docs
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-pack-objects.adoc | 14 +++++++-------
 builtin/pack-objects.c              | 10 ++++++++--
 t/t0450/adoc-help-mismatches        |  1 -
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 7065758eddf..7c666a14277 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -10,13 +10,13 @@ SYNOPSIS
 --------
 [verse]
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
-	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
-	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
-	[--cruft] [--cruft-expiration=<time>]
-	[--stdout [--filter=<filter-spec>] | <base-name>]
-	[--shallow] [--keep-true-parents] [--[no-]sparse]
-	[--name-hash-version=<n>] [--path-walk] < <object-list>
+		   [--no-reuse-delta] [--delta-base-offset] [--non-empty]
+		   [--local] [--incremental] [--window=<n>] [--depth=<n>]
+		   [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+		   [--cruft] [--cruft-expiration=<time>]
+		   [--stdout [--filter=<filter-spec>] | <base-name>]
+		   [--shallow] [--keep-true-parents] [--[no-]sparse]
+		   [--name-hash-version=<n>] [--path-walk] < <object-list>
 
 
 DESCRIPTION
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4f934847558..75a6545cca1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -187,8 +187,14 @@ static inline void oe_set_delta_size(struct packing_data *pack,
 #define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
 
 static const char *pack_usage[] = {
-	N_("git pack-objects --stdout [<options>] [< <ref-list> | < <object-list>]"),
-	N_("git pack-objects [<options>] <base-name> [< <ref-list> | < <object-list>]"),
+	N_("git pack-objects [-q | --progress | --all-progress] [--all-progress-implied]\n"
+	   "                 [--no-reuse-delta] [--delta-base-offset] [--non-empty]\n"
+	   "                 [--local] [--incremental] [--window=<n>] [--depth=<n>]\n"
+	   "                 [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]\n"
+	   "                 [--cruft] [--cruft-expiration=<time>]\n"
+	   "                 [--stdout [--filter=<filter-spec>] | <base-name>]\n"
+	   "                 [--shallow] [--keep-true-parents] [--[no-]sparse]\n"
+	   "                 [--name-hash-version=<n>] [--path-walk] < <object-list>"),
 	NULL
 };
 
diff --git a/t/t0450/adoc-help-mismatches b/t/t0450/adoc-help-mismatches
index c4a15fd0cb8..06b469bdee2 100644
--- a/t/t0450/adoc-help-mismatches
+++ b/t/t0450/adoc-help-mismatches
@@ -38,7 +38,6 @@ merge-one-file
 multi-pack-index
 name-rev
 notes
-pack-objects
 push
 range-diff
 rebase
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 04/13] p5313: add performance tests for --path-walk
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data (removing
the repack tests for --name-hash-version):

Test                                                     this tree
------------------------------------------------------------------------
5313.2: thin pack with --name-hash-version=1             0.02(0.01+0.01)
5313.3: thin pack size with --name-hash-version=1                   1.6K
5313.4: big pack with --name-hash-version=1              2.55(4.20+0.26)
5313.5: big pack size with --name-hash-version=1                   16.4M
5313.6: shallow fetch pack with --name-hash-version=1    1.24(2.03+0.08)
5313.7: shallow pack size with --name-hash-version=1               12.2M
5313.10: thin pack with --name-hash-version=2            0.03(0.01+0.01)
5313.11: thin pack size with --name-hash-version=2                  1.6K
5313.12: big pack with --name-hash-version=2             1.91(3.23+0.20)
5313.13: big pack size with --name-hash-version=2                  16.4M
5313.14: shallow fetch pack with --name-hash-version=2   1.06(1.57+0.10)
5313.15: shallow pack size with --name-hash-version=2              12.5M
5313.18: thin pack with --path-walk                      0.03(0.01+0.01)
5313.19: thin pack size with --path-walk                            1.6K
5313.20: big pack with --path-walk                       2.05(3.24+0.27)
5313.21: big pack size with --path-walk                            16.3M
5313.22: shallow fetch pack with --path-walk             1.08(1.66+0.07)
5313.23: shallow pack size with --path-walk                        12.4M

This can be reformatted as follows:

Pack Type            Hash v1   Hash v2     Path Walk
---------------------------------------------------
thin pack    (time)    0.02s      0.03s      0.03s
             (size)    1.6K       1.6K       1.6K
big pack     (time)    2.55s      1.91s      2.05s
             (size)   16.4M      16.4M      16.3M
shallow pack (time)    1.24s      1.06s      1.08s
             (size)   12.2M      12.5M      12.4M

Note that the timing is slower because there is no threading in the
--path-walk case (yet). Also, the shallow pack cases are really not
using the --path-walk logic right now because it is disabled until some
additions are made to the path walk API.

The cases where the --path-walk option really shines is when the default
name-hash is overwhelmed with collisions. An open source example can be
found in the microsoft/fluentui repo [1] at a certain commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following comparison table:

Pack Type            Hash v1    Hash v2    Path Walk
---------------------------------------------------
thin pack    (time)    0.36s      0.12s      0.08s
             (size)    1.2M      22.0K      18.4K
big pack     (time)    2.00s      2.90s      2.21s
             (size)   20.4M      25.9M      19.5M
shallow pack (time)    1.41s      1.80s      1.65s
             (size)   34.4M      33.7M      33.6M

Notice in particular that in the small thin pack, the time performance
has improved from 0.36s for --name-hash-version=1 to 0.08s and this is
likely due to the improved size of the resulting pack: 18.4K instead of
1.2M.  The relatively new --name-hash-version=2 is competitive with
--path-walk (0.12s and 22.0K) but not quite as successful.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Pack Type            Hash v1    Hash v2    Path Walk
---------------------------------------------------
thin pack    (time)    0.03s      0.13s      0.03s
             (size)    4.6K       4.6K       4.6K
big pack     (time)   15.29s     12.32s     13.92s
             (size)  201.1M     159.1M     158.5M
shallow pack (time)   10.88s     22.93s     22.74s
             (size)  269.2M     273.8M     267.7M

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/perf/p5313-pack-objects.sh | 37 ++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index be5229a0ecd..cd6dd3abb71 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -25,46 +25,55 @@ test_expect_success 'create rev input' '
 	EOF
 '
 
-for version in 1 2
-do
-	export version
+test_all_with_args () {
+	parameter=$1
+	export parameter
 
-	test_perf "thin pack with version $version" '
+	test_perf "thin pack with $parameter" '
 		git pack-objects --thin --stdout --revs --sparse \
-			--name-hash-version=$version <in-thin >out
+			$parameter <in-thin >out
 	'
 
-	test_size "thin pack size with version $version" '
+	test_size "thin pack size with $parameter" '
 		test_file_size out
 	'
 
-	test_perf "big pack with version $version" '
+	test_perf "big pack with $parameter" '
 		git pack-objects --stdout --revs --sparse \
-			--name-hash-version=$version <in-big >out
+			$parameter <in-big >out
 	'
 
-	test_size "big pack size with version $version" '
+	test_size "big pack size with $parameter" '
 		test_file_size out
 	'
 
-	test_perf "shallow fetch pack with version $version" '
+	test_perf "shallow fetch pack with $parameter" '
 		git pack-objects --stdout --revs --sparse --shallow \
-			--name-hash-version=$version <in-shallow >out
+			$parameter <in-shallow >out
 	'
 
-	test_size "shallow pack size with version $version" '
+	test_size "shallow pack size with $parameter" '
 		test_file_size out
 	'
+}
 
-	test_perf "repack with version $version" '
+for version in 1 2
+do
+	export version
+
+	test_all_with_args --name-hash-version=$version
+
+	test_perf "repack with --name-hash-version=$version" '
 		git repack -adf --name-hash-version=$version
 	'
 
-	test_size "repack size with version $version" '
+	test_size "repack size with --name-hash-version=$version" '
 		gitdir=$(git rev-parse --git-dir) &&
 		pack=$(ls $gitdir/objects/pack/pack-*.pack) &&
 		test_file_size "$pack"
 	'
 done
 
+test_all_with_args --path-walk
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the
linux-TEST-vars CI build as that's already an overloaded build.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c        | 12 ++++++++++--
 t/README                      |  4 ++++
 t/t0411-clone-from-partial.sh |  6 ++++++
 t/t5306-pack-nobase.sh        |  5 +++++
 t/t5310-pack-bitmaps.sh       | 13 +++++++++++--
 t/t5316-pack-delta-depth.sh   |  9 ++++++---
 t/t5332-multi-pack-reuse.sh   |  7 +++++++
 t/t5530-upload-pack-error.sh  |  6 ++++++
 t/t7406-submodule-update.sh   |  3 +++
 9 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 75a6545cca1..a6b8a78d42a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -226,7 +226,7 @@ static int delta_search_threads;
 static int pack_to_stdout;
 static int sparse;
 static int thin;
-static int path_walk;
+static int path_walk = -1;
 static int num_preferred_base;
 static struct progress *progress_state;
 
@@ -4230,7 +4230,7 @@ static int add_objects_by_path(const char *path,
 		struct object_id *oid = &oids->oid[i];
 
 		/* Skip objects that do not exist locally. */
-		if (exclude_promisor_objects &&
+		if ((exclude_promisor_objects || arg_missing_action != MA_ERROR) &&
 		    oid_object_info_extended(the_repository, oid, &oi,
 					     OBJECT_INFO_FOR_PREFETCH) < 0)
 			continue;
@@ -4648,6 +4648,14 @@ int cmd_pack_objects(int argc,
 	if (pack_to_stdout != !base_name || argc)
 		usage_with_options(pack_usage, pack_objects_options);
 
+	if (path_walk < 0) {
+		if (use_bitmap_index > 0 ||
+		    !use_internal_rev_list)
+			path_walk = 0;
+		else
+			path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
+	}
+
 	if (depth < 0)
 		depth = 0;
 	if (depth >= (1 << OE_DEPTH_BITS)) {
diff --git a/t/README b/t/README
index 53e5b4a7107..ae06e628815 100644
--- a/t/README
+++ b/t/README
@@ -415,6 +415,10 @@ GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
 builtin to use the non-sparse object walk. This can still be overridden by
 the --sparse command-line argument.
 
+GIT_TEST_PACK_PATH_WALK=<boolean> if enabled will default the pack-objects
+builtin to use the path-walk API for the object walk. This can still be
+overridden by the --no-path-walk command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t0411-clone-from-partial.sh b/t/t0411-clone-from-partial.sh
index 196fc617843..9e6bca56255 100755
--- a/t/t0411-clone-from-partial.sh
+++ b/t/t0411-clone-from-partial.sh
@@ -59,6 +59,12 @@ test_expect_success 'pack-objects should fetch from promisor remote and execute
 
 test_expect_success 'clone from promisor remote does not lazy-fetch by default' '
 	rm -f script-executed &&
+
+	# The --path-walk feature of "git pack-objects" is not
+	# compatible with this kind of fetch from an incomplete repo.
+	GIT_TEST_PACK_PATH_WALK=0 &&
+	export GIT_TEST_PACK_PATH_WALK &&
+
 	test_must_fail git clone evil no-lazy 2>err &&
 	test_grep "lazy fetching disabled" err &&
 	test_path_is_missing script-executed
diff --git a/t/t5306-pack-nobase.sh b/t/t5306-pack-nobase.sh
index 805d60ff317..609399d54fb 100755
--- a/t/t5306-pack-nobase.sh
+++ b/t/t5306-pack-nobase.sh
@@ -59,6 +59,11 @@ test_expect_success 'indirectly clone patch_clone' '
 	 git pull ../.git &&
 	 test $(git rev-parse HEAD) = $B &&
 
+	# The --path-walk feature of "git pack-objects" is not
+	# compatible with this kind of fetch from an incomplete repo.
+	GIT_TEST_PACK_PATH_WALK=0 &&
+	export GIT_TEST_PACK_PATH_WALK &&
+
 	 git pull ../patch_clone/.git &&
 	 test $(git rev-parse HEAD) = $C
 	)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 621bbbdd26e..e01df807a62 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -158,8 +158,9 @@ test_bitmap_cases () {
 		ls .git/objects/pack/ | grep bitmap >output &&
 		test_line_count = 1 output &&
 		# verify equivalent packs are generated with/without using bitmap index
-		packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
-		packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+		# Be careful to not use the path-walk option in either case.
+		packasha1=$(git pack-objects --no-use-bitmap-index --no-path-walk --all packa </dev/null) &&
+		packbsha1=$(git pack-objects --use-bitmap-index --no-path-walk --all packb </dev/null) &&
 		list_packed_objects packa-$packasha1.idx >packa.objects &&
 		list_packed_objects packb-$packbsha1.idx >packb.objects &&
 		test_cmp packa.objects packb.objects
@@ -388,6 +389,14 @@ test_bitmap_cases () {
 		git init --bare client.git &&
 		(
 			cd client.git &&
+
+			# This test relies on reusing a delta, but if the
+			# path-walk machinery is engaged, the base object
+			# is considered too small to use during the
+			# dynamic computation, so is not used.
+			GIT_TEST_PACK_PATH_WALK=0 &&
+			export GIT_TEST_PACK_PATH_WALK &&
+
 			git config transfer.unpackLimit 1 &&
 			git fetch .. delta-reuse-old:delta-reuse-old &&
 			git fetch .. delta-reuse-new:delta-reuse-new &&
diff --git a/t/t5316-pack-delta-depth.sh b/t/t5316-pack-delta-depth.sh
index 32cf4227451..167c3a35234 100755
--- a/t/t5316-pack-delta-depth.sh
+++ b/t/t5316-pack-delta-depth.sh
@@ -89,15 +89,18 @@ max_chain() {
 # adjusted (or scrapped if the heuristics have become too unreliable)
 test_expect_success 'packing produces a long delta' '
 	# Use --window=0 to make sure we are seeing reused deltas,
-	# not computing a new long chain.
-	pack=$(git pack-objects --all --window=0 </dev/null pack) &&
+	# not computing a new long chain. (Also avoid the --path-walk
+	# option as it may break delta chains.)
+	pack=$(git pack-objects --all --window=0 --no-path-walk </dev/null pack) &&
 	echo 9 >expect &&
 	max_chain pack-$pack.pack >actual &&
 	test_cmp expect actual
 '
 
 test_expect_success '--depth limits depth' '
-	pack=$(git pack-objects --all --depth=5 </dev/null pack) &&
+	# Avoid --path-walk to avoid breaking delta chains across path
+	# boundaries.
+	pack=$(git pack-objects --all --depth=5 --no-path-walk </dev/null pack) &&
 	echo 5 >expect &&
 	max_chain pack-$pack.pack >actual &&
 	test_cmp expect actual
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 57cad7708f8..395d09444ce 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -7,6 +7,13 @@ test_description='pack-objects multi-pack reuse'
 
 GIT_TEST_MULTI_PACK_INDEX=0
 GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# The --path-walk option does not consider the preferred pack
+# at all for reusing deltas, so this variable changes the
+# behavior of this test, if enabled.
+GIT_TEST_PACK_PATH_WALK=0
+export GIT_TEST_PACK_PATH_WALK
+
 objdir=.git/objects
 packdir=$objdir/pack
 
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 558eedf25a4..8eb6fea839a 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -34,6 +34,12 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
 	hexsz=$(test_oid hexsz) &&
 	printf "%04xwant %s\n00000009done\n0000" \
 		$(($hexsz + 10)) $head >input &&
+
+	# The current implementation of path-walk causes a different
+	# error message. This will be changed by a future refactoring.
+	GIT_TEST_PACK_PATH_WALK=0 &&
+	export GIT_TEST_PACK_PATH_WALK &&
+
 	test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
 	test_grep "unable to read" output.err &&
 	test_grep "pack-objects died" output.err
diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index c562bad042a..ab76d4b6dc4 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -1095,12 +1095,15 @@ test_expect_success 'submodule update --quiet passes quietness to fetch with a s
 	(cd super5 &&
 	 # This test var can mess with the stderr output checked in this test.
 	 GIT_TEST_NAME_HASH_VERSION=1 \
+	 GIT_TEST_PACK_PATH_WALK=0 \
 		git submodule update --quiet --init --depth=1 submodule3 >out 2>err &&
 	 test_must_be_empty out &&
 	 test_must_be_empty err
 	) &&
 	git clone super4 super6 &&
 	(cd super6 &&
+	 # This test variable will create a "warning" message to stderr
+	 GIT_TEST_PACK_PATH_WALK=0 \
 	 git submodule update --init --depth=1 submodule3 >out 2>err &&
 	 test_file_not_empty out &&
 	 test_file_not_empty err
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/t5538-push-shallow.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/t/t5538-push-shallow.sh b/t/t5538-push-shallow.sh
index e91fcc173e8..11b85cca9e8 100755
--- a/t/t5538-push-shallow.sh
+++ b/t/t5538-push-shallow.sh
@@ -123,4 +123,38 @@ EOF
 	git cat-file blob $(echo 1|git hash-object --stdin) >/dev/null
 	)
 '
+
+test_expect_success 'push new commit from shallow clone has correct object count' '
+	git init origin &&
+	test_commit -C origin a &&
+	test_commit -C origin b &&
+
+	git clone --depth=1 "file://$(pwd)/origin" client &&
+	git -C client checkout -b topic &&
+	git -C client commit --allow-empty -m "empty" &&
+	GIT_PROGRESS_DELAY=0 git -C client push --progress origin topic 2>err &&
+	test_grep "Enumerating objects: 1, done." err
+'
+
+test_expect_success 'push new commit from shallow clone has good deltas' '
+	git init base &&
+	test_seq 1 999 >base/a &&
+	test_commit -C base initial &&
+	git -C base add a &&
+	git -C base commit -m "big a" &&
+
+	git clone --depth=1 "file://$(pwd)/base" deltas &&
+	git -C deltas checkout -b deltas &&
+	test_seq 1 1000 >deltas/a &&
+	git -C deltas commit -a -m "bigger a" &&
+	GIT_TRACE2_PERF="$(pwd)/trace.txt" \
+	GIT_PROGRESS_DELAY=0 git -C deltas push --progress origin deltas 2>err &&
+
+	test_grep "Enumerating objects: 5, done" err &&
+
+	# If the delta base is found, then this message uses "bytes".
+	# If the delta base is not found, then this message uses "KiB".
+	test_grep "Writing objects: .* bytes" err
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 07/13] repack: add --path-walk option
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the --path-walk tests in p5313 look like this:

Test                                                     this tree
-------------------------------------------------------------------------
5313.18: thin pack with --path-walk                      0.08(0.06+0.02)
5313.19: thin pack size with --path-walk                           18.4K
5313.20: big pack with --path-walk                       2.10(7.80+0.26)
5313.21: big pack size with --path-walk                            19.8M
5313.22: shallow fetch pack with --path-walk             1.62(3.38+0.17)
5313.23: shallow pack size with --path-walk                        33.6M
5313.24: repack with --path-walk                         81.29(96.08+0.71)
5313.25: repack size with --path-walk                             142.5M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Along with the earlier tests in p5313, I'll instead reformat the
comparison as follows:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      81.29s

There are a few things to notice here:

 1. The benefits of --name-hash-version=2 over --name-hash-version=1 are
    significant, but --path-walk still compresses better than that
    option.

 2. The --path-walk command is still using --name-hash-version=1 for the
    second pass of delta computation, using the increased name hash
    collisions as a potential method for opportunistic compression on
    top of the path-focused compression.

 3. The --path-walk algorithm is currently sequential and does not use
    multiple threads for delta compression. Threading will be
    implemented in a future change so the computation time will improve
    to better compete in this metric.

There are small benefits in size for my copy of the Git repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             248.8M      30.44s
Hash v2             249.0M      30.15s
Path Walk           213.2M     142.50s

As well as in the nodejs/node repository [3]:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             739.9M      71.18s
Hash v2             764.6M      67.82s
Path Walk           698.1M     208.10s

[3] https://github.com/nodejs/node

This benefit also repeats in my copy of the Linux kernel repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1               2.5G     554.41s
Hash v2               2.5G     549.62s
Path Walk             2.2G    1562.36s

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method.

As this repacking strategy was released in Git for Windows 2.47.0, some
users have reported cases where the --path-walk compression is slightly
worse than the --name-hash-version=2 option. In those cases, it may be
beneficial to combine the two options. However, there has not been a
released version of Git that has both options and I don't have access to
these repos for testing.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-repack.adoc | 14 +++++++++++++-
 builtin/repack.c              |  7 ++++++-
 t/perf/p5313-pack-objects.sh  | 18 ++++++++----------
 3 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/Documentation/git-repack.adoc b/Documentation/git-repack.adoc
index 5852a5c9736..2199eb3b94f 100644
--- a/Documentation/git-repack.adoc
+++ b/Documentation/git-repack.adoc
@@ -11,7 +11,7 @@ SYNOPSIS
 [verse]
 'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
 	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
-	[--write-midx] [--name-hash-version=<n>]
+	[--write-midx] [--name-hash-version=<n>] [--path-walk]
 
 DESCRIPTION
 -----------
@@ -255,6 +255,18 @@ linkgit:git-multi-pack-index[1]).
 	Provide this argument to the underlying `git pack-objects` process.
 	See linkgit:git-pack-objects[1] for full details.
 
+--path-walk::
+	This option passes the `--path-walk` option to the underlying
+	`git pack-options` process (see linkgit:git-pack-objects[1]).
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path they
+	appear relative to a commit's root tree. The `--path-walk` option
+	enables a different walking algorithm that organizes trees and
+	blobs by path. This has the potential to improve delta compression
+	especially in the presence of filenames that cause collisions in
+	Git's default name-hash algorithm. Due to changing how the objects
+	are walked, this option is not compatible with `--delta-islands`
+	or `--filter`.
 
 CONFIGURATION
 -------------
diff --git a/builtin/repack.c b/builtin/repack.c
index 75e3752353a..d7f798280c0 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -43,7 +43,7 @@ static char *packdir, *packtmp_name, *packtmp;
 static const char *const git_repack_usage[] = {
 	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
 	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
-	   "[--write-midx] [--name-hash-version=<n>]"),
+	   "[--write-midx] [--name-hash-version=<n>] [--path-walk]"),
 	NULL
 };
 
@@ -63,6 +63,7 @@ struct pack_objects_args {
 	int quiet;
 	int local;
 	int name_hash_version;
+	int path_walk;
 	struct list_objects_filter_options filter_options;
 };
 
@@ -313,6 +314,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--no-reuse-object");
 	if (args->name_hash_version)
 		strvec_pushf(&cmd->args, "--name-hash-version=%d", args->name_hash_version);
+	if (args->path_walk)
+		strvec_pushf(&cmd->args, "--path-walk");
 	if (args->local)
 		strvec_push(&cmd->args,  "--local");
 	if (args->quiet)
@@ -1212,6 +1215,8 @@ int cmd_repack(int argc,
 				N_("pass --no-reuse-object to git-pack-objects")),
 		OPT_INTEGER(0, "name-hash-version", &po_args.name_hash_version,
 				N_("specify the name hash version to use for grouping similar objects by path")),
+		OPT_BOOL(0, "path-walk", &po_args.path_walk,
+				N_("pass --path-walk to git-pack-objects")),
 		OPT_NEGBIT('n', NULL, &run_update_server_info,
 				N_("do not run git-update-server-info"), 1),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index cd6dd3abb71..98748b0e203 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -55,23 +55,21 @@ test_all_with_args () {
 	test_size "shallow pack size with $parameter" '
 		test_file_size out
 	'
-}
-
-for version in 1 2
-do
-	export version
-
-	test_all_with_args --name-hash-version=$version
 
-	test_perf "repack with --name-hash-version=$version" '
-		git repack -adf --name-hash-version=$version
+	test_perf "repack with $parameter" '
+		git repack -adf $parameter
 	'
 
-	test_size "repack size with --name-hash-version=$version" '
+	test_size "repack size with $parameter" '
 		gitdir=$(git rev-parse --git-dir) &&
 		pack=$(ls $gitdir/objects/pack/pack-*.pack) &&
 		test_file_size "$pack"
 	'
+}
+
+for version in 1 2
+do
+	test_all_with_args --name-hash-version=$version
 done
 
 test_all_with_args --path-walk
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 08/13] pack-objects: enable --path-walk via config
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/config/feature.adoc | 4 ++++
 Documentation/config/pack.adoc    | 8 ++++++++
 builtin/pack-objects.c            | 3 +++
 repo-settings.c                   | 3 +++
 repo-settings.h                   | 1 +
 5 files changed, 19 insertions(+)

diff --git a/Documentation/config/feature.adoc b/Documentation/config/feature.adoc
index f061b64b748..cb49ff2604a 100644
--- a/Documentation/config/feature.adoc
+++ b/Documentation/config/feature.adoc
@@ -20,6 +20,10 @@ walking fewer objects.
 +
 * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
 reusing objects from multiple packs instead of just one.
++
+* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
+significantly smaller in the presence of certain filename collisions with Git's
+default name-hash.
 
 feature.manyFiles::
 	Enable config options that optimize for repos with many files in the
diff --git a/Documentation/config/pack.adoc b/Documentation/config/pack.adoc
index da527377faf..08d06271177 100644
--- a/Documentation/config/pack.adoc
+++ b/Documentation/config/pack.adoc
@@ -155,6 +155,14 @@ pack.useSparse::
 	commits contain certain types of direct renames. Default is
 	`true`.
 
+pack.usePathWalk::
+	When true, git will default to using the '--path-walk' option in
+	'git pack-objects' when the '--revs' option is present. This
+	algorithm groups objects by path to maximize the ability to
+	compute delta chains across historical versions of the same
+	object. This may disable other options, such as using bitmaps to
+	enumerate objects.
+
 pack.preferBitmapTips::
 	When selecting which commits will receive bitmaps, prefer a
 	commit at the tip of any reference that is a suffix of any value
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a6b8a78d42a..0ea85754c52 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4652,6 +4652,9 @@ int cmd_pack_objects(int argc,
 		if (use_bitmap_index > 0 ||
 		    !use_internal_rev_list)
 			path_walk = 0;
+		else if (the_repository->gitdir &&
+			 the_repository->settings.pack_use_path_walk)
+			path_walk = 1;
 		else
 			path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
 	}
diff --git a/repo-settings.c b/repo-settings.c
index 67e9cfd2e63..9b5595c708e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,11 +47,13 @@ void prepare_repo_settings(struct repository *r)
 		r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
 		r->settings.pack_use_bitmap_boundary_traversal = 1;
 		r->settings.pack_use_multi_pack_reuse = 1;
+		r->settings.pack_use_path_walk = 1;
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
 		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
+		r->settings.pack_use_path_walk = 1;
 	}
 
 	/* Commit graph config or default, does not cascade (simple) */
@@ -66,6 +68,7 @@ void prepare_repo_settings(struct repository *r)
 
 	/* Boolean config or default, does not cascade (simple)  */
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
+	repo_cfg_bool(r, "pack.usepathwalk", &r->settings.pack_use_path_walk, 0);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
diff --git a/repo-settings.h b/repo-settings.h
index ddc11967e01..a31decad221 100644
--- a/repo-settings.h
+++ b/repo-settings.h
@@ -56,6 +56,7 @@ struct repo_settings {
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
+	int pack_use_path_walk;
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 09/13] scalar: enable path-walk during push via config
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 scalar.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scalar.c b/scalar.c
index da42b4be0cc..bf638fa34b8 100644
--- a/scalar.c
+++ b/scalar.c
@@ -170,6 +170,7 @@ static int set_recommended_config(int reconfigure)
 		{ "core.autoCRLF", "false" },
 		{ "core.safeCRLF", "false" },
 		{ "fetch.showForcedUpdates", "false" },
+		{ "pack.usePathWalk", "true" },
 		{ NULL, NULL },
 	};
 	int i;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 10/13] pack-objects: refactor path-walk delta phase
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c       | 82 +++++++++++++++++++++++++-----------
 pack-objects.h               | 12 ++++++
 t/t5300-pack-object.sh       |  8 +++-
 t/t5530-upload-pack-error.sh |  6 ---
 4 files changed, 75 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0ea85754c52..d4e05ca4434 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3236,6 +3236,51 @@ static int should_attempt_deltas(struct object_entry *entry)
 	return 1;
 }
 
+static void find_deltas_for_region(struct object_entry *list,
+				   struct packing_region *region,
+				   unsigned int *processed)
+{
+	struct object_entry **delta_list;
+	uint32_t delta_list_nr = 0;
+
+	ALLOC_ARRAY(delta_list, region->nr);
+	for (uint32_t i = 0; i < region->nr; i++) {
+		struct object_entry *entry = list + region->start + i;
+		if (should_attempt_deltas(entry))
+			delta_list[delta_list_nr++] = entry;
+	}
+
+	QSORT(delta_list, delta_list_nr, type_size_sort);
+	find_deltas(delta_list, &delta_list_nr, window, depth, processed);
+	free(delta_list);
+}
+
+static void find_deltas_by_region(struct object_entry *list,
+				  struct packing_region *regions,
+				  uint32_t start, uint32_t nr)
+{
+	unsigned int processed = 0;
+	uint32_t progress_nr;
+
+	if (!nr)
+		return;
+
+	progress_nr = regions[nr - 1].start + regions[nr - 1].nr;
+
+	if (progress)
+		progress_state = start_progress(the_repository,
+						_("Compressing objects by path"),
+						progress_nr);
+
+	while (nr--)
+		find_deltas_for_region(list,
+				       &regions[start++],
+				       &processed);
+
+	display_progress(progress_state, progress_nr);
+	stop_progress(&progress_state);
+}
+
 static void prepare_pack(int window, int depth)
 {
 	struct object_entry **delta_list;
@@ -3260,6 +3305,10 @@ static void prepare_pack(int window, int depth)
 	if (!to_pack.nr_objects || !window || !depth)
 		return;
 
+	if (path_walk)
+		find_deltas_by_region(to_pack.objects, to_pack.regions,
+				      0, to_pack.nr_regions);
+
 	ALLOC_ARRAY(delta_list, to_pack.nr_objects);
 	nr_deltas = n = 0;
 
@@ -4213,10 +4262,8 @@ static int add_objects_by_path(const char *path,
 			       enum object_type type,
 			       void *data)
 {
-	struct object_entry **delta_list;
 	size_t oe_start = to_pack.nr_objects;
 	size_t oe_end;
-	unsigned int sub_list_size;
 	unsigned int *processed = data;
 
 	/*
@@ -4249,32 +4296,17 @@ static int add_objects_by_path(const char *path,
 	if (oe_end == oe_start || !window)
 		return 0;
 
-	sub_list_size = 0;
-	ALLOC_ARRAY(delta_list, oe_end - oe_start);
-
-	for (size_t i = 0; i < oe_end - oe_start; i++) {
-		struct object_entry *entry = to_pack.objects + oe_start + i;
+	ALLOC_GROW(to_pack.regions,
+		   to_pack.nr_regions + 1,
+		   to_pack.nr_regions_alloc);
 
-		if (!should_attempt_deltas(entry))
-			continue;
-
-		delta_list[sub_list_size++] = entry;
-	}
+	to_pack.regions[to_pack.nr_regions].start = oe_start;
+	to_pack.regions[to_pack.nr_regions].nr = oe_end - oe_start;
+	to_pack.nr_regions++;
 
-	/*
-	 * Find delta bases among this list of objects that all match the same
-	 * path. This causes the delta compression to be interleaved in the
-	 * object walk, which can lead to confusing progress indicators. This is
-	 * also incompatible with threaded delta calculations. In the future,
-	 * consider creating a list of regions in the full to_pack.objects array
-	 * that could be picked up by the threaded delta computation.
-	 */
-	if (sub_list_size && window) {
-		QSORT(delta_list, sub_list_size, type_size_sort);
-		find_deltas(delta_list, &sub_list_size, window, depth, processed);
-	}
+	*processed += oids->nr;
+	display_progress(progress_state, *processed);
 
-	free(delta_list);
 	return 0;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index d73e3843c92..7ba9f3448fe 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -119,11 +119,23 @@ struct object_entry {
 	unsigned ext_base:1; /* delta_idx points outside packlist */
 };
 
+/**
+ * A packing region is a section of the packing_data.objects array
+ * as given by a starting index and a number of elements.
+ */
+struct packing_region {
+	uint32_t start;
+	uint32_t nr;
+};
+
 struct packing_data {
 	struct repository *repo;
 	struct object_entry *objects;
 	uint32_t nr_objects, nr_alloc;
 
+	struct packing_region *regions;
+	uint64_t nr_regions, nr_regions_alloc;
+
 	int32_t *index;
 	uint32_t index_size;
 
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 16420d12863..c8df6afd784 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -725,7 +725,9 @@ test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompat
 
 test_expect_success '--path-walk pack everything' '
 	git -C server rev-parse HEAD >in &&
-	git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+	GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+		--stdout --revs --path-walk --progress <in >out.pack 2>err &&
+	grep "Compressing objects by path" err &&
 	git -C server index-pack --stdin <out.pack
 '
 
@@ -734,7 +736,9 @@ test_expect_success '--path-walk thin pack' '
 	$(git -C server rev-parse HEAD)
 	^$(git -C server rev-parse HEAD~2)
 	EOF
-	git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+	GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+		--thin --stdout --revs --path-walk --progress <in >out.pack 2>err &&
+	grep "Compressing objects by path" err &&
 	git -C server index-pack --fix-thin --stdin <out.pack
 '
 
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 8eb6fea839a..558eedf25a4 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -34,12 +34,6 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
 	hexsz=$(test_oid hexsz) &&
 	printf "%04xwant %s\n00000009done\n0000" \
 		$(($hexsz + 10)) $head >input &&
-
-	# The current implementation of path-walk causes a different
-	# error message. This will be changed by a future refactoring.
-	GIT_TEST_PACK_PATH_WALK=0 &&
-	export GIT_TEST_PACK_PATH_WALK &&
-
 	test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
 	test_grep "unable to read" output.err &&
 	test_grep "pack-objects died" output.err
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 11/13] pack-objects: thread the path-based compression
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                        HEAD~1      HEAD
---------------------------------------------------
5313.20: big pack             2.38      1.99 -16.4%
5313.21: big pack size       16.1M     16.0M  -0.2%
5313.24: repack             107.32     45.41 -57.7%
5313.25: repack size        213.3M    213.2M  -0.0%

(Test output is formatted to better fit in message.)

This ~60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the --name-hash-version=1
case. These time improvements correlate with repositories with data
shapes that significantly improve their data size as well. The
--path-walk feature frequently takes longer than --name-hash-verison=2,
trading some extrac computation for some additional compression. The
natural place where this additional computation comes from is the two
compression passes that --path-walk takes, though the first pass is
naturally faster due to the path boundaries avoiding a number of delta
compression attempts.

For example, the microsoft/fluentui repo has significant size reduction
from --name-hash-version=1 to --name-hash-version=2 followed by further
improvements with --path-walk. The threaded computation makes
--path-walk more competitive in time compared to --name-hash-version=2,
though still ~31% more expensive in that metric.

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                439.4M      87.24s
Hash v2                161.7M      21.51s
Path Walk (Before)     142.5M      81.29s
Path Walk (After)      142.5M      28.16s

Similar results hold for the Git repository:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                248.8M      30.44s
Hash v2                249.0M      30.15s
Path Walk (Before)     213.2M     142.50s
Path Walk (After)      213.3M      45.41s

...as well as the nodejs/node repository:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk (Before)     698.1M     208.10s
Path Walk (After)      698.0M      75.10s

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more subtle:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk (before)       2.2G    1562.36s
Path Walk (before)       2.2G     559.00s

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c | 163 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 161 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d4e05ca4434..2a6246c1e78 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2964,6 +2964,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
 struct thread_params {
 	pthread_t thread;
 	struct object_entry **list;
+	struct packing_region *regions;
 	unsigned list_size;
 	unsigned remaining;
 	int window;
@@ -3281,6 +3282,164 @@ static void find_deltas_by_region(struct object_entry *list,
 	stop_progress(&progress_state);
 }
 
+static void *threaded_find_deltas_by_path(void *arg)
+{
+	struct thread_params *me = arg;
+
+	progress_lock();
+	while (me->remaining) {
+		while (me->remaining) {
+			progress_unlock();
+			find_deltas_for_region(to_pack.objects,
+					       me->regions,
+					       me->processed);
+			progress_lock();
+			me->remaining--;
+			me->regions++;
+		}
+
+		me->working = 0;
+		pthread_cond_signal(&progress_cond);
+		progress_unlock();
+
+		/*
+		 * We must not set ->data_ready before we wait on the
+		 * condition because the main thread may have set it to 1
+		 * before we get here. In order to be sure that new
+		 * work is available if we see 1 in ->data_ready, it
+		 * was initialized to 0 before this thread was spawned
+		 * and we reset it to 0 right away.
+		 */
+		pthread_mutex_lock(&me->mutex);
+		while (!me->data_ready)
+			pthread_cond_wait(&me->cond, &me->mutex);
+		me->data_ready = 0;
+		pthread_mutex_unlock(&me->mutex);
+
+		progress_lock();
+	}
+	progress_unlock();
+	/* leave ->working 1 so that this doesn't get more work assigned */
+	return NULL;
+}
+
+static void ll_find_deltas_by_region(struct object_entry *list,
+				     struct packing_region *regions,
+				     uint32_t start, uint32_t nr)
+{
+	struct thread_params *p;
+	int i, ret, active_threads = 0;
+	unsigned int processed = 0;
+	uint32_t progress_nr;
+	init_threaded_search();
+
+	if (!nr)
+		return;
+
+	progress_nr =  regions[nr - 1].start + regions[nr - 1].nr;
+	if (delta_search_threads <= 1) {
+		find_deltas_by_region(list, regions, start, nr);
+		cleanup_threaded_search();
+		return;
+	}
+
+	if (progress > pack_to_stdout)
+		fprintf_ln(stderr, _("Path-based delta compression using up to %d threads"),
+			   delta_search_threads);
+	CALLOC_ARRAY(p, delta_search_threads);
+
+	if (progress)
+		progress_state = start_progress(the_repository,
+						_("Compressing objects by path"),
+						progress_nr);
+	/* Partition the work amongst work threads. */
+	for (i = 0; i < delta_search_threads; i++) {
+		unsigned sub_size = nr / (delta_search_threads - i);
+
+		p[i].window = window;
+		p[i].depth = depth;
+		p[i].processed = &processed;
+		p[i].working = 1;
+		p[i].data_ready = 0;
+
+		p[i].regions = regions;
+		p[i].list_size = sub_size;
+		p[i].remaining = sub_size;
+
+		regions += sub_size;
+		nr -= sub_size;
+	}
+
+	/* Start work threads. */
+	for (i = 0; i < delta_search_threads; i++) {
+		if (!p[i].list_size)
+			continue;
+		pthread_mutex_init(&p[i].mutex, NULL);
+		pthread_cond_init(&p[i].cond, NULL);
+		ret = pthread_create(&p[i].thread, NULL,
+				     threaded_find_deltas_by_path, &p[i]);
+		if (ret)
+			die(_("unable to create thread: %s"), strerror(ret));
+		active_threads++;
+	}
+
+	/*
+	 * Now let's wait for work completion.  Each time a thread is done
+	 * with its work, we steal half of the remaining work from the
+	 * thread with the largest number of unprocessed objects and give
+	 * it to that newly idle thread.  This ensure good load balancing
+	 * until the remaining object list segments are simply too short
+	 * to be worth splitting anymore.
+	 */
+	while (active_threads) {
+		struct thread_params *target = NULL;
+		struct thread_params *victim = NULL;
+		unsigned sub_size = 0;
+
+		progress_lock();
+		for (;;) {
+			for (i = 0; !target && i < delta_search_threads; i++)
+				if (!p[i].working)
+					target = &p[i];
+			if (target)
+				break;
+			pthread_cond_wait(&progress_cond, &progress_mutex);
+		}
+
+		for (i = 0; i < delta_search_threads; i++)
+			if (p[i].remaining > 2*window &&
+			    (!victim || victim->remaining < p[i].remaining))
+				victim = &p[i];
+		if (victim) {
+			sub_size = victim->remaining / 2;
+			target->regions = victim->regions + victim->remaining - sub_size;
+			victim->list_size -= sub_size;
+			victim->remaining -= sub_size;
+		}
+		target->list_size = sub_size;
+		target->remaining = sub_size;
+		target->working = 1;
+		progress_unlock();
+
+		pthread_mutex_lock(&target->mutex);
+		target->data_ready = 1;
+		pthread_cond_signal(&target->cond);
+		pthread_mutex_unlock(&target->mutex);
+
+		if (!sub_size) {
+			pthread_join(target->thread, NULL);
+			pthread_cond_destroy(&target->cond);
+			pthread_mutex_destroy(&target->mutex);
+			active_threads--;
+		}
+	}
+	cleanup_threaded_search();
+	free(p);
+
+	display_progress(progress_state, progress_nr);
+	stop_progress(&progress_state);
+}
+
 static void prepare_pack(int window, int depth)
 {
 	struct object_entry **delta_list;
@@ -3306,8 +3465,8 @@ static void prepare_pack(int window, int depth)
 		return;
 
 	if (path_walk)
-		find_deltas_by_region(to_pack.objects, to_pack.regions,
-				      0, to_pack.nr_regions);
+		ll_find_deltas_by_region(to_pack.objects, to_pack.regions,
+					 0, to_pack.nr_regions);
 
 	ALLOC_ARRAY(delta_list, to_pack.nr_objects);
 	nr_deltas = n = 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  2025-03-24 15:22   ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In preparation for allowing both the --shallow and --path-walk options
in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
in the path-walk API. This option will help walk the boundary more
thoroughly and help avoid sending extra objects during fetches and
pushes.

The only use of the 'edge_hint_aggressive' option in the revision API is
within mark_edges_uninteresting(), which is usually called before
between prepare_revision_walk() and before visiting commits with
get_revision(). In prepare_revision_walk(), the UNINTERESTING commits
are walked until a boundary is found.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.adoc |  8 ++++++++
 path-walk.c                                |  6 +++++-
 path-walk.h                                |  7 +++++++
 t/helper/test-path-walk.c                  |  2 ++
 t/t6601-path-walk.sh                       | 20 ++++++++++++++++++++
 5 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/api-path-walk.adoc b/Documentation/technical/api-path-walk.adoc
index e522695dd9f..34c905eb9c3 100644
--- a/Documentation/technical/api-path-walk.adoc
+++ b/Documentation/technical/api-path-walk.adoc
@@ -56,6 +56,14 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.
 
+`edge_aggressive`::
+	For performance reasons, usually only the boundary commits are
+	explored to find UNINTERESTING objects. However, in the case of
+	shallow clones it can be helpful to mark all trees and blobs
+	reachable from UNINTERESTING tip commits as UNINTERESTING. This
+	matches the behavior of `--objects-edge-aggressive` in the
+	revision API.
+
 `pl`::
 	This pattern list pointer allows focusing the path-walk search to
 	a set of patterns, only emitting paths that match the given
diff --git a/path-walk.c b/path-walk.c
index 341bdd2ba4e..2d4ddbadd50 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -503,7 +503,11 @@ int walk_objects_by_path(struct path_walk_info *info)
 	if (prepare_revision_walk(info->revs))
 		die(_("failed to setup revision walk"));
 
-	/* Walk trees to mark them as UNINTERESTING. */
+	/*
+	 * Walk trees to mark them as UNINTERESTING.
+	 * This is particularly important when 'edge_aggressive' is set.
+	 */
+	info->revs->edge_hint_aggressive = info->edge_aggressive;
 	edge_repo = info->revs->repo;
 	edge_tree_list = root_tree_list;
 	mark_edges_uninteresting(info->revs, show_edge,
diff --git a/path-walk.h b/path-walk.h
index 473ee9d361c..5ef5a8440e6 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -50,6 +50,13 @@ struct path_walk_info {
 	 */
 	int prune_all_uninteresting;
 
+	/**
+	 * When 'edge_aggressive' is set, then the revision walk will use
+	 * the '--object-edge-aggressive' option to mark even more objects
+	 * as uninteresting.
+	 */
+	int edge_aggressive;
+
 	/**
 	 * Specify a sparse-checkout definition to match our paths to. Do not
 	 * walk outside of this sparse definition. If the patterns are in
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 61e845e5ec2..fe63002c2be 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -82,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tree objects")),
 		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
 			 N_("toggle pruning of uninteresting paths")),
+		OPT_BOOL(0, "edge-aggressive", &info.edge_aggressive,
+			 N_("toggle aggressive edge walk")),
 		OPT_BOOL(0, "stdin-pl", &stdin_pl,
 			 N_("read a pattern list over stdin")),
 		OPT_END(),
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index c89b0f1e19d..785c2f22373 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -378,6 +378,26 @@ test_expect_success 'topic, not base, boundary with pruning' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'topic, not base, --edge-aggressive with pruning' '
+	test-tool path-walk --prune --edge-aggressive -- topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base^{tree}):UNINTERESTING
+	2:tree:right/:$(git rev-parse topic:right)
+	2:tree:right/:$(git rev-parse base:right):UNINTERESTING
+	3:blob:right/c:$(git rev-parse base:right/c):UNINTERESTING
+	3:blob:right/c:$(git rev-parse topic:right/c)
+	blobs:2
+	commits:1
+	tags:0
+	trees:4
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'trees are reported exactly once' '
 	test_when_finished "rm -rf unique-trees" &&
 	test_create_repo unique-trees &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk
  2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (11 preceding siblings ...)
  2025-03-24 15:22   ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
@ 2025-03-24 15:22   ` Derrick Stolee via GitGitGadget
  12 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-03-24 15:22 UTC (permalink / raw)
  To: git
  Cc: christian.couder, gitster, johannes.schindelin, johncai86,
	jonathantanmy, karthik.188, kristofferhaugsbakk, me, newren, peff,
	ps, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.

However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.

The recently-added tests in t5538-push-shallow.sh help to confirm this
behavior is working with the --path-walk option if
GIT_TEST_PACK_PATH_WALK is enabled. These tests passed previously due to
the --path-walk feature being disabled in the presence of a shallow
clone.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a6246c1e78..7db2ebc7962 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -209,6 +209,7 @@ static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
 static int cruft;
+static int shallow = 0;
 static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
@@ -4486,6 +4487,7 @@ static void get_object_list_path_walk(struct rev_info *revs)
 	 * base objects.
 	 */
 	info.prune_all_uninteresting = sparse;
+	info.edge_aggressive = shallow;
 
 	if (walk_objects_by_path(&info))
 		die(_("failed to pack objects via path-walk"));
@@ -4687,7 +4689,6 @@ int cmd_pack_objects(int argc,
 		     struct repository *repo UNUSED)
 {
 	int use_internal_rev_list = 0;
-	int shallow = 0;
 	int all_progress_implied = 0;
 	struct strvec rp = STRVEC_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
@@ -4875,10 +4876,6 @@ int cmd_pack_objects(int argc,
 		warning(_("cannot use delta islands with --path-walk"));
 		path_walk = 0;
 	}
-	if (path_walk && shallow) {
-		warning(_("cannot use --shallow with --path-walk"));
-		path_walk = 0;
-	}
 	if (path_walk) {
 		strvec_push(&rp, "--boundary");
 		 /*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-03-24 15:23 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-12 21:01   ` Taylor Blau
2025-03-20 19:48     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-20 19:46     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-10  1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-12 21:21   ` Taylor Blau
2025-03-20 19:57     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-03-12 20:47   ` Taylor Blau
2025-03-20 20:18     ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).