Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Victoria Dye <vdye@github.com>
To: Shuqi Liang <cheskaqiqi@gmail.com>, git@vger.kernel.org
Cc: gitster@pobox.com, derrickstolee@github.com
Subject: Re: [RFC][PATCH V1] diff-tree: integrate with sparse index
Date: Tue, 16 May 2023 14:19:04 -0700	[thread overview]
Message-ID: <fa24482c-7c48-9b7f-5d97-3dbf9822728c@github.com> (raw)
In-Reply-To: <20230515191836.674234-1-cheskaqiqi@gmail.com>

Shuqi Liang wrote:
> Remove full index requirement for `git diff-tree`. Add tests that verify
> that 'git diff-tree' behaves correctly when the sparse index is enabled
> and test to ensure the index is not expanded.
> 
> The `p2000` tests demonstrate a ~98% execution time reduction for
> 'git diff-tree' using a sparse index:
> 
> Test                                                before  after
> ------------------------------------------------------------------------
> 2000.94: git diff-tree HEAD (full-v3)                0.05   0.04 -20.0%
> 2000.95: git diff-tree HEAD (full-v4)                0.06   0.05 -16.7%
> 2000.96: git diff-tree HEAD (sparse-v3)              0.59   0.01 -98.3%
> 2000.97: git diff-tree HEAD (sparse-v4)              0.61   0.01 -98.4%
> 2000.98: git diff-tree HEAD -- f2/f4/a (full-v3)     0.05   0.05 +0.0%
> 2000.99: git diff-tree HEAD -- f2/f4/a (full-v4)     0.05   0.04 -20.0%
> 2000.100: git diff-tree HEAD -- f2/f4/a (sparse-v3)  0.58   0.01 -98.3%
> 2000.101: git diff-tree HEAD -- f2/f4/a (sparse-v4)  0.55   0.01 -98.2%

These performance results look great! This is generally what we'd expect,
too, since 'diff-tree' should be fast enough that index expansion is the
majority of its runtime.

> 
> Signed-off-by: Shuqi Liang <cheskaqiqi@gmail.com>
> ---
>  builtin/diff-tree.c                      |  4 ++
>  t/perf/p2000-sparse-operations.sh        |  2 +
>  t/t1092-sparse-checkout-compatibility.sh | 62 ++++++++++++++++++++++++
>  3 files changed, 68 insertions(+)
> 
> diff --git a/builtin/diff-tree.c b/builtin/diff-tree.c
> index 385c2d0230..c5d5730ebf 100644
> --- a/builtin/diff-tree.c
> +++ b/builtin/diff-tree.c
> @@ -121,6 +121,10 @@ int cmd_diff_tree(int argc, const char **argv, const char *prefix)
>  		usage(diff_tree_usage);
>  
>  	git_config(git_diff_basic_config, NULL); /* no "diff" UI options */
> +
> +	prepare_repo_settings(the_repository);
> +	the_repository->settings.command_requires_full_index = 0;

tl;dr: this does appear to be all you need to integrate the sparse index
with 'diff-tree', although there's an opportunity to clean up some
(seemingly) unused code to make that clearer.
 
Longer version: 

Looking at the documentation for 'diff-tree', it's not immediately obvious
why we'd need the index at all - it "[c]ompares the content and mode of
blobs found via two tree objects," per the command documentation. However,
the index is read in 'cmd_diff_tree' at two points, so to be reasonably
certain that this sparse index integration is correct we should verify that
the intended usage associated with those two reads will still work with a
sparse index.

Reading the index for attributes
================================
The first index read was added in fd66bcc31ff (diff-tree: read the index so
attribute checks work in bare repositories, 2017-12-06) to deal with reading
'.gitattributes' content. Per the 'gitattributes.txt' documentation:

"Git consults `$GIT_DIR/info/attributes` file (which has the highest
precedence), `.gitattributes` file in the same directory as the path in
question, and its parent directories up to the toplevel of the work
tree...When the `.gitattributes` file is missing from the work tree, the
path in the index is used as a fall-back."

However, 77efbb366ab (attr: be careful about sparse directories, 2021-09-08)
established that, in a sparse index, we do _not_ try to load a
'.gitattributes' file from within a sparse directory. Therefore, we don't
need to expand the index or change anything about reading attributes in
'diff-tree'. Good!

Reading the index for rename detection(?)
=========================================
The second one is read only on the condition that we're reading from stdin:

if (opt->diffopt.detect_rename) {
	if (!the_index.cache)
		repo_read_index(the_repository);
	opt->diffopt.setup |= DIFF_SETUP_USE_SIZE_CACHE;
}

This was initially added in f0c6b2a2fd9 ([PATCH] Optimize diff-tree -[CM]
--stdin, 2005-05-27), where 'setup' was set to 'DIFF_SETUP_USE_SIZE_CACHE |
DIFF_SETUP_USE_CACHE'. That assignment was later modified to drop the
'DIFF_SETUP_USE_CACHE' in ff7fe37b053 (diff.c: move read_index() code back
to the caller, 2018-08-13). 

However, 'DIFF_SETUP_USE_SIZE_CACHE' seems to be unused as of 6e0b8ed6d35
(diff.c: do not use a separate "size cache"., 2007-05-07) and nothing about
'detect_rename' otherwise indicates index usage, so AFAICT that whole
condition can be dropped (along with DIFF_SETUP_USE_SIZE_CACHE,
DIFF_SETUP_REVERSE, and diff_options.setup). Note that, if you want to make
that change in this series, it should be done in a separate patch _before_
this one (since dropping the deprecated setup infrastructure isn't really
part of the sparse index integration).

> +
>  	repo_init_revisions(the_repository, opt, prefix);
>  	if (repo_read_index(the_repository) < 0)
>  		die(_("index file corrupt"));
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> index 60d1de0662..14caf01718 100755
> --- a/t/perf/p2000-sparse-operations.sh
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -129,5 +129,7 @@ test_perf_on_all git grep --cached bogus -- "f2/f1/f1/*"
>  test_perf_on_all git write-tree
>  test_perf_on_all git describe --dirty
>  test_perf_on_all 'echo >>new && git describe --dirty'
> +test_perf_on_all git diff-tree HEAD
> +test_perf_on_all git diff-tree HEAD -- $SPARSE_CONE/a

These tests cover both the whole tree & a specific pathspec, looks good.

>  
>  test_done
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 0c784813f1..f08edcbf8e 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2108,4 +2108,66 @@ test_expect_success 'sparse-index is not expanded: write-tree' '
>  	ensure_not_expanded write-tree
>  '
>  
> +test_expect_success 'diff-tree' '
> +	init_repos &&
> +
> +	write_script edit-contents <<-\EOF &&
> +	echo text >>"$1"
> +	EOF
> +
> +	# Get the tree SHA for the current HEAD
> +	tree1=$(git -C sparse-index rev-parse HEAD^{tree}) &&
> +
> +	# make a change inside the sparse cone
> +	run_on_all ../edit-contents deep/a &&
> +	test_all_match git add deep/a &&
> +	test_all_match git commit -m "Change deep/a" &&
> +
> +	# Get the tree SHA for the new HEAD
> +	tree2=$(git -C sparse-index rev-parse HEAD^{tree}) &&

Commits with changes similar to what you create here here already exist in
the test repo; you can simplify the test by using them:

test_expect_success 'diff-tree' '
        init_repos &&

        # Test change inside sparse cone
        test_all_match git diff-tree HEAD update-deep &&
        test_all_match git diff-tree HEAD update-deep -- deep/a &&

        # Test change outside sparse cone
        test_all_match git diff-tree HEAD update-folder1 &&
        test_all_match git diff-tree HEAD update-folder1 -- folder1/a &&

	# Check that SKIP_WORKTREE files are not materialized
	test_path_is_missing sparse-checkout/folder1/a &&
	test_path_is_missing sparse-index/folder1/a
'

This also has the benefit of avoiding the creation of 'folder1/a' on disk in
the sparse-checkout test repos.

> +
> +

nit: extra newline should be removed

> +	test_all_match git diff-tree $tree1 $tree2 &&
> +	test_all_match git diff-tree HEAD &&
> +	test_all_match git diff-tree HEAD -- deep/a &&

You don't have a wildcard pathspec tested here, but I think that's okay in
this case; unlike e.g. 'git grep', there's no sparse index-related behavior
here that depends on the pathspec's contents.

> +
> +	# make a change outside the sparse cone
> +	run_on_all mkdir -p folder1 &&
> +	run_on_all cp a folder1/a &&
> +	run_on_all ../edit-contents folder1/a &&
> +	test_all_match git update-index folder1/a &&

'update-index' will work here, but I think using the more porcelain command
'add --sparse' might be better for demonstrating typical user behavior.

> +	test_all_match git commit -m "Change folder1/a" &&
> +
> +	# Get the tree SHA for the new HEAD
> +	tree3=$(git -C sparse-index rev-parse HEAD^{tree}) &&
> +
> +	test_all_match git diff-tree $tree1 $tree3 &&
> +	test_all_match git diff-tree $tree1 $tree3 -- folder1/a &&
> +	test_all_match git diff-tree HEAD &&
> +	test_all_match git diff-tree HEAD -- folder1/a &&
> +
> +	# check that SKIP_WORKTREE files are not materialized
> +	test_path_is_missing sparse-checkout/folder2/a &&
> +	test_path_is_missing sparse-index/folder2/a

At first I wasn't sure about whether these checks were necessary (we don't
'diff-tree' on any 'folder2/' pathspec), but this check verifies that we
don't materialize the files in the case with no pathspec. While that's
unlikely to happen, it doesn't hurt to have this test to be sure.

> +'
> +
> +test_expect_success 'sparse-index is not expanded: diff-tree' '
> +	init_repos &&
> +
> +	# Get the tree SHA for the current HEAD
> +	tree1=$(git -C sparse-index rev-parse HEAD^{tree}) &&

As with the previous test, you can use the 'update-deep' and
'update-folder1' branches to simplify here.

> +
> +	echo "test1" >>sparse-index/deep/a &&
> +	git -C sparse-index add deep/a &&
> +	git -C sparse-index commit -m "Change deep/a" &&
> +
> +	# Get the tree SHA for the new HEAD
> +	tree2=$(git -C sparse-index rev-parse HEAD^{tree}) &&
> +
> +	ensure_not_expanded diff-tree $tree1 $tree2 &&
> +	ensure_not_expanded diff-tree $tree1 $tree2 -- deep/a &&
> +	ensure_not_expanded diff-tree HEAD &&
> +	ensure_not_expanded diff-tree HEAD -- deep/a

Since the index won't expand regardless of whether you 'diff-tree'
a file inside or outside the cone, it'd be nice to have a test like:

ensure_not_expanded diff-tree update-folder1 &&
ensure_not_expanded diff-tree update-folder1 -- folder1/a

> +'
> +
>  test_done


  reply	other threads:[~2023-05-16 21:19 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-15 19:18 [RFC][PATCH V1] diff-tree: integrate with sparse index Shuqi Liang
2023-05-16 21:19 ` Victoria Dye [this message]
2023-05-16 23:14   ` Junio C Hamano
2023-05-17 18:47     ` Victoria Dye
2023-05-18 15:44 ` [PATCH v2] " Shuqi Liang
2023-05-22 19:07   ` Victoria Dye
2023-05-23  4:38     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fa24482c-7c48-9b7f-5d97-3dbf9822728c@github.com \
    --to=vdye@github.com \
    --cc=cheskaqiqi@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).