Git Mailing List Archive mirror
 help / color / mirror / Atom feed
From: Jeff Hostetler <git@jeffhostetler.com>
To: Junio C Hamano <gitster@pobox.com>,
	Anh Le via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Timothy Jones <timothy@canva.com>,
	Jeff Hostetler <jeffhost@microsoft.com>, Anh Le <anh@canva.com>
Subject: Re: [PATCH] index: add trace2 region for clear skip worktree
Date: Wed, 26 Oct 2022 10:13:18 -0400	[thread overview]
Message-ID: <d4103788-5153-11f2-487f-5cc795d261a8@jeffhostetler.com> (raw)
In-Reply-To: <xmqq35bbmfz6.fsf@gitster.g>



On 10/25/22 11:16 PM, Junio C Hamano wrote:
> "Anh Le via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Anh Le <anh@canva.com>
>>
>> In a large repository using sparse checkout, checking
>> whether a file marked with skip worktree is present
>> on disk and its skip worktree bit should be cleared
>> can take a considerable amount of time. Add a trace2
>> region to surface this information.
>>
>> Signed-off-by: Anh Le <anh@canva.com>
>> ---
>>      index: add trace2 region for clear skip worktree
>>      
>>      In large repository using sparse checkout, checking whether a file
>>      marked with skip worktree is present on disk and its skip worktree bit
>>      should be cleared can take a considerable amount of time. Add a trace2
>>      region to surface this information.
> 
> It is easy to see that the change is no-op from functionality's
> standpoint.  The condition under which ce->ce_flags loses the
> CE_SKIP_WORKTREE bit is the same as before, and the only change is
> that in the iteration a couple of variables are incremented, which
> may (or may not) have performance impact, but shouldn't break
> correctness.
> 
> I am not sure about the value of these counters, honestly.

I suggested adding the counters while the series was still over
on GitGitGadget.

The original goal was to measure the time spent in that loop
with the region_enter/region_leave events.  They will tell you
whether the loop was slow or not and you can compare that with
other (peer or containing) regions to see if it significant.

However, it doesn't tell us anything about WHY it was slow or
was slower than another instance.  The path_count gives us a
simple estimate on the number of lstat() calls -- which is
one of frequent sources of slowness.  Yes, it is bounded by
`istate->cache_nr`, but if I have a repo with 1M cache-entries
and 95% are sparse, this loop is going to be much slower than
if only 5% are sparse.

The restart_count (which we only expect to be 0 or 1), will
tell us whether or not at some point mid-loop we had to fully
inflate the index -- which is another source of slowness.
Granted, `ensure_full_index()` probably contains a region of
its own, so the restart_count here may not absolutely necessary,
but in the context of this loop -- this counter may point to the
other smoking gun.

In the worst case, we walk the entire index and lstat() for a
significant number of skipped-and-not-present files, then near
the end of the loop, we find a skipped-but-present directory
and have to restart the loop.  The second pass will still run
the full loop again.  Will the second pass actually see any
skipped cache-entries?  Will it re-lstat() them?  Could the
`goto restart` just be a `break` or `return`?

I haven't had time to look under the hood here, but I was
hoping that these two counters would help the series author
collect telemetry over many runs and gain more insight into
the perf problem.

Continuing the example from above, if we've already paid the
costs to lstat() the 95% sparse files AND THEN near the bottom
of the loop we have to do a restart, then we should expect
this loop to be doubly slow.  It was my hope that this combination
of counters would help us understand the variations in perf.

WRT the `intmax_t` vs just `int`: either is fine.  I suggested
the former because that is what `trace2_data_intmax()` takes.
I really don't expect any usage to overflow a regular 32bit int.

[...]
>> +	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count);
>> +	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count);

One thing I forgot to mention in my GGG suggestion was to only
emit these data events if the counter is > 0.  We've been trying
to avoid logging zeroes.

>> +	trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo);
[...]

Jeff

  reply	other threads:[~2022-10-26 14:13 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-26  0:05 [PATCH] index: add trace2 region for clear skip worktree Anh Le via GitGitGadget
2022-10-26  3:16 ` Junio C Hamano
2022-10-26 14:13   ` Jeff Hostetler [this message]
2022-10-26 16:01     ` Junio C Hamano
2022-10-26 18:29       ` Jeff Hostetler
2022-10-27  0:04         ` Anh Le
2022-10-28  0:46 ` [PATCH v2] " Anh Le via GitGitGadget
2022-10-28 15:49   ` Derrick Stolee
2022-10-28 17:17     ` Junio C Hamano
2022-10-30 23:28       ` Anh Le
2022-10-28 16:50   ` Jeff Hostetler
2022-10-31  0:56   ` [PATCH v3] " Anh Le via GitGitGadget
2022-10-31 22:34     ` Taylor Blau
2022-11-03 23:04     ` [PATCH v4 0/2] " Anh Le via GitGitGadget
2022-11-03 23:05       ` [PATCH v4 1/2] " Anh Le via GitGitGadget
2022-11-03 23:05       ` [PATCH v4 2/2] index: raise a bug if the index is materialised more than once Anh Le via GitGitGadget
2022-11-05  0:29       ` [PATCH v4 0/2] index: add trace2 region for clear skip worktree Taylor Blau
2022-11-07 20:50         ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d4103788-5153-11f2-487f-5cc795d261a8@jeffhostetler.com \
    --to=git@jeffhostetler.com \
    --cc=anh@canva.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=timothy@canva.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).