|
With my partial git.kernel.org mirror, this brings a full prune
down from ~75 minutes to under 5 minutes using git 2.19+. This
speedup even applies to users on slow storage (rotational HDD).
First off, xapian-delve(1) is nearly 10x faster for dumping
boolean terms by prefix than the equivalent Perl code with
Xapian bindings. This performance difference is critical since
we need to check over 5 million commits for pruning a partial
git.kernel.org mirror.
We can use sed(1) and sort(1) to massage delve output into
something suitable for the first comm(1) input.
For the second comm(1) input, the output of `git cat-file
--batch-check --batch-all-objects' against all indexed git repos
with awk(1) filtering provides the necessary output for
generating a list of indexed-but-no-longer accessible commits.
sed(1) and awk(1) are POSIX standard tools which can be roughly
2x faster than equivalent Perl for simple filters, while
sort(1) is designed to handle larger-than-memory datasets
efficiently (unlike the `sort' perlop).
With slow storage and git <2.19, the switch to --batch-all-objects
actually results in a performance regression since having git
perform sorting results in worse disk locality than the previous
sequential iteration by Xapian docid. git 2.19+ users with
`--unordered' support benefits from improved storage locality;
and speedups from storage locality dwarfs the extra overhead of
an extra external sort(1) invocation.
Even with consumer-grade SATA-II SSDs, the combo of --unordered
and sort(1) provides a noticeable speedup since SSD latency
remains a factor for --batch-all-objects.
git <2.19 users must upgrade git to get acceptable performance
on slow storage and giant indexes, but git 2.19 was released
nearly 5 years ago so it's probably a reasonable requirement for
performance.
The only remaining downside of this change for all users
the extra temporary disk space for sort(1) and comm(1);
but the speedup provided with git 2.19+ is well worth it.
|