path: root/lib/PublicInbox/CidxComm.pm
author: Eric Wong <e@80x24.org> 2023-04-22 10:33:42 +0000
committer: Eric Wong <e@80x24.org> 2023-04-22 19:33:08 +0000
commit: 10f31b26e010243ab919dbafeb6f95c6e30640e9
tree: fa793044a409f009c69d741bf34fb4ea42f9178a /lib/PublicInbox/CidxComm.pm
parent: c23413abbe1db6e96af4a028f69613d1340e880c

With my partial git.kernel.org mirror, this brings a full prune
down from ~75 minutes to under 5 minutes using git 2.19+.  This
speedup even applies to users on slow storage (rotational HDD).

First off, xapian-delve(1) is nearly 10x faster for dumping
boolean terms by prefix than the equivalent Perl code with
Xapian bindings.  This performance difference is critical since
we need to check over 5 million commits for pruning a partial
git.kernel.org mirror.

We can use sed(1) and sort(1) to massage delve output into
something suitable for the first comm(1) input.

For the second comm(1) input, the output of `git cat-file
--batch-check --batch-all-objects' against all indexed git repos
with awk(1) filtering provides the necessary output for
generating a list of indexed-but-no-longer-accessible commits.
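
As a rough illustration of the two comm(1) inputs described above, here
is a minimal sketch using stand-in data instead of real xapian-delve(1)
and git output.  The "Q" term prefix, the function names, and the
shortened OIDs are assumptions made for the example; this is not the
code CodeSearchIdx actually runs.

```shell
#!/bin/sh
# Sketch only: stand-in data replaces real xapian-delve and git output.
export LC_ALL=C	# sort(1) and comm(1) must agree on collation order

# first comm(1) input: strip an assumed "Q" term prefix, then sort
indexed_oids() { sed -e 's/^Q//' | sort -u; }

# second comm(1) input: keep commit OIDs from "<oid> <type> <size>" lines
commit_oids() { awk '$2 == "commit" { print $1 }' | sort -u; }

tmp=$(mktemp -d) || exit 1
printf 'Qcccc\nQaaaa\nQbbbb\n' | indexed_oids > "$tmp/indexed"
printf 'bbbb commit 230\ndddd blob 12\naaaa commit 245\n' |
	commit_oids > "$tmp/accessible"

# -23 suppresses columns 2 and 3, leaving lines unique to the first input
stale=$(comm -23 "$tmp/indexed" "$tmp/accessible")
rm -rf "$tmp"
echo "$stale"	# indexed-but-no-longer-accessible commits
```

Both inputs must be sorted under the same locale, which is why LC_ALL=C
is exported before any sort(1) or comm(1) call.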

sed(1) and awk(1) are POSIX standard tools which can be roughly
2x faster than equivalent Perl for simple filters, while
sort(1) is designed to handle larger-than-memory datasets
efficiently (unlike the `sort' perlop).

With slow storage and git <2.19, the switch to --batch-all-objects
actually results in a performance regression since having git
perform sorting results in worse disk locality than the previous
sequential iteration by Xapian docid.  git 2.19+ users with
`--unordered' support benefit from improved storage locality,
and the speedup from storage locality dwarfs the overhead of
an extra external sort(1) invocation.
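
The `--unordered | sort' combination can be sketched for a single
repository as follows.  The list_commits name is hypothetical, and the
sketch assumes git 2.19+ so that `--unordered' is accepted; it only
illustrates the shape of the pipeline, not the exact invocation used by
public-inbox.

```shell
#!/bin/sh
# Sketch only: assumes git 2.19+ (needed for --unordered).
export LC_ALL=C	# keep sort(1) output stable for a later comm(1)

# list_commits is a hypothetical helper: print the sorted commit OIDs
# of the repository at $1.  git emits objects in on-disk order
# (--unordered) and sort(1) restores the ordering comm(1) requires.
list_commits() {
	git -C "$1" cat-file --batch-check --batch-all-objects --unordered |
		awk '$2 == "commit" { print $1 }' |
		sort -u
}
```

Calling `list_commits /path/to/repo.git' prints one commit OID per
line; output from several repositories could be combined with
`sort -m -u'.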

Even with consumer-grade SATA-II SSDs, the combo of --unordered
and sort(1) provides a noticeable speedup since SSD latency
remains a factor for --batch-all-objects.

git <2.19 users must upgrade git to get acceptable performance
on slow storage and giant indexes, but git 2.19 was released
nearly 5 years ago so it's probably a reasonable requirement for
performance.

The only remaining downside of this change for all users is
the extra temporary disk space for sort(1) and comm(1);
but the speedup provided with git 2.19+ is well worth it.

Diffstat (limited to 'lib/PublicInbox/CidxComm.pm')
-rw-r--r-- lib/PublicInbox/CidxComm.pm | 28
1 files changed, 28 insertions, 0 deletions
diff --git a/lib/PublicInbox/CidxComm.pm b/lib/PublicInbox/CidxComm.pm
new file mode 100644
index 00000000..fb7be0aa
--- /dev/null
+++ b/lib/PublicInbox/CidxComm.pm
@@ -0,0 +1,28 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+#
+# Waits for initial comm(1) output for PublicInbox::CodeSearchIdx.
+# The initial output from `comm' can take a while to generate because
+# it needs to wait on:
+# `git cat-file --batch-all-objects --batch-check --unordered | sort'
+# We still rely on blocking reads, here, since comm should be fast once
+# it's seeing input.  (`--unordered | sort' is intentional for HDDs)
+package PublicInbox::CidxComm;
+use v5.12;
+use parent qw(PublicInbox::DS);
+use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT);
+
+sub new {
+        my ($cls, $rd, $cidx) = @_;
+        my $self = bless { cidx => $cidx }, $cls;
+        $self->SUPER::new($rd, EPOLLIN|EPOLLONESHOT);
+}
+
+sub event_step {
+        my ($self) = @_;
+        my $rd = $self->{sock} // return warn('BUG?: no {sock}');
+        $self->close; # PublicInbox::DS::close, deferred, so $sock is usable
+        delete($self->{cidx})->cidx_read_comm($rd);
+}
+
+1;