path: root/lib/PublicInbox/CidxComm.pm
Date       Commit message
2023-11-13 cindex: delay associate until prune+indexing finish
Prune can get rid of invalid commits while indexing can add new candidates for association, so we don't dump coderepo roots for association until those are squared away. However, we can dump inbox info since we don't touch inboxes while -cindex is running.
2023-11-01 ds: make ->close behave like CORE::close
Matching existing Perl IO semantics seems like a good idea to reduce confusion in the future. We'll also fix some outdated comments and update indentation to match the rest of our code base since we're far detached from Danga::Socket at this point.
2023-04-22 cindex: rewrite prune (again) for speed
With my partial git.kernel.org mirror, this brings a full prune down from ~75 minutes to under 5 minutes using git 2.19+. This speedup even applies to users on slow storage (rotational HDD).

First off, xapian-delve(1) is nearly 10x faster for dumping boolean terms by prefix than the equivalent Perl code with Xapian bindings. This performance difference is critical since we need to check over 5 million commits for pruning a partial git.kernel.org mirror.

We can use sed(1) and sort(1) to massage delve output into something suitable for the first comm(1) input.

For the second comm(1) input, the output of `git cat-file --batch-check --batch-all-objects' against all indexed git repos with awk(1) filtering provides the necessary output for generating a list of indexed-but-no-longer-accessible commits.

sed(1) and awk(1) are POSIX standard tools which can be roughly 2x faster than equivalent Perl for simple filters, while sort(1) is designed to handle larger-than-memory datasets efficiently (unlike the `sort' perlop).

With slow storage and git <2.19, the switch to --batch-all-objects actually results in a performance regression since having git perform sorting results in worse disk locality than the previous sequential iteration by Xapian docid.

git 2.19+ users with `--unordered' support benefit from improved storage locality, and the speedups from storage locality dwarf the overhead of an extra external sort(1) invocation.

Even with consumer-grade SATA-II SSDs, the combo of --unordered and sort(1) provides a noticeable speedup since SSD latency remains a factor for --batch-all-objects.

git <2.19 users must upgrade git to get acceptable performance on slow storage and giant indexes, but git 2.19 was released nearly 5 years ago so it's probably a reasonable requirement for performance.

The only remaining downside of this change for all users is the extra temporary disk space for sort(1) and comm(1), but the speedup provided with git 2.19+ is well worth it.
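The comparison step described in this message can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in CidxComm.pm, which drives the external tools themselves. The function names `commits_from_batch_check` and `only_in_first` are hypothetical, the sample object IDs are made up, and the filter is an assumption about what the awk(1) step does, based on the default `<oid> <type> <size>` line format of `git cat-file --batch-check`. The merge is roughly what `comm -23` computes for two sorted inputs without duplicates.

```python
from typing import Iterable, Iterator


def commits_from_batch_check(lines: Iterable[str]) -> Iterator[str]:
    """Keep only commit object IDs from `git cat-file --batch-check` output.

    Each default-format line is "<oid> <type> <size>"; this stands in for
    the awk(1) filter (an assumption about what that filter does).
    """
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[1] == "commit":
            yield fields[0]


def only_in_first(first: Iterable[str], second: Iterable[str]) -> Iterator[str]:
    """Yield items of sorted `first` that are absent from sorted `second`.

    Both inputs must be sorted and free of duplicates. Neither input is
    held in memory as a whole, much like comm(1) reading two sorted files.
    """
    it = iter(second)
    cur = next(it, None)
    for item in first:
        # Advance the second stream until it catches up with `item`.
        while cur is not None and cur < item:
            cur = next(it, None)
        if cur != item:
            yield item


# Hypothetical data: commit IDs known to the index (first comm input)...
indexed = sorted(["aaa1", "bbb2", "ccc3"])

# ...and unordered object listings from the repos (second comm input).
batch_check = [
    "ccc3 commit 250",
    "ddd4 tree 100",
    "aaa1 commit 300",
    "eee5 blob 12",
]
accessible = sorted(commits_from_batch_check(batch_check))

# Indexed commits that no repo can provide anymore are prune candidates.
stale = list(only_in_first(indexed, accessible))
print(stale)
```

Sorting the second list after the fact mirrors the trade-off in the message: the objects arrive in whatever order is cheapest to read, and ordering is restored by a separate sort step before the comparison.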