about summary refs log tree commit homepage
path: root/lib/PublicInbox/MiscIdx.pm
DateCommit message (Collapse)
2023-03-25codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to deduplicate and handle repositories with common history. I initially wanted to shoehorn this into extindex, but decided a separate Xapian index layout capable of being EITHER external to handle many forks or internal (in $GIT_DIR/public-inbox-cindex) for small projects is the right way to go. Unlike most existing parts of public-inbox, this relies on absolute paths of $GIT_DIR stored in the Xapian DB and does not rely on the config file. We'll be relying on the config file to map absolute paths to public URL paths for WWW.
2022-10-24treewide: replace /^I: / prefix with /^# /
This is like more familiar to readers of TAP (Test Anywhere Protocol) output, as well as shell and Perl scripters which also use `#' for comments. AFAIK, nobody is parsing our stderr, and I'm not sure how standardized the `I:' prefix is (nor `W:' and `E:' are). It's already the prevailing style in Lei* code, too, so things have been moving in that direction for a bit.
2022-08-04miscidx: index inbox min/max article numbers
This will be used to speed up NNTP group listings and IMAP startup with thousands of inboxes.
2022-03-08index|extindex: support --dangerous flag
This enables Xapian::DB_DANGEROUS to support in-place updates. This can speed up the initial index and reduce I/O at the cost of preventing concurrent readers and being unsafe in the face of any abnormal terminations. This is more dangerous than --no-fsync. --no-fsync is only unsafe in the event of a power loss or kernel crash; --dangerous is unsafe even on SIGKILL.
2022-01-31rewrite Linux nodatacow use in pure Perl w/o system
btrfs is Linux-only at the moment (and likely to remain that way for practical purposes). So rely on Linux ABI stability and use the `syscall' and `ioctl' perlops rather than relying on Inline::C. Inline::C (and gcc||clang) are monstrous dependencies which we can't expect users to have. This makes supporting new architectures more difficult, but new architectures come along rarely and this reduces the burden for the majority of Linux users on popular architectures (while still avoiding the distribution of pre-built binaries). Link: https://public-inbox.org/meta/YbCPWGaJEkV6eWfo@codewreck.org/
2021-01-26miscidx: switch to lazy transactions
This fixes a sporadic failure on a 1/2 core VM where "git cat-file --batch" hasn't started up by the time $cleanup->() destroys the ALL.git directory in t/lei.t (but not t/lei-oneshot.t). This happens because dwaitpid() runs inside the event loop asynchronously and we were able to return to the client before the cat-file process could even start. I could not reproduce this failure on my usual 4-core workstation via "schedtool -a 0x1" to force the entire test to use a single core. Lazy transactions matches OverIdx and SearchIdx behavior, and I've verified this lets us avoid problems with old Xapian versions (on CentOS 7.x) which failed to set FD_CLOEXEC.
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-23miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-11-29extindex: support `--gc' to remove dead inboxes
Inboxes may be removed or newsgroups renamed over time. Introduce a switch to do garbage collection and eliminate stale search and xref3 results based on inboxes which remain in the config file. This may also fixup stale results leftover from any bugs which may leave stale data around. This is also useful in case a clumsy BOFH (me :P) is swapping between several PI_CONFIGs and accidentally indexed a bunch of inboxes they didn't intend to.
2020-11-24miscidx: store absolute git_dir of each epoch in docdata
This will make it possible to map reference repos in case somebody uses the feature.
2020-11-24miscidx: cleanup git processes after manifest indexing
We shouldn't leave "cat-file --batch" processes around when we're done with an epoch or inbox, since there could be many thousands.
2020-11-24miscidx: put grokmirror manifest entries in Xapian docdata
This should make it possible for us quickly generate manifest.js.gz files with less random I/O and process spawning in the WWW code.
2020-11-24miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.