dumping ground for random patches and texts
 help / color / mirror / Atom feed
* [PATCH 01/10] search: support per-inbox indexheader directive
@ 2024-05-16  0:12 Eric Wong
  2024-05-16  0:12 ` [PATCH 02/10] indexheader: deduplicate common values Eric Wong
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

This allows indexing arbitrary headers to allow filtering by
boolean terms or existing text rules.  Disabling RFC 2047
decoding is supported, as well.

This also refactors AltId support to rely on the same mechanisms
as the IndexHeader class for indexing, user help, and
Xapian::QueryParser setup via both bindings and external
XapHelper process to avoid adding complexity to Search.pm and
SearchIdx.pm.

We'll finally document altid support in public-inbox-config(5)
since we're in the area, as it's been a stable feature for many
years, now.
---
 Documentation/public-inbox-config.pod | 62 ++++++++++++++++++
 MANIFEST                              |  2 +
 lib/PublicInbox/AltId.pm              | 60 +++++++++--------
 lib/PublicInbox/Config.pm             |  2 +-
 lib/PublicInbox/IndexHeader.pm        | 73 +++++++++++++++++++++
 lib/PublicInbox/Search.pm             | 43 +++++++------
 lib/PublicInbox/SearchIdx.pm          | 34 +++++-----
 t/watch_indexheader.t                 | 92 +++++++++++++++++++++++++++
 8 files changed, 306 insertions(+), 62 deletions(-)
 create mode 100644 lib/PublicInbox/IndexHeader.pm
 create mode 100644 t/watch_indexheader.t

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index b4a1d94d..50746b21 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -172,6 +172,68 @@ link to the line numbers of blobs.
 
 Default: none
 
+=item publicinbox.<name>.altid
+
+Index by an alternative ID mechanism as a Xapian search prefix e.g.
+C<gmane:1234>.  This is useful to allow looking up legacy serial IDs
+(e.g. gmane article numbers).
+
+It must be specified in the form of
+C<serial:$USER_PREFIX:file=$SQLITE_FILENAME> where C<$USER_PREFIX> is a
+lowercase prefix like C<gmane> for search queries, and
+C<$SQLITE_FILENAME> is points to an SQLite DB.  C<$SQLITE_FILENAME> may
+be an absolute path or a path relative to C<INBOXDIR> for v2 inboxes or
+C<INBOXDIR/public-inbox> for v1 inboxes.
+
+The schema of C<$SQLITE_FILENAME> should be the same as a
+C<msgmap.sqlite3>.  See C<scripts/xhdr-num2mid> in the public-inbox
+source tree for an example of how to generate such a mapping from
+via NNTP.
+
+This is a noop with C<indexlevel=basic>
+
+Default: none
+
+=item publicinbox.<name>.indexheader
+
+Supports indexing of arbitrary mail headers in Xapian.
+
+It must be specified in the form of
+C<$TYPE:$USER_PREFIX:$MAIL_HEADER:$PARAMS>
+where C<$TYPE> determines how it's indexed and queried;
+C<$USER_PREFIX> is a lowercase prefix for search queries,
+C<$MAIL_HEADER> is the header to index (e.g. C<X-Label>),
+C<$PARAMS> is a URL-style query string for optional parameters.
+
+Valid C<$TYPE> values (in ascending order of storage cost) are as follows:
+
+* C<boolean_term> - index for simple filtering (not sortable by relevance)
+
+* C<text> - add frequency information to allow sorting by relevance
+
+* C<phrase> - add positional information to match sentences or phrases
+
+In other words: C<phrase> forces indexing of a particular header to
+behave like it used C<indexlevel=full>; while C<text> indexes as if
+that header used C<indexlevel=medium>.
+
+Valid keys in C<$PARAMS> include:
+
+* raw - do not perform RFC2047 decoding of headers
+
+Example:
+
+	[publicinbox "foo"]
+		indexheader = boolean_term:xlabel:X-Label:raw=1
+
+Support for other parameters is not finalized and subject to change.
+
+This is a noop with C<indexlevel=basic>
+
+New in public-inbox 2.0.0 (PENDING)
+
+Default: none
+
 =item publicinbox.<name>.replyto
 
 May be used to control how reply instructions in the PSGI
diff --git a/MANIFEST b/MANIFEST
index fb175e5f..87db7276 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -228,6 +228,7 @@ lib/PublicInbox/In3Watch.pm
 lib/PublicInbox/Inbox.pm
 lib/PublicInbox/InboxIdle.pm
 lib/PublicInbox/InboxWritable.pm
+lib/PublicInbox/IndexHeader.pm
 lib/PublicInbox/Inotify.pm
 lib/PublicInbox/Inotify3.pm
 lib/PublicInbox/InputPipe.pm
@@ -628,6 +629,7 @@ t/v2writable.t
 t/view.t
 t/watch_filter_rubylang.t
 t/watch_imap.t
+t/watch_indexheader.t
 t/watch_maildir.t
 t/watch_maildir_v2.t
 t/watch_mh.t
diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index 80757ceb..bd6cf973 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2016-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
 # Used for giving serial numbers to messages.  This can be tied to
@@ -10,25 +10,20 @@
 # it leads to reliance on centralization.  However, being able
 # to use existing serial numbers is beneficial.
 package PublicInbox::AltId;
-use strict;
-use warnings;
-use URI::Escape qw(uri_unescape);
-use PublicInbox::Msgmap;
+use v5.12;
+use parent qw(PublicInbox::IndexHeader);
 
 # spec: TYPE:PREFIX:param1=value1&param2=value2&...
 # The PREFIX will be a searchable boolean prefix in Xapian
 # Example: serial:gmane:file=/path/to/altmsgmap.sqlite3
 sub new {
 	my ($class, $ibx, $spec, $writable) = @_;
-	my ($type, $prefix, $query) = split(/:/, $spec, 3);
-	$type eq 'serial' or die "non-serial not supported, yet\n";
-	$prefix =~ /\A\w+\z/ or warn "non-word prefix not searchable\n";
-	my %params = map {
-		my ($k, $v) = split(/=/, uri_unescape($_), 2);
-		$v = '' unless defined $v;
-		($k, $v);
-	} split(/[&;]/, $query);
-	my $f = $params{file} or die "file: required for $type spec $spec\n";
+	my ($type, $pfx, $query) = split /:/, $spec, 3;
+	$type eq 'serial' or die "E: non-serial not supported, yet ($spec)\n";
+	my $self = bless {}, $class;
+	my $params = $self->extra_indexer_new_common($spec, $pfx, $query);
+	my $f = delete $params->{file} or
+		die "E: file= required for $type spec $spec\n";
 	unless (index($f, '/') == 0) {
 		if ($ibx->version == 1) {
 			$f = "$ibx->{inboxdir}/public-inbox/$f";
@@ -36,26 +31,37 @@ sub new {
 			$f = "$ibx->{inboxdir}/$f";
 		}
 	}
-	bless {
-		filename => $f,
-		writable => $writable,
-		prefix => $prefix,
-		xprefix => 'X'.uc($prefix),
-	}, $class;
+	my @k = keys %$params;
+	warn "W: unknown params in `$spec': ", join(', ', @k), "\n" if @k;
+	$self->{filename} = $f;
+	$self->{writable} = $writable if $writable;
+	$self;
 }
 
-sub mm_alt {
+sub mm_alt ($) {
 	my ($self) = @_;
 	$self->{mm_alt} ||= eval {
-		my $f = $self->{filename};
-		my $writable = $self->{writable};
-		PublicInbox::Msgmap->new_file($f, $writable);
+		require PublicInbox::Msgmap;
+		PublicInbox::Msgmap->new_file(@$self{qw(filename writable)});
 	};
 }
 
-sub mid2alt {
-	my ($self, $mid) = @_;
-	$self->mm_alt->num_for($mid);
+sub index_extra { # for PublicInbox::SearchIdx
+	my ($self, $sidx, $eml, $mids) = @_;
+	for my $mid (@$mids) {
+		my $id = mm_alt($self)->num_for($mid) // next;
+		$sidx->index_boolean_term($self->{xprefix}, $id);
+	}
 }
 
+sub user_help { # for PublicInbox::Search
+	my ($self) = @_;
+	("$self->{prefix}:", <<EOF);
+alternate serial number  e.g. $self->{prefix}:12345 (boolean)
+EOF
+}
+
+# callback for PublicInbox::Search
+sub query_parser_method { 'add_boolean_prefix' }
+
 1;
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 49659a2e..641e2925 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -481,7 +481,7 @@ sub _fill_ibx {
 	# more things to encourage decentralization
 	for my $k (qw(address altid nntpmirror imapmirror
 			coderepo hide listid url
-			infourl watchheader
+			infourl watchheader indexheader
 			nntpserver imapserver pop3server)) {
 		my $v = $self->{"$pfx.$k"} // next;
 		$ibx->{$k} = _array($v);
diff --git a/lib/PublicInbox/IndexHeader.pm b/lib/PublicInbox/IndexHeader.pm
new file mode 100644
index 00000000..53e9373b
--- /dev/null
+++ b/lib/PublicInbox/IndexHeader.pm
@@ -0,0 +1,73 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# allow searching on arbitrary headers as text
+package PublicInbox::IndexHeader;
+use v5.12;
+use URI::Escape qw(uri_unescape);
+
+my %T2IDX = ( # map to PublicInbox::SearchIdx methods
+	phrase => 'index_phrase1',
+	boolean_term => 'index_boolean_term',
+	text => 'index_text1',
+);
+
+# also called by AltId->new
+sub extra_indexer_new_common ($$$$) {
+	my ($self, $spec, $pfx, $query) = @_;
+	$pfx =~ /\A[a-z][a-z0-9]*\z/ or
+		warn "W: non-word prefix in `$spec' not searchable\n";
+	$self->{prefix} = $pfx;
+	my %params = map {
+		my ($k, $v) = split /=/, uri_unescape($_), 2;
+		($k, $v // '');
+	} split /[&;]/, $query // '';
+	my $xpfx = delete($params{index_prefix}) // "X\U$pfx";
+	$xpfx =~ /\A[A-Z][A-Z0-9]*\z/ or die
+		die "E: `index_prefix' in `$spec' must be ALL CAPS\n";
+	$self->{xprefix} = $xpfx;
+	\%params;
+}
+
+sub new {
+	my ($cls, $ibx, $spec) = @_;
+	my ($type, $pfx, $header, $query) = split /:/, $spec, 4;
+	$pfx // die "E: `$spec' has no user prefix\n";
+	$header // die "E: `$spec' has no mail header\n";
+	my $self = bless { header => $header, type => $type }, $cls;
+	my $params = extra_indexer_new_common $self, $spec, $pfx, $query;
+	$self->{hdr_method} = delete $params->{raw} ? 'header_raw' : 'header';
+	my @k = keys %$params;
+	warn "W: unknown params in `$spec': ", join(', ', @k), "\n" if @k;
+	$T2IDX{$type} // die
+		"E: `$type' not supported in $spec, must be one of: ",
+		join(', ', sort keys %T2IDX), "\n";
+	$self;
+}
+
+sub index_extra { # for PublicInbox::SearchIdx
+	my ($self, $sidx, $eml, $mids) = @_;
+	my $idx_method = $self->{-idx_method} //= $T2IDX{$self->{type}};
+	my $hdr_method = $self->{hdr_method};
+	for my $val ($eml->$hdr_method($self->{header})) {
+		$sidx->$idx_method($self->{xprefix}, $val);
+	}
+}
+
+sub user_help { # for PublicInbox::Search
+	my ($self) = @_;
+	("$self->{prefix}:", <<EOF);
+the `$self->{header}' mail header  e.g. $self->{prefix}:stable
+EOF
+}
+
+my %TYPE_2_QPMETHOD = (
+	phrase => 'add_prefix',
+	boolean_term => 'add_boolean_prefix',
+	text => 'add_prefix',
+);
+
+# callback for PublicInbox::Search
+sub query_parser_method { $TYPE_2_QPMETHOD{$_[0]->{type}} }
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index e5c5d6ab..84bc3e75 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -289,13 +289,25 @@ sub xdb ($) {
 	};
 }
 
+sub load_extra_indexers ($$) {
+	my ($self, $ibx) = @_;
+	my @extra;
+	for my $f (qw(IndexHeader AltId)) {
+		my $specs = $ibx->{lc $f} // next;
+		my $cls = "PublicInbox::$f";
+		eval "require $cls" or die $@;
+		push @extra, map { $cls->new($ibx, $_) } @$specs;
+	}
+	$self->{-extra} = \@extra if @extra;
+}
+
 sub new {
 	my ($class, $ibx) = @_;
 	ref $ibx or die "BUG: expected PublicInbox::Inbox object: $ibx";
 	my $xap = $ibx->version > 1 ? 'xap' : 'public-inbox/xapian';
 	my $xpfx = "$ibx->{inboxdir}/$xap".SCHEMA_VERSION;
 	my $self = bless { xpfx => $xpfx }, $class;
-	$self->{altid} = $ibx->{altid} if defined($ibx->{altid});
+	$self->load_extra_indexers($ibx);
 	$self;
 }
 
@@ -436,6 +448,8 @@ sub xhc_start_maybe (@) {
 	$xhc;
 }
 
+my %QPMETHOD_2_SYM = (add_prefix => ':', add_boolean_prefix => '=');
+
 sub xh_opt ($$) {
 	my ($self, $opt) = @_;
 	my $lim = $opt->{limit} || 50;
@@ -461,9 +475,9 @@ sub xh_opt ($$) {
 	push @ret, '-O', $opt->{eidx_key} if defined $opt->{eidx_key};
 	my $apfx = $self->{-alt_pfx} //= do {
 		my @tmp;
-		for (grep /\Aserial:/, @{$self->{altid} // []}) {
-			my (undef, $pfx) = split /:/, $_;
-			push @tmp, '-Q', "$pfx=X\U$pfx";
+		for my $x (@{$self->{-extra} // []}) {
+			my $sym = $QPMETHOD_2_SYM{$x->query_parser_method};
+			push @tmp, '-Q', $x->{prefix}.$sym.$x->{xprefix};
 		}
 		# TODO: arbitrary header indexing goes here
 		\@tmp;
@@ -590,21 +604,12 @@ sub qparse_new {
 		$qp->add_boolean_prefix($name, $_) foreach split(/ /, $prefix);
 	}
 
-	# we do not actually create AltId objects,
-	# just parse the spec to avoid the extra DB handles for now.
-	if (my $altid = $self->{altid}) {
+	if (my $extra = $self->{-extra}) {
 		my $user_pfx = $self->{-user_pfx} = [];
-		for (@$altid) {
-			# $_ = 'serial:gmane:/path/to/gmane.msgmap.sqlite3'
-			# note: Xapian supports multibyte UTF-8, /^[0-9]+$/,
-			# and '_' with prefixes matching \w+
-			/\Aserial:(\w+):/ or next;
-			my $pfx = $1;
-			push @$user_pfx, "$pfx:", <<EOF;
-alternate serial number  e.g. $pfx:12345 (boolean)
-EOF
-			# gmane => XGMANE
-			$qp->add_boolean_prefix($pfx, 'X'.uc($pfx));
+		for my $x (@$extra) {
+			push @$user_pfx, $x->user_help;
+			my $m = $x->query_parser_method;
+			$qp->$m(@$x{qw(prefix xprefix)});
 		}
 		chomp @$user_pfx;
 	}
@@ -651,7 +656,7 @@ EOM
 
 sub help {
 	my ($self) = @_;
-	$self->{qp} //= $self->qparse_new; # parse altids
+	$self->{qp} //= $self->qparse_new; # parse altids + indexheaders
 	my @ret = @HELP;
 	if (my $user_pfx = $self->{-user_pfx}) {
 		push @ret, @$user_pfx;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 4fd493d9..b2576e52 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -52,11 +52,6 @@ sub new {
 	my $inboxdir = $ibx->{inboxdir};
 	my $version = $ibx->version;
 	my $indexlevel = 'full';
-	my $altid = $ibx->{altid};
-	if ($altid) {
-		require PublicInbox::AltId;
-		$altid = [ map { PublicInbox::AltId->new($ibx, $_); } @$altid ];
-	}
 	if ($ibx->{indexlevel}) {
 		if ($ibx->{indexlevel} =~ $INDEXLEVELS) {
 			$indexlevel = $ibx->{indexlevel};
@@ -69,7 +64,7 @@ sub new {
 	my $self = PublicInbox::Search->new($ibx);
 	bless $self, $class;
 	$self->{ibx} = $ibx;
-	$self->{-altid} = $altid;
+	$self->load_extra_indexers($ibx);
 	$self->{indexlevel} = $indexlevel;
 	$self->{-set_indexlevel_once} = 1 if $indexlevel eq 'medium';
 	if ($ibx->{-skip_docdata}) {
@@ -184,6 +179,22 @@ sub index_phrase ($$$$) {
 	$self->{term_generator}->increase_termpos;
 }
 
+sub index_phrase1 { # called by various ->index_extra
+	my ($self, $pfx, $text) = @_;
+	index_phrase $self, $text, 1, $pfx;
+}
+
+sub index_text1 { # called by various ->index_extra
+	my ($self, $pfx, $text) = @_;
+	$self->{term_generator}->index_text_without_positions($text, 1, $pfx);
+}
+
+sub index_boolean_term { # called by various ->index_extra
+	my ($self, $pfx, $term) = @_;
+	my $doc = $self->{term_generator}->get_document;
+	$doc->add_boolean_term($pfx.$term);
+}
+
 sub index_text ($$$$) {
 	my ($self, $text, $wdf_inc, $prefix) = @_;
 
@@ -481,15 +492,8 @@ sub eml2doc ($$$;$) {
 		$doc->set_data($data);
 	}
 
-	if (my $altid = $self->{-altid}) {
-		foreach my $alt (@$altid) {
-			my $pfx = $alt->{xprefix};
-			foreach my $mid (@$mids) {
-				my $id = $alt->mid2alt($mid);
-				next unless defined $id;
-				$doc->add_boolean_term($pfx . $id);
-			}
-		}
+	for my $extra (@{$self->{-extra} // []}) {
+		$extra->index_extra($self, $eml, $mids);
 	}
 	$doc;
 }
diff --git a/t/watch_indexheader.t b/t/watch_indexheader.t
new file mode 100644
index 00000000..e815fca9
--- /dev/null
+++ b/t/watch_indexheader.t
@@ -0,0 +1,92 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use v5.12;
+use autodie;
+use PublicInbox::TestCommon;
+use PublicInbox::Eml;
+use PublicInbox::Emergency;
+use PublicInbox::IO qw(write_file);
+use PublicInbox::InboxIdle;
+use PublicInbox::Inbox;
+use PublicInbox::DS;
+use PublicInbox::Config;
+require_mods(qw(DBD::SQLite Xapian));
+my $tmpdir = tmpdir;
+my $config = "$tmpdir/pi_config";
+local $ENV{PI_CONFIG} = $config;
+delete local $ENV{PI_DIR};
+my @V = (1);
+my @creat_opt = (indexlevel => 'medium', sub {});
+my $v1 = create_inbox 'v1', tmpdir => "$tmpdir/v1", @creat_opt;
+my $fh = write_file '>', $config, <<EOM;
+[publicinbox "v1"]
+	inboxdir = $v1->{inboxdir}
+	address = v1\@example.com
+	watch = maildir:$tmpdir/v1-md
+	indexheader = boolean_term:xarchiveshash:X-Archives-Hash
+EOM
+
+SKIP: {
+	require_git(v2.6, 1);
+	push @V, 2;
+	my $v2 = create_inbox 'v2', tmpdir => "$tmpdir/v2", @creat_opt;
+	print $fh <<EOM;
+[publicinbox "v2"]
+	inboxdir = $tmpdir/v2
+	address = v2\@example.com
+	watch = maildir:$tmpdir/v2-md
+	indexheader = boolean_term:xarchiveshash:X-Archives-Hash
+EOM
+}
+close $fh;
+my $cfg = PublicInbox::Config->new;
+for my $v (@V) { for ('', qw(cur new tmp)) { mkdir "$tmpdir/v$v-md/$_" } }
+my $wm = start_script([qw(-watch)]);
+my $h1 = 'deadbeef' x 4;
+my @em = map {
+	my $v = $_;
+	my $em = PublicInbox::Emergency->new("$tmpdir/v$v-md");
+	$em->prepare(\(PublicInbox::Eml->new(<<EOM)->as_string));
+From: x\@example.com
+Message-ID: <i-1$v\@example.com>
+To: <v$v\@example.com>
+Date: Sat, 02 Oct 2010 00:00:00 +0000
+X-Archives-Hash: $h1
+
+EOM
+	$em;
+} @V;
+
+my $delivered = 0;
+my $cb = sub {
+	diag "message delivered to `$_[0]->{name}'";
+	++$delivered;
+};
+PublicInbox::DS->Reset;
+my $ii = PublicInbox::InboxIdle->new($cfg);
+my $obj = bless \$cb, 'PublicInbox::TestCommon::InboxWakeup';
+$cfg->each_inbox(sub { $_[0]->subscribe_unlock('ident', $obj) });
+local @PublicInbox::DS::post_loop_do = (sub { $delivered != @V });
+$_->commit for @em;
+diag 'waiting for -watch to import new message(s)';
+PublicInbox::DS::event_loop();
+$wm->join('TERM');
+$ii->close;
+
+$cfg->each_inbox(sub {
+	my ($ibx) = @_;
+	my $srch = $ibx->search;
+	my $mset = $srch->mset('xarchiveshash:miss');
+	is($mset->size, 0, 'got xarchiveshash:miss non-result');
+	$mset = $srch->mset("xarchiveshash:$h1");
+	is($mset->size, 1, 'got xarchiveshash: hit result') or return;
+	my $num = $srch->mset_to_artnums($mset);
+	my $eml = $ibx->smsg_eml($ibx->over->get_art($num->[0]));
+	is($eml->header_raw('X-Archives-Hash'), $h1,
+		'stored message with X-Archives-Hash');
+	my @opt = $srch->xh_opt;
+	is $opt[-2], '-Q', 'xap_helper -Q switch';
+	is $opt[-1], 'xarchiveshash=XXARCHIVESHASH', 'xap_helper -Q arg';
+});
+
+done_testing;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 02/10] indexheader: deduplicate common values
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 03/10] search: help: avoid ':' in user prefixes Eric Wong
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

Since we plan on sharing IndexHeader across multiple inboxes for
large installations with thousands of inboxes, it makes sense to
deduplicate the values to save some memory at the cost of
increased startup time.
---
 lib/PublicInbox/IndexHeader.pm | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/IndexHeader.pm b/lib/PublicInbox/IndexHeader.pm
index 53e9373b..07827959 100644
--- a/lib/PublicInbox/IndexHeader.pm
+++ b/lib/PublicInbox/IndexHeader.pm
@@ -17,7 +17,8 @@ sub extra_indexer_new_common ($$$$) {
 	my ($self, $spec, $pfx, $query) = @_;
 	$pfx =~ /\A[a-z][a-z0-9]*\z/ or
 		warn "W: non-word prefix in `$spec' not searchable\n";
-	$self->{prefix} = $pfx;
+	my %dedupe = ($pfx => undef);
+	($self->{prefix}) = keys %dedupe;
 	my %params = map {
 		my ($k, $v) = split /=/, uri_unescape($_), 2;
 		($k, $v // '');
@@ -25,7 +26,8 @@ sub extra_indexer_new_common ($$$$) {
 	my $xpfx = delete($params{index_prefix}) // "X\U$pfx";
 	$xpfx =~ /\A[A-Z][A-Z0-9]*\z/ or die
 		die "E: `index_prefix' in `$spec' must be ALL CAPS\n";
-	$self->{xprefix} = $xpfx;
+	%dedupe = ($xpfx => undef);
+	($self->{xprefix}) = keys %dedupe;
 	\%params;
 }
 
@@ -34,14 +36,18 @@ sub new {
 	my ($type, $pfx, $header, $query) = split /:/, $spec, 4;
 	$pfx // die "E: `$spec' has no user prefix\n";
 	$header // die "E: `$spec' has no mail header\n";
+	$T2IDX{$type} // die
+		"E: `$type' not supported in $spec, must be one of: ",
+		join(', ', sort keys %T2IDX), "\n";
+	my %dedupe = ($type => undef);
+	($type) = keys %dedupe;
+	%dedupe = ($header => undef);
+	($header) = keys %dedupe;
 	my $self = bless { header => $header, type => $type }, $cls;
 	my $params = extra_indexer_new_common $self, $spec, $pfx, $query;
 	$self->{hdr_method} = delete $params->{raw} ? 'header_raw' : 'header';
 	my @k = keys %$params;
 	warn "W: unknown params in `$spec': ", join(', ', @k), "\n" if @k;
-	$T2IDX{$type} // die
-		"E: `$type' not supported in $spec, must be one of: ",
-		join(', ', sort keys %T2IDX), "\n";
 	$self;
 }
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 03/10] search: help: avoid ':' in user prefixes
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
  2024-05-16  0:12 ` [PATCH 02/10] indexheader: deduplicate common values Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 04/10] config: dedupe ibx->{newsgroup} Eric Wong
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

The non-':'-suffixed variation of the string is already used as
hash keys and literals elsewhere.  Theoretically, a Perl
implementation can save some allocations this way (though Perl 5
currently doesn't).

In any case, we'll introduce a help2txt method to allow sharing
code between the callers in WwwText and Documentation/common.perl
---
 Documentation/common.perl      | 22 ++---------
 lib/PublicInbox/AltId.pm       |  2 +-
 lib/PublicInbox/IndexHeader.pm |  2 +-
 lib/PublicInbox/Isearch.pm     |  2 +-
 lib/PublicInbox/Search.pm      | 71 ++++++++++++++++++++--------------
 lib/PublicInbox/WwwText.pm     | 25 +-----------
 6 files changed, 49 insertions(+), 75 deletions(-)

diff --git a/Documentation/common.perl b/Documentation/common.perl
index 3a6617c4..53bae495 100755
--- a/Documentation/common.perl
+++ b/Documentation/common.perl
@@ -3,7 +3,7 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use Fcntl qw(SEEK_SET);
-my $have_search = eval { require PublicInbox::Search; 1 };
+use PublicInbox::Search;
 my $addr = 'meta@public-inbox.org';
 for my $pod (@ARGV) {
 	open my $fh, '+<', $pod or die "open($pod): $!";
@@ -27,7 +27,7 @@ L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>
 
 =head1$1
 		!ms;
-	$have_search and $s =~ s!^=for\scomment\n
+	$s =~ s!^=for\scomment\n
 			^AUTO-GENERATED-SEARCH-TERMS-BEGIN\n
 			.+?
 			^=for\scomment\n
@@ -46,23 +46,7 @@ L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>
 }
 
 sub search_terms {
-	my $help = eval('\@PublicInbox::Search::HELP');
-	my $s = '';
-	my $pad = 0;
-	my $i;
-	for ($i = 0; $i < @$help; $i += 2) {
-		my $pfx = $help->[$i];
-		my $n = length($pfx);
-		$pad = $n if $n > $pad;
-		$s .= $pfx . "\0";
-		$s .= $help->[$i + 1];
-		$s .= "\f\n";
-	}
-	$pad += 2;
-	my $padding = ' ' x ($pad + 4);
-	$s =~ s/^/$padding/gms;
-	$s =~ s/^$padding(\S+)\0/"    $1".(' ' x ($pad - length($1)))/egms;
-	$s =~ s/\f\n/\n/gs;
+	my $s = PublicInbox::Search::help2txt(@PublicInbox::Search::HELP);
 	$s =~ s/^  //gms;
 	substr($s, 0, 0, "=for comment\nAUTO-GENERATED-SEARCH-TERMS-BEGIN\n\n");
 	$s .= "\n=for comment\nAUTO-GENERATED-SEARCH-TERMS-END\n";
diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index bd6cf973..76dc23e6 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -56,7 +56,7 @@ sub index_extra { # for PublicInbox::SearchIdx
 
 sub user_help { # for PublicInbox::Search
 	my ($self) = @_;
-	("$self->{prefix}:", <<EOF);
+	($self->{prefix}, <<EOF);
 alternate serial number  e.g. $self->{prefix}:12345 (boolean)
 EOF
 }
diff --git a/lib/PublicInbox/IndexHeader.pm b/lib/PublicInbox/IndexHeader.pm
index 07827959..a67080f9 100644
--- a/lib/PublicInbox/IndexHeader.pm
+++ b/lib/PublicInbox/IndexHeader.pm
@@ -62,7 +62,7 @@ sub index_extra { # for PublicInbox::SearchIdx
 
 sub user_help { # for PublicInbox::Search
 	my ($self) = @_;
-	("$self->{prefix}:", <<EOF);
+	($self->{prefix}, <<EOF);
 the `$self->{header}' mail header  e.g. $self->{prefix}:stable
 EOF
 }
diff --git a/lib/PublicInbox/Isearch.pm b/lib/PublicInbox/Isearch.pm
index 20808d6d..9566f710 100644
--- a/lib/PublicInbox/Isearch.pm
+++ b/lib/PublicInbox/Isearch.pm
@@ -131,7 +131,7 @@ sub mset_to_smsg {
 
 sub has_threadid { 1 }
 
-sub help { $_[0]->{es}->help }
+sub help_txt { $_[0]->{es}->help_txt }
 
 sub xh_args { # prep getopt args to feed to xap_helper.h socket
 	my ($self, $opt) = @_; # TODO uid_range
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 84bc3e75..385b35f8 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -187,33 +187,33 @@ my %prob_prefix = (
 # especially since we don't offer boolean searches for To/Cc/From
 # headers, either
 our @HELP = (
-	's:' => 'match within Subject  e.g. s:"a quick brown fox"',
-	'd:' => <<EOF,
+	s => 'match within Subject  e.g. s:"a quick brown fox"',
+	d => <<EOF,
 match date-time range, git "approxidate" formats supported
 Open-ended ranges such as `d:last.week..' and
 `d:..2.days.ago' are supported
 EOF
-	'b:' => 'match within message body, including text attachments',
-	'nq:' => 'match non-quoted text within message body',
-	'q:' => 'match quoted text within message body',
-	'n:' => 'match filename of attachment(s)',
-	't:' => 'match within the To header',
-	'c:' => 'match within the Cc header',
-	'f:' => 'match within the From header',
-	'a:' => 'match within the To, Cc, and From headers',
-	'tc:' => 'match within the To and Cc headers',
-	'l:' => 'match contents of the List-Id header',
-	'bs:' => 'match within the Subject and body',
-	'dfn:' => 'match filename from diff',
-	'dfa:' => 'match diff removed (-) lines',
-	'dfb:' => 'match diff added (+) lines',
-	'dfhh:' => 'match diff hunk header context (usually a function name)',
-	'dfctx:' => 'match diff context lines',
-	'dfpre:' => 'match pre-image git blob ID',
-	'dfpost:' => 'match post-image git blob ID',
-	'dfblob:' => 'match either pre or post-image git blob ID',
-	'patchid:' => "match `git patch-id --stable' output",
-	'rt:' => <<EOF,
+	b => 'match within message body, including text attachments',
+	nq => 'match non-quoted text within message body',
+	q => 'match quoted text within message body',
+	n => 'match filename of attachment(s)',
+	t => 'match within the To header',
+	c => 'match within the Cc header',
+	f => 'match within the From header',
+	a => 'match within the To, Cc, and From headers',
+	tc => 'match within the To and Cc headers',
+	l => 'match contents of the List-Id header',
+	bs => 'match within the Subject and body',
+	dfn => 'match filename from diff',
+	dfa => 'match diff removed (-) lines',
+	dfb => 'match diff added (+) lines',
+	dfhh => 'match diff hunk header context (usually a function name)',
+	dfctx => 'match diff context lines',
+	dfpre => 'match pre-image git blob ID',
+	dfpost => 'match post-image git blob ID',
+	dfblob => 'match either pre or post-image git blob ID',
+	patchid => "match `git patch-id --stable' output",
+	rt => <<EOF,
 match received time, like `d:' if sender's clock was correct
 EOF
 );
@@ -654,14 +654,27 @@ EOM
 	$ret .= "}\n";
 }
 
-sub help {
+sub help2txt (@) { # also used by Documentation/common.perl
+	my @help = @_;
+	my $pad = 0;
+	my $htxt = '';
+	while (defined(my $pfx = shift @help)) {
+		my $n = length($pfx) + 1;
+		$pad = $n if $n > $pad;
+		$htxt .= $pfx . ":\0" . shift(@help) . "\f\n";
+	}
+	$pad += 2;
+	my $padding = ' ' x ($pad + 4);
+	$htxt =~ s/^/$padding/gms;
+	$htxt =~ s/^$padding(\S+)\0/"    $1".(' ' x ($pad - length($1)))/egms;
+	$htxt =~ s/\f\n/\n/gs;
+	$htxt;
+}
+
+sub help_txt {
 	my ($self) = @_;
 	$self->{qp} //= $self->qparse_new; # parse altids + indexheaders
-	my @ret = @HELP;
-	if (my $user_pfx = $self->{-user_pfx}) {
-		push @ret, @$user_pfx;
-	}
-	\@ret;
+	help2txt(@HELP, @{$self->{-user_pfx} // []});
 }
 
 # always returns a scalar value
diff --git a/lib/PublicInbox/WwwText.pm b/lib/PublicInbox/WwwText.pm
index 5e23005e..84068f5a 100644
--- a/lib/PublicInbox/WwwText.pm
+++ b/lib/PublicInbox/WwwText.pm
@@ -75,29 +75,6 @@ sub get_text {
 	PublicInbox::WwwStream::html_oneshot($ctx, $code, $txt);
 }
 
-sub _srch_prefix ($$) {
-	my ($ibx, $txt) = @_;
-	my $pad = 0;
-	my $htxt = '';
-	my $help = $ibx->isrch->help;
-	my $i;
-	for ($i = 0; $i < @$help; $i += 2) {
-		my $pfx = $help->[$i];
-		my $n = length($pfx);
-		$pad = $n if $n > $pad;
-		$htxt .= $pfx . "\0";
-		$htxt .= $help->[$i + 1];
-		$htxt .= "\f\n";
-	}
-	$pad += 2;
-	my $padding = ' ' x ($pad + 4);
-	$htxt =~ s/^/$padding/gms;
-	$htxt =~ s/^$padding(\S+)\0/"    $1".(' ' x ($pad - length($1)))/egms;
-	$htxt =~ s/\f\n/\n/gs;
-	$$txt .= $htxt;
-	1;
-}
-
 sub _colors_help ($$) {
 	my ($ctx, $txt) = @_;
 	my $ibx = $ctx->{ibx};
@@ -461,7 +438,7 @@ search
   Prefixes supported in this installation include:
 
 EOF
-		_srch_prefix($ibx, $txt);
+		$$txt .= $ibx->isrch->help_txt;
 		$$txt .= <<EOF;
 
   Most prefixes are probabilistic, meaning they support stemming

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 04/10] config: dedupe ibx->{newsgroup}
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
  2024-05-16  0:12 ` [PATCH 02/10] indexheader: deduplicate common values Eric Wong
  2024-05-16  0:12 ` [PATCH 03/10] search: help: avoid ':' in user prefixes Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 05/10] search: move QueryParser mappings to xh_args Eric Wong
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

We definitely use newsgroup names as hash keys, so get rid
of the duplicate value for some memory savings when we have
hundreds or thousands of newsgroups.
---
 lib/PublicInbox/Config.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 641e2925..9badd6dc 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -517,7 +517,9 @@ sub _fill_ibx {
 			delete $ibx->{newsgroup};
 			warn "newsgroup name invalid: `$ngname'\n";
 		} else {
-			my $lc = $ibx->{newsgroup} = lc $ngname;
+			%dedupe = (lc($ngname) => undef);
+			my ($lc) = keys %dedupe;
+			$ibx->{newsgroup} = $lc;
 			warn <<EOM if $lc ne $ngname;
 W: newsgroup=`$ngname' lowercased to `$lc'
 EOM

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 05/10] search: move QueryParser mappings to xh_args
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
                   ` (2 preceding siblings ...)
  2024-05-16  0:12 ` [PATCH 04/10] config: dedupe ibx->{newsgroup} Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 06/10] xap_helper: key search instances by -Q params, too Eric Wong
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

These are stable per-search instance.  We'll also deduplicate
the strings in case multiple inboxes share the same mappings.
---
 lib/PublicInbox/Search.pm | 23 ++++++++++++-----------
 t/watch_indexheader.t     |  2 +-
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 385b35f8..ae696735 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -473,16 +473,7 @@ sub xh_opt ($$) {
 	push @ret, '-t' if $opt->{threads};
 	push @ret, '-T', $opt->{threadid} if defined $opt->{threadid};
 	push @ret, '-O', $opt->{eidx_key} if defined $opt->{eidx_key};
-	my $apfx = $self->{-alt_pfx} //= do {
-		my @tmp;
-		for my $x (@{$self->{-extra} // []}) {
-			my $sym = $QPMETHOD_2_SYM{$x->query_parser_method};
-			push @tmp, '-Q', $x->{prefix}.$sym.$x->{xprefix};
-		}
-		# TODO: arbitrary header indexing goes here
-		\@tmp;
-	};
-	(@ret, @$apfx);
+	@ret;
 }
 
 # returns a true value if actually handled asynchronously,
@@ -727,7 +718,17 @@ sub all_terms {
 }
 
 sub xh_args { # prep getopt args to feed to xap_helper.h socket
-	map { ('-d', $_) } shard_dirs($_[0]);
+	my ($self) = @_;
+	my $apfx = $self->{-alt_pfx} //= do {
+		my %dedupe;
+		for my $x (@{$self->{-extra} // []}) {
+			my $sym = $QPMETHOD_2_SYM{$x->query_parser_method};
+			$dedupe{$x->{prefix}.$sym.$x->{xprefix}} = undef;
+		}
+		# TODO: arbitrary header indexing goes here
+		[ sort keys %dedupe ];
+	};
+	((map { ('-d', $_) } shard_dirs($self)), map { ('-Q', $_) } @$apfx);
 }
 
 sub docids_by_postlist ($$) {
diff --git a/t/watch_indexheader.t b/t/watch_indexheader.t
index e815fca9..623698e7 100644
--- a/t/watch_indexheader.t
+++ b/t/watch_indexheader.t
@@ -84,7 +84,7 @@ $cfg->each_inbox(sub {
 	my $eml = $ibx->smsg_eml($ibx->over->get_art($num->[0]));
 	is($eml->header_raw('X-Archives-Hash'), $h1,
 		'stored message with X-Archives-Hash');
-	my @opt = $srch->xh_opt;
+	my @opt = $srch->xh_args;
 	is $opt[-2], '-Q', 'xap_helper -Q switch';
 	is $opt[-1], 'xarchiveshash=XXARCHIVESHASH', 'xap_helper -Q arg';
 });

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 06/10] xap_helper: key search instances by -Q params, too
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
                   ` (3 preceding siblings ...)
  2024-05-16  0:12 ` [PATCH 05/10] search: move QueryParser mappings to xh_args Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 07/10] xap_helper.h: use khashl.h instead of hsearch(3) Eric Wong
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

In addition to the shards which comprise the xap_helper search
instance, we also account for changes in altid and indexheader
in case xap_helper lifetime exceeds the given
PublicInbox::Config.

xap_helper will be Config lifetime agnostic since it's possible
to run -netd and -httpd instances with multiple Config files,
but a single xap_helper instance (with workers) should be able
to service all of them.
---
 lib/PublicInbox/XapHelper.pm |  3 ++-
 lib/PublicInbox/xap_helper.h | 45 +++++++++++++++++++-----------------
 2 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/lib/PublicInbox/XapHelper.pm b/lib/PublicInbox/XapHelper.pm
index c9957f64..f1311bd4 100644
--- a/lib/PublicInbox/XapHelper.pm
+++ b/lib/PublicInbox/XapHelper.pm
@@ -190,7 +190,8 @@ sub dispatch {
 	$GLP->getoptionsfromarray(\@argv, $req, @PublicInbox::Search::XH_SPEC)
 		or return;
 	my $dirs = delete $req->{d} or die 'no -d args';
-	my $key = join("\0", @$dirs);
+	my $key = "-d\0".join("\0-d\0", @$dirs);
+	$key .= "\0".join("\0", map { ('-Q', $_) } @{$req->{Q}}) if $req->{Q};
 	my $new;
 	$req->{srch} = $SRCH{$key} //= do {
 		$new = { qp_flags => $PublicInbox::Search::QP_FLAGS };
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index a30a8768..8bfd7ab6 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -112,12 +112,12 @@ enum exc_iter {
 };
 
 struct srch {
-	int paths_len; // int for comparisons
+	int ckey_len; // int for comparisons
 	unsigned qp_flags;
 	bool qp_extra_done;
 	Xapian::Database *db;
 	Xapian::QueryParser *qp;
-	char paths[]; // $shard_path0\0$shard_path1\0...
+	char ckey[]; // $shard_path0\0$shard_path1\0...
 };
 
 #define MY_ARG_MAX 256
@@ -128,6 +128,7 @@ struct req { // argv and pfxv point into global rbuf
 	char *argv[MY_ARG_MAX];
 	char *pfxv[MY_ARG_MAX]; // -A <prefix>
 	char *qpfxv[MY_ARG_MAX]; // -Q <user_prefix>[:=]<INTERNAL_PREFIX>
+	char *dirv[MY_ARG_MAX]; // -d /path/to/XDB(shard)
 	size_t *lenv; // -A <prefix>LENGTH
 	struct srch *srch;
 	char *Pgit_dir;
@@ -139,9 +140,7 @@ struct req { // argv and pfxv point into global rbuf
 	unsigned long timeout_sec;
 	size_t nr_out;
 	long sort_col; // value column, negative means BoolWeight
-	int argc;
-	int pfxc;
-	int qpfxc;
+	int argc, pfxc, qpfxc, dirc;
 	FILE *fp[2]; // [0] response pipe or sock, [1] status/errors (optional)
 	bool has_input; // fp[0] is bidirectional
 	bool collapse_threads;
@@ -516,9 +515,9 @@ static int srch_cmp(const void *pa, const void *pb) // for tfind|tsearch
 {
 	const struct srch *a = (const struct srch *)pa;
 	const struct srch *b = (const struct srch *)pb;
-	int diff = a->paths_len - b->paths_len;
+	int diff = a->ckey_len - b->ckey_len;
 
-	return diff ? diff : memcmp(a->paths, b->paths, (size_t)a->paths_len);
+	return diff ? diff : memcmp(a->ckey, b->ckey, (size_t)a->ckey_len);
 }
 
 static bool is_chert(const char *dir)
@@ -536,31 +535,30 @@ static bool is_chert(const char *dir)
 
 static bool srch_init(struct req *req)
 {
-	char *dirv[MY_ARG_MAX];
 	int i;
 	struct srch *srch = req->srch;
-	int dirc = (int)SPLIT2ARGV(dirv, srch->paths, (size_t)srch->paths_len);
 	const unsigned FLAG_PHRASE = Xapian::QueryParser::FLAG_PHRASE;
 	srch->qp_flags = FLAG_PHRASE |
 			Xapian::QueryParser::FLAG_BOOLEAN |
 			Xapian::QueryParser::FLAG_LOVEHATE |
 			Xapian::QueryParser::FLAG_WILDCARD;
-	if (is_chert(dirv[0]))
+	if (is_chert(req->dirv[0]))
 		srch->qp_flags &= ~FLAG_PHRASE;
 	try {
-		srch->db = new Xapian::Database(dirv[0]);
+		srch->db = new Xapian::Database(req->dirv[0]);
 	} catch (...) {
-		warn("E: Xapian::Database(%s)", dirv[0]);
+		warn("E: Xapian::Database(%s)", req->dirv[0]);
 		return false;
 	}
 	try {
-		for (i = 1; i < dirc; i++) {
-			if (srch->qp_flags & FLAG_PHRASE && is_chert(dirv[i]))
+		for (i = 1; i < req->dirc; i++) {
+			const char *dir = req->dirv[i];
+			if (srch->qp_flags & FLAG_PHRASE && is_chert(dir))
 				srch->qp_flags &= ~FLAG_PHRASE;
-			srch->db->add_database(Xapian::Database(dirv[i]));
+			srch->db->add_database(Xapian::Database(dir));
 		}
 	} catch (...) {
-		warn("E: add_database(%s)", dirv[i]);
+		warn("E: add_database(%s)", req->dirv[i]);
 		return false;
 	}
 	try {
@@ -644,7 +642,7 @@ static void dispatch(struct req *req)
 	kfp = open_memstream(&kbuf.ptr, &size);
 	if (!kfp) err(EXIT_FAILURE, "open_memstream(kbuf)");
 	// write padding, first (contents don't matter)
-	fwrite(&req->argv[0], offsetof(struct srch, paths), 1, kfp);
+	fwrite(&req->argv[0], offsetof(struct srch, ckey), 1, kfp);
 
 	// global getopt variables:
 	optopt = 0;
@@ -656,7 +654,11 @@ static void dispatch(struct req *req)
 		switch (c) {
 		case 'a': req->asc = true; break;
 		case 'c': req->code_search = true; break;
-		case 'd': fwrite(optarg, strlen(optarg) + 1, 1, kfp); break;
+		case 'd':
+			req->dirv[req->dirc++] = optarg;
+			if (MY_ARG_MAX == req->dirc) ABORT("too many -d");
+			fprintf(kfp, "-d%c%s%c", 0, optarg, 0);
+			break;
 		case 'g': req->Pgit_dir = optarg - 1; break; // pad "P" prefix
 		case 'k':
 			req->sort_col = strtol(optarg, &end, 10);
@@ -696,6 +698,7 @@ static void dispatch(struct req *req)
 		case 'Q':
 			req->qpfxv[req->qpfxc++] = optarg;
 			if (MY_ARG_MAX == req->qpfxc) ABORT("too many -Q");
+			fprintf(kfp, "-Q%c%s%c", 0, optarg, 0);
 			break;
 		default: ABORT("bad switch `-%c'", c);
 		}
@@ -704,9 +707,9 @@ static void dispatch(struct req *req)
 	kbuf.srch->db = NULL;
 	kbuf.srch->qp = NULL;
 	kbuf.srch->qp_extra_done = false;
-	kbuf.srch->paths_len = size - offsetof(struct srch, paths);
-	if (kbuf.srch->paths_len <= 0)
-		ABORT("no -d args");
+	kbuf.srch->ckey_len = size - offsetof(struct srch, ckey);
+	if (kbuf.srch->ckey_len <= 0 || !req->dirc)
+		ABORT("no -d args (or too many)");
 	s = (struct srch **)tsearch(kbuf.srch, &srch_tree, srch_cmp);
 	if (!s) err(EXIT_FAILURE, "tsearch"); // likely ENOMEM
 	req->srch = *s;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 07/10] xap_helper.h: use khashl.h instead of hsearch(3)
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
                   ` (4 preceding siblings ...)
  2024-05-16  0:12 ` [PATCH 06/10] xap_helper: key search instances by -Q params, too Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 08/10] xap_helper.h: use xcalloc to simplify error checking Eric Wong
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

hsearch(3) and friends are just too horrid of APIs and subject
to fatal problems due to system-dependent ENTRY.key use of
strdup(3).  So replace it with khashl (which is a newer, smaller
version of the widely-used khash in git.git).

We'll also be able to use khashl in the future for
the FUSE shim if liburcu isn't available.
---
 MANIFEST                     |   1 +
 lib/PublicInbox/khashl.h     | 502 +++++++++++++++++++++++++++++++++++
 lib/PublicInbox/xap_helper.h |  55 ++--
 lib/PublicInbox/xh_cidx.h    |  79 ++++--
 4 files changed, 590 insertions(+), 47 deletions(-)
 create mode 100644 lib/PublicInbox/khashl.h

diff --git a/MANIFEST b/MANIFEST
index 87db7276..d153ba3c 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -386,6 +386,7 @@ lib/PublicInbox/Xapcmd.pm
 lib/PublicInbox/XhcMset.pm
 lib/PublicInbox/XhcMsetIterator.pm
 lib/PublicInbox/gcf2_libgit2.h
+lib/PublicInbox/khashl.h
 lib/PublicInbox/xap_helper.h
 lib/PublicInbox/xh_cidx.h
 lib/PublicInbox/xh_mset.h
diff --git a/lib/PublicInbox/khashl.h b/lib/PublicInbox/khashl.h
new file mode 100644
index 00000000..170b81ff
--- /dev/null
+++ b/lib/PublicInbox/khashl.h
@@ -0,0 +1,502 @@
+/* The MIT License
+
+   Copyright (c) 2019-2023 by Attractive Chaos <attractor@live.co.uk>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+
+#ifndef __AC_KHASHL_H
+#define __AC_KHASHL_H
+
+#define AC_VERSION_KHASHL_H "0.2"
+
+typedef uint32_t khint32_t;
+typedef uint64_t khint64_t;
+
+typedef khint32_t khint_t;
+typedef khint_t khiter_t;
+
+#define kh_inline inline
+#define KH_LOCAL static kh_inline
+
+#ifndef kcalloc
+#define kcalloc(N,Z) xcalloc(N,Z)
+#endif
+#ifndef kfree
+#define kfree(P) free(P)
+#endif
+
+/****************************
+ * Simple private functions *
+ ****************************/
+
+#define __kh_used(flag, i)       (flag[i>>5] >> (i&0x1fU) & 1U)
+#define __kh_set_used(flag, i)   (flag[i>>5] |= 1U<<(i&0x1fU))
+#define __kh_set_unused(flag, i) (flag[i>>5] &= ~(1U<<(i&0x1fU)))
+
+#define __kh_fsize(m) ((m) < 32? 1 : (m)>>5)
+
+static kh_inline khint_t __kh_h2b(khint_t hash, khint_t bits) { return hash * 2654435769U >> (32 - bits); }
+
+/*******************
+ * Hash table base *
+ *******************/
+
+#define __KHASHL_TYPE(HType, khkey_t) \
+	typedef struct HType { \
+		khint_t bits, count; \
+		khint32_t *used; \
+		khkey_t *keys; \
+	} HType;
+
+#define __KHASHL_PROTOTYPES(HType, prefix, khkey_t) \
+	extern HType *prefix##_init(void); \
+	extern void prefix##_destroy(HType *h); \
+	extern void prefix##_clear(HType *h); \
+	extern khint_t prefix##_getp(const HType *h, const khkey_t *key); \
+	extern void prefix##_resize(HType *h, khint_t new_n_buckets); \
+	extern khint_t prefix##_putp(HType *h, const khkey_t *key, int *absent); \
+	extern void prefix##_del(HType *h, khint_t k);
+
+#define __KHASHL_IMPL_BASIC(SCOPE, HType, prefix) \
+	SCOPE HType *prefix##_init(void) { \
+		return (HType*)kcalloc(1, sizeof(HType)); \
+	} \
+	SCOPE void prefix##_release(HType *h) { \
+		kfree((void *)h->keys); kfree(h->used); \
+	} \
+	SCOPE void prefix##_destroy(HType *h) { \
+		if (!h) return; \
+		prefix##_release(h); \
+		kfree(h); \
+	} \
+	SCOPE void prefix##_clear(HType *h) { \
+		if (h && h->used) { \
+			khint_t n_buckets = (khint_t)1U << h->bits; \
+			memset(h->used, 0, __kh_fsize(n_buckets) * sizeof(khint32_t)); \
+			h->count = 0; \
+		} \
+	}
+
+#define __KHASHL_IMPL_GET(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	SCOPE khint_t prefix##_getp_core(const HType *h, const khkey_t *key, khint_t hash) { \
+		khint_t i, last, n_buckets, mask; \
+		if (!h->keys) return 0; \
+		n_buckets = (khint_t)1U << h->bits; \
+		mask = n_buckets - 1U; \
+		i = last = __kh_h2b(hash, h->bits); \
+		while (__kh_used(h->used, i) && !__hash_eq(h->keys[i], *key)) { \
+			i = (i + 1U) & mask; \
+			if (i == last) return n_buckets; \
+		} \
+		return !__kh_used(h->used, i)? n_buckets : i; \
+	} \
+	SCOPE khint_t prefix##_getp(const HType *h, const khkey_t *key) { return prefix##_getp_core(h, key, __hash_fn(*key)); } \
+	SCOPE khint_t prefix##_get(const HType *h, khkey_t key) { return prefix##_getp_core(h, &key, __hash_fn(key)); }
+
+#define __KHASHL_IMPL_RESIZE(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	SCOPE void prefix##_resize(HType *h, khint_t new_n_buckets) { \
+		khint32_t *new_used = NULL; \
+		khint_t j = 0, x = new_n_buckets, n_buckets, new_bits, new_mask; \
+		while ((x >>= 1) != 0) ++j; \
+		if (new_n_buckets & (new_n_buckets - 1)) ++j; \
+		new_bits = j > 2? j : 2; \
+		new_n_buckets = (khint_t)1U << new_bits; \
+		if (h->count > (new_n_buckets>>1) + (new_n_buckets>>2)) return; /* noop, requested size is too small */ \
+		new_used = (khint32_t*)kcalloc(__kh_fsize(new_n_buckets), sizeof(khint32_t)); \
+		n_buckets = h->keys? (khint_t)1U<<h->bits : 0U; \
+		if (n_buckets < new_n_buckets) { /* expand */ \
+			h->keys = (khkey_t *)xreallocarray(h->keys, \
+					new_n_buckets, sizeof(khkey_t)); \
+		} /* otherwise shrink */ \
+		new_mask = new_n_buckets - 1; \
+		for (j = 0; j != n_buckets; ++j) { \
+			khkey_t key; \
+			if (!__kh_used(h->used, j)) continue; \
+			key = h->keys[j]; \
+			__kh_set_unused(h->used, j); \
+			while (1) { /* kick-out process; sort of like in Cuckoo hashing */ \
+				khint_t i; \
+				i = __kh_h2b(__hash_fn(key), new_bits); \
+				while (__kh_used(new_used, i)) i = (i + 1) & new_mask; \
+				__kh_set_used(new_used, i); \
+				if (i < n_buckets && __kh_used(h->used, i)) { /* kick out the existing element */ \
+					{ khkey_t tmp = h->keys[i]; h->keys[i] = key; key = tmp; } \
+					__kh_set_unused(h->used, i); /* mark it as deleted in the old hash table */ \
+				} else { /* write the element and jump out of the loop */ \
+					h->keys[i] = key; \
+					break; \
+				} \
+			} \
+		} \
+		if (n_buckets > new_n_buckets) /* shrink the hash table */ \
+			h->keys = (khkey_t *)xreallocarray(h->keys, \
+					new_n_buckets, sizeof(khkey_t)); \
+		kfree(h->used); /* free the working space */ \
+		h->used = new_used, h->bits = new_bits; \
+	}
+
+#define __KHASHL_IMPL_PUT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	SCOPE khint_t prefix##_putp_core(HType *h, const khkey_t *key, khint_t hash, int *absent) { \
+		khint_t n_buckets, i, last, mask; \
+		n_buckets = h->keys? (khint_t)1U<<h->bits : 0U; \
+		*absent = -1; \
+		if (h->count >= (n_buckets>>1) + (n_buckets>>2)) { /* rehashing */ \
+			prefix##_resize(h, n_buckets + 1U); \
+			n_buckets = (khint_t)1U<<h->bits; \
+		} /* TODO: to implement automatically shrinking; resize() already support shrinking */ \
+		mask = n_buckets - 1; \
+		i = last = __kh_h2b(hash, h->bits); \
+		while (__kh_used(h->used, i) && !__hash_eq(h->keys[i], *key)) { \
+			i = (i + 1U) & mask; \
+			if (i == last) break; \
+		} \
+		if (!__kh_used(h->used, i)) { /* not present at all */ \
+			h->keys[i] = *key; \
+			__kh_set_used(h->used, i); \
+			++h->count; \
+			*absent = 1; \
+		} else *absent = 0; /* Don't touch h->keys[i] if present */ \
+		return i; \
+	} \
+	SCOPE khint_t prefix##_putp(HType *h, const khkey_t *key, int *absent) { return prefix##_putp_core(h, key, __hash_fn(*key), absent); } \
+	SCOPE khint_t prefix##_put(HType *h, khkey_t key, int *absent) { return prefix##_putp_core(h, &key, __hash_fn(key), absent); }
+
+#define __KHASHL_IMPL_DEL(SCOPE, HType, prefix, khkey_t, __hash_fn) \
+	SCOPE int prefix##_del(HType *h, khint_t i) { \
+		khint_t j = i, k, mask, n_buckets; \
+		if (!h->keys) return 0; \
+		n_buckets = (khint_t)1U<<h->bits; \
+		mask = n_buckets - 1U; \
+		while (1) { \
+			j = (j + 1U) & mask; \
+			if (j == i || !__kh_used(h->used, j)) break; /* j==i only when the table is completely full */ \
+			k = __kh_h2b(__hash_fn(h->keys[j]), h->bits); \
+			if ((j > i && (k <= i || k > j)) || (j < i && (k <= i && k > j))) \
+				h->keys[i] = h->keys[j], i = j; \
+		} \
+		__kh_set_unused(h->used, i); \
+		--h->count; \
+		return 1; \
+	}
+
+#define KHASHL_DECLARE(HType, prefix, khkey_t) \
+	__KHASHL_TYPE(HType, khkey_t) \
+	__KHASHL_PROTOTYPES(HType, prefix, khkey_t)
+
+#define KHASHL_INIT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	__KHASHL_TYPE(HType, khkey_t) \
+	__KHASHL_IMPL_BASIC(SCOPE, HType, prefix) \
+	__KHASHL_IMPL_GET(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	__KHASHL_IMPL_RESIZE(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	__KHASHL_IMPL_PUT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	__KHASHL_IMPL_DEL(SCOPE, HType, prefix, khkey_t, __hash_fn)
+
+#define KHASHE_INIT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	KHASHL_INIT(KH_LOCAL, HType##_sub, prefix##_sub, khkey_t, __hash_fn, __hash_eq) \
+	typedef struct HType { \
+		khint64_t count:54, bits:8; \
+		HType##_sub *sub; \
+	} HType; \
+	SCOPE HType *prefix##_init_sub(HType *g, size_t bits) { \
+		g->bits = bits; \
+		g->sub = (HType##_sub*)kcalloc(1U<<bits, sizeof(*g->sub)); \
+		return g; \
+	} \
+	SCOPE HType *prefix##_init(void) { \
+		HType *g; \
+		g = (HType*)kcalloc(1, sizeof(*g)); \
+		return prefix##_init_sub(g, 0); /* unsure about default */ \
+	} \
+	SCOPE void prefix##_release(HType *g) { \
+		int t; \
+		for (t = 0; t < 1<<g->bits; ++t) \
+			prefix##_sub_release(&g->sub[t]); \
+		kfree(g->sub); \
+	} \
+	SCOPE void prefix##_destroy(HType *g) { \
+		if (!g) return; \
+		prefix##_release(g); \
+		kfree(g); \
+	} \
+	SCOPE void prefix##_clear(HType *g) { \
+		int t; \
+		if (!g) return; \
+		for (t = 0; t < 1<<g->bits; ++t) \
+			prefix##_sub_clear(&g->sub[t]); \
+	} \
+	SCOPE kh_ensitr_t prefix##_getp(const HType *g, const khkey_t *key) { \
+		khint_t hash, low, ret; \
+		kh_ensitr_t r; \
+		HType##_sub *h; \
+		hash = __hash_fn(*key); \
+		low = hash & ((1U<<g->bits) - 1); \
+		h = &g->sub[low]; \
+		ret = prefix##_sub_getp_core(h, key, hash); \
+		if (ret >= kh_end(h)) r.sub = low, r.pos = (khint_t)-1; \
+		else r.sub = low, r.pos = ret; \
+		return r; \
+	} \
+	SCOPE kh_ensitr_t prefix##_get(const HType *g, const khkey_t key) { return prefix##_getp(g, &key); } \
+	SCOPE kh_ensitr_t prefix##_putp(HType *g, const khkey_t *key, int *absent) { \
+		khint_t hash, low, ret; \
+		kh_ensitr_t r; \
+		HType##_sub *h; \
+		hash = __hash_fn(*key); \
+		low = hash & ((1U<<g->bits) - 1); \
+		h = &g->sub[low]; \
+		ret = prefix##_sub_putp_core(h, key, hash, absent); \
+		if (*absent) ++g->count; \
+		if (ret == 1U<<h->bits) r.sub = low, r.pos = (khint_t)-1; \
+		else r.sub = low, r.pos = ret; \
+		return r; \
+	} \
+	SCOPE kh_ensitr_t prefix##_put(HType *g, const khkey_t key, int *absent) { return prefix##_putp(g, &key, absent); } \
+	SCOPE int prefix##_del(HType *g, kh_ensitr_t itr) { \
+		HType##_sub *h = &g->sub[itr.sub]; \
+		int ret; \
+		ret = prefix##_sub_del(h, itr.pos); \
+		if (ret) --g->count; \
+		return ret; \
+	}
+
+/*****************************
+ * More convenient interface *
+ *****************************/
+
+#define __kh_packed /* noop, we use -Werror=address-of-packed-member */
+#define __kh_cached_hash(x) ((x).hash)
+
+#define KHASHL_SET_INIT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	typedef struct { khkey_t key; } __kh_packed HType##_s_bucket_t; \
+	static kh_inline khint_t prefix##_s_hash(HType##_s_bucket_t x) { return __hash_fn(x.key); } \
+	static kh_inline int prefix##_s_eq(HType##_s_bucket_t x, HType##_s_bucket_t y) { return __hash_eq(x.key, y.key); } \
+	KHASHL_INIT(KH_LOCAL, HType, prefix##_s, HType##_s_bucket_t, prefix##_s_hash, prefix##_s_eq) \
+	SCOPE HType *prefix##_init(void) { return prefix##_s_init(); } \
+	SCOPE void prefix##_release(HType *h) { prefix##_s_release(h); } \
+	SCOPE void prefix##_destroy(HType *h) { prefix##_s_destroy(h); } \
+	SCOPE void prefix##_clear(HType *h) { prefix##_s_clear(h); } \
+	SCOPE void prefix##_resize(HType *h, khint_t new_n_buckets) { prefix##_s_resize(h, new_n_buckets); } \
+	SCOPE khint_t prefix##_get(const HType *h, khkey_t key) { HType##_s_bucket_t t; t.key = key; return prefix##_s_getp(h, &t); } \
+	SCOPE int prefix##_del(HType *h, khint_t k) { return prefix##_s_del(h, k); } \
+	SCOPE khint_t prefix##_put(HType *h, khkey_t key, int *absent) { HType##_s_bucket_t t; t.key = key; return prefix##_s_putp(h, &t, absent); } \
+
+#define KHASHL_MAP_INIT(SCOPE, HType, prefix, khkey_t, kh_val_t, __hash_fn, __hash_eq) \
+	typedef struct { khkey_t key; kh_val_t val; } __kh_packed HType##_m_bucket_t; \
+	static kh_inline khint_t prefix##_m_hash(HType##_m_bucket_t x) { return __hash_fn(x.key); } \
+	static kh_inline int prefix##_m_eq(HType##_m_bucket_t x, HType##_m_bucket_t y) { return __hash_eq(x.key, y.key); } \
+	KHASHL_INIT(KH_LOCAL, HType, prefix##_m, HType##_m_bucket_t, prefix##_m_hash, prefix##_m_eq) \
+	SCOPE HType *prefix##_init(void) { return prefix##_m_init(); } \
+	SCOPE void prefix##_release(HType *h) { prefix##_m_release(h); } \
+	SCOPE void prefix##_destroy(HType *h) { prefix##_m_destroy(h); } \
+	SCOPE void prefix##_clear(HType *h) { prefix##_m_clear(h); } \
+	SCOPE void prefix##_resize(HType *h, khint_t new_n_buckets) { prefix##_m_resize(h, new_n_buckets); } \
+	SCOPE khint_t prefix##_get(const HType *h, khkey_t key) { HType##_m_bucket_t t; t.key = key; return prefix##_m_getp(h, &t); } \
+	SCOPE int prefix##_del(HType *h, khint_t k) { return prefix##_m_del(h, k); } \
+	SCOPE khint_t prefix##_put(HType *h, khkey_t key, int *absent) { HType##_m_bucket_t t; t.key = key; return prefix##_m_putp(h, &t, absent); } \
+
+#define KHASHL_CSET_INIT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
+	typedef struct { khkey_t key; khint_t hash; } __kh_packed HType##_cs_bucket_t; \
+	static kh_inline int prefix##_cs_eq(HType##_cs_bucket_t x, HType##_cs_bucket_t y) { return x.hash == y.hash && __hash_eq(x.key, y.key); } \
+	KHASHL_INIT(KH_LOCAL, HType, prefix##_cs, HType##_cs_bucket_t, __kh_cached_hash, prefix##_cs_eq) \
+	SCOPE HType *prefix##_init(void) { return prefix##_cs_init(); } \
+	SCOPE void prefix##_destroy(HType *h) { prefix##_cs_destroy(h); } \
+	SCOPE khint_t prefix##_get(const HType *h, khkey_t key) { HType##_cs_bucket_t t; t.key = key; t.hash = __hash_fn(key); return prefix##_cs_getp(h, &t); } \
+	SCOPE int prefix##_del(HType *h, khint_t k) { return prefix##_cs_del(h, k); } \
+	SCOPE khint_t prefix##_put(HType *h, khkey_t key, int *absent) { HType##_cs_bucket_t t; t.key = key, t.hash = __hash_fn(key); return prefix##_cs_putp(h, &t, absent); }
+
+#define KHASHL_CMAP_INIT(SCOPE, HType, prefix, khkey_t, kh_val_t, __hash_fn, __hash_eq) \
+	typedef struct { khkey_t key; kh_val_t val; khint_t hash; } __kh_packed HType##_cm_bucket_t; \
+	static kh_inline int prefix##_cm_eq(HType##_cm_bucket_t x, HType##_cm_bucket_t y) { return x.hash == y.hash && __hash_eq(x.key, y.key); } \
+	KHASHL_INIT(KH_LOCAL, HType, prefix##_cm, HType##_cm_bucket_t, __kh_cached_hash, prefix##_cm_eq) \
+	SCOPE HType *prefix##_init(void) { return prefix##_cm_init(); } \
+	SCOPE void prefix##_destroy(HType *h) { prefix##_cm_destroy(h); } \
+	SCOPE khint_t prefix##_get(const HType *h, khkey_t key) { HType##_cm_bucket_t t; t.key = key; t.hash = __hash_fn(key); return prefix##_cm_getp(h, &t); } \
+	SCOPE int prefix##_del(HType *h, khint_t k) { return prefix##_cm_del(h, k); } \
+	SCOPE khint_t prefix##_put(HType *h, khkey_t key, int *absent) { HType##_cm_bucket_t t; t.key = key, t.hash = __hash_fn(key); return prefix##_cm_putp(h, &t, absent); }
+
+#define KHASHE_MAP_INIT(SCOPE, HType, prefix, khkey_t, kh_val_t, __hash_fn, __hash_eq) \
+	typedef struct { khkey_t key; kh_val_t val; } __kh_packed HType##_m_bucket_t; \
+	static kh_inline khint_t prefix##_m_hash(HType##_m_bucket_t x) { return __hash_fn(x.key); } \
+	static kh_inline int prefix##_m_eq(HType##_m_bucket_t x, HType##_m_bucket_t y) { return __hash_eq(x.key, y.key); } \
+	KHASHE_INIT(KH_LOCAL, HType, prefix##_m, HType##_m_bucket_t, prefix##_m_hash, prefix##_m_eq) \
+	SCOPE HType *prefix##_init(void) { return prefix##_m_init(); } \
+	SCOPE void prefix##_release(HType *h) { prefix##_m_release(h); } \
+	SCOPE void prefix##_destroy(HType *h) { prefix##_m_destroy(h); } \
+	SCOPE void prefix##_clear(HType *h) { prefix##_m_clear(h); } \
+	SCOPE void prefix##_resize(HType *h, khint_t ignore) { /* noop */ } \
+	SCOPE kh_ensitr_t prefix##_get(const HType *h, khkey_t key) { HType##_m_bucket_t t; t.key = key; return prefix##_m_getp(h, &t); } \
+	SCOPE int prefix##_del(HType *h, kh_ensitr_t k) { return prefix##_m_del(h, k); } \
+	SCOPE kh_ensitr_t prefix##_put(HType *h, khkey_t key, int *absent) { HType##_m_bucket_t t; t.key = key; return prefix##_m_putp(h, &t, absent); } \
+
+/**************************
+ * Public macro functions *
+ **************************/
+
+#define kh_bucket(h, x) ((h)->keys[x])
+
+/*! @function
+  @abstract     Get the number of elements in the hash table
+  @param  h     Pointer to the hash table
+  @return       Number of elements in the hash table [khint_t]
+ */
+#define kh_size(h) ((h)->count)
+
+#define kh_capacity(h) ((h)->keys? 1U<<(h)->bits : 0U)
+
+/*! @function
+  @abstract     Get the end iterator
+  @param  h     Pointer to the hash table
+  @return       The end iterator [khint_t]
+ */
+#define kh_end(h) kh_capacity(h)
+
+/*! @function
+  @abstract     Get key given an iterator
+  @param  h     Pointer to the hash table
+  @param  x     Iterator to the bucket [khint_t]
+  @return       Key [type of keys]
+ */
+#define kh_key(h, x) ((h)->keys[x].key)
+
+/*! @function
+  @abstract     Get value given an iterator
+  @param  h     Pointer to the hash table
+  @param  x     Iterator to the bucket [khint_t]
+  @return       Value [type of values]
+  @discussion   For hash sets, calling this results in segfault.
+ */
+#define kh_val(h, x) ((h)->keys[x].val)
+
+/*! @function
+  @abstract     Alias of kh_val()
+ */
+#define kh_value(h, x) kh_val(h, x)
+
+/*! @function
+  @abstract     Test whether a bucket contains data.
+  @param  h     Pointer to the hash table
+  @param  x     Iterator to the bucket [khint_t]
+  @return       1 if containing data; 0 otherwise [int]
+ */
+#define kh_exist(h, x) __kh_used((h)->used, (x))
+
+#define kh_ens_key(g, x) kh_key(&(g)->sub[(x).sub], (x).pos)
+#define kh_ens_val(g, x) kh_val(&(g)->sub[(x).sub], (x).pos)
+#define kh_ens_exist(g, x) kh_exist(&(g)->sub[(x).sub], (x).pos)
+#define kh_ens_is_end(x) ((x).pos == (khint_t)-1)
+#define kh_ens_size(g) ((g)->count)
+
+/**************************************
+ * Common hash and equality functions *
+ **************************************/
+
+#define kh_eq_generic(a, b) ((a) == (b))
+#define kh_eq_str(a, b) (strcmp((a), (b)) == 0)
+#define kh_hash_dummy(x) ((khint_t)(x))
+
+static kh_inline khint_t kh_hash_uint32(khint_t key) {
+	key += ~(key << 15);
+	key ^=  (key >> 10);
+	key +=  (key << 3);
+	key ^=  (key >> 6);
+	key += ~(key << 11);
+	key ^=  (key >> 16);
+	return key;
+}
+
+static kh_inline khint_t kh_hash_uint64(khint64_t key) {
+	key = ~key + (key << 21);
+	key = key ^ key >> 24;
+	key = (key + (key << 3)) + (key << 8);
+	key = key ^ key >> 14;
+	key = (key + (key << 2)) + (key << 4);
+	key = key ^ key >> 28;
+	key = key + (key << 31);
+	return (khint_t)key;
+}
+
+#define KH_FNV_SEED 11
+
+static kh_inline khint_t kh_hash_str(const char *s) { /* FNV1a */
+	khint_t h = KH_FNV_SEED ^ 2166136261U;
+	const unsigned char *t = (const unsigned char*)s;
+	for (; *t; ++t)
+		h ^= *t, h *= 16777619;
+	return h;
+}
+
+static kh_inline khint_t kh_hash_bytes(int len, const unsigned char *s) {
+	khint_t h = KH_FNV_SEED ^ 2166136261U;
+	int i;
+	for (i = 0; i < len; ++i)
+		h ^= s[i], h *= 16777619;
+	return h;
+}
+
+/*! @function
+  @abstract     Get the start iterator
+  @param  h     Pointer to the hash table
+  @return       The start iterator [khint_t]
+ */
+#define kh_begin(h) (khint_t)(0)
+
+/*! @function
+  @abstract     Iterate over the entries in the hash table
+  @param  h     Pointer to the hash table
+  @param  kvar  Variable to which key will be assigned
+  @param  vvar  Variable to which value will be assigned
+  @param  code  Block of code to execute
+ */
+#define kh_foreach(h, kvar, vvar, code) { khint_t __i;		\
+	for (__i = kh_begin(h); __i != kh_end(h); ++__i) {	\
+		if (!kh_exist(h,__i)) continue;			\
+		(kvar) = kh_key(h,__i);				\
+		(vvar) = kh_val(h,__i);				\
+		code;						\
+	} }
+
+#define kh_ens_foreach(g, kvar, vvar, code) do { \
+	size_t t; \
+	for (t = 0; t < 1<<g->bits; ++t) \
+		kh_foreach(&g->sub[t], kvar, vvar, code); \
+} while (0)
+
+#define kh_ens_foreach_value(g, vvar, code) do { \
+	size_t t; \
+	for (t = 0; t < 1<<g->bits; ++t) \
+		kh_foreach_value(&g->sub[t], vvar, code); \
+} while (0)
+
+/*! @function
+  @abstract     Iterate over the values in the hash table
+  @param  h     Pointer to the hash table
+  @param  vvar  Variable to which value will be assigned
+  @param  code  Block of code to execute
+ */
+#define kh_foreach_value(h, vvar, code) { khint_t __i;		\
+	for (__i = kh_begin(h); __i != kh_end(h); ++__i) {	\
+		if (!kh_exist(h,__i)) continue;			\
+		(vvar) = kh_val(h,__i);				\
+		code;						\
+	} }
+
+#endif /* __AC_KHASHL_H */
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 8bfd7ab6..92d3d12f 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -7,7 +7,7 @@
  * this is not linked to Perl in any way.
  * C (not C++) is used as much as possible to lower the contribution
  * barrier for hackers who mainly know C (this includes the maintainer).
- * Yes, that means we use C stdlib stuff like hsearch and open_memstream
+ * Yes, that means we use C stdlib stuff like open_memstream
  * instead their equivalents in the C++ stdlib :P
  * Everything here is an unstable internal API of public-inbox and
  * NOT intended for ordinary users; only public-inbox hackers
@@ -15,6 +15,9 @@
 #ifndef _ALL_SOURCE
 #	define _ALL_SOURCE
 #endif
+#ifndef _GNU_SOURCE
+#	define _GNU_SOURCE
+#endif
 #if defined(__NetBSD__) && !defined(_OPENBSD_SOURCE) // for reallocarray(3)
 #	define _OPENBSD_SOURCE
 #endif
@@ -83,6 +86,36 @@
 #define ABORT(...) do { warnx(__VA_ARGS__); abort(); } while (0)
 #define EABORT(...) do { warn(__VA_ARGS__); abort(); } while (0)
 
+static void *xcalloc(size_t nmemb, size_t size)
+{
+	void *ret = calloc(nmemb, size);
+	if (!ret) EABORT("calloc(%zu, %zu)", nmemb, size);
+	return ret;
+}
+
+#if defined(__GLIBC__) && defined(__GLIBC_MINOR__) && \
+		MY_VER(__GLIBC__, __GLIBC_MINOR__, 0) >= MY_VER(2, 28, 0)
+#	define HAVE_REALLOCARRAY 1
+#elif (defined(__OpenBSD__) || defined(__DragonFly__) || \
+		defined(__FreeBSD__) || defined(__NetBSD__)
+#	define HAVE_REALLOCARRAY 1
+#endif
+
+static void *xreallocarray(void *ptr, size_t nmemb, size_t size)
+{
+#ifdef HAVE_REALLOCARRAY
+	void *ret = reallocarray(ptr, nmemb, size);
+#else // can't rely on __builtin_mul_overflow in gcc 4.x :<
+	void *ret = NULL;
+	if (nmemb && size > SIZE_MAX / nmemb)
+		errno = ENOMEM;
+	else
+		ret = realloc(ptr, nmemb * size);
+#endif
+	if (!ret) EABORT("reallocarray(..., %zu, %zu)", nmemb, size);
+	return ret;
+}
+
 // sock_fd is modified in signal handler, yes, it's SOCK_SEQPACKET
 static volatile int sock_fd = STDIN_FILENO;
 static sigset_t fullset, workerset;
@@ -374,25 +407,6 @@ static size_t off2size(off_t n)
 	return (size_t)n;
 }
 
-static char *hsearch_enter_key(char *s)
-{
-#if defined(__OpenBSD__) || defined(__DragonFly__)
-	// hdestroy frees each key on some platforms,
-	// so give it something to free:
-	char *ret = strdup(s);
-	if (!ret) err(EXIT_FAILURE, "strdup");
-	return ret;
-// AFAIK there's no way to detect musl, assume non-glibc Linux is musl:
-#elif defined(__GLIBC__) || defined(__linux__) || \
-	defined(__FreeBSD__) || defined(__NetBSD__)
-	// do nothing on these platforms
-#else
-#warning untested platform detected, unsure if hdestroy(3) frees keys
-#warning contact us at meta@public-inbox.org if you get segfaults
-#endif
-	return s;
-}
-
 // for test usage only, we need to ensure the compiler supports
 // __cleanup__ when exceptions are thrown
 struct inspect { struct req *req; };
@@ -421,6 +435,7 @@ static bool cmd_test_sleep(struct req *req)
 	for (;;) poll(NULL, 0, 10);
 	return false;
 }
+#include "khashl.h"
 #include "xh_mset.h" // read-only (WWW, IMAP, lei) stuff
 #include "xh_cidx.h" // CodeSearchIdx.pm stuff
 
diff --git a/lib/PublicInbox/xh_cidx.h b/lib/PublicInbox/xh_cidx.h
index 311ca05f..8cc6a845 100644
--- a/lib/PublicInbox/xh_cidx.h
+++ b/lib/PublicInbox/xh_cidx.h
@@ -3,11 +3,38 @@
 // This file is only intended to be included by xap_helper.h
 // it implements pieces used by CodeSearchIdx.pm
 
+// TODO: consider making PublicInbox::CodeSearchIdx emit binary
+// (20 or 32-bit) OIDs instead of ASCII hex.  It would require
+// more code in both Perl and C++, though...
+
+// assumes trusted data from same host
+static inline unsigned int hex2uint(char c)
+{
+	switch (c) {
+	case '0' ... '9': return c - '0';
+	case 'a' ... 'f': return c - 'a' + 10;
+	default: return 0xff; // oh well...
+	}
+}
+
+// assumes trusted data from same host
+static kh_inline khint_t sha_hex_hash(const char *hex)
+{
+	khint_t ret = 0;
+
+	for (size_t shift = 32; shift; )
+		ret |= hex2uint(*hex++) << (shift -= 4);
+
+	return ret;
+}
+
+KHASHL_CMAP_INIT(KH_LOCAL, root2id_map, root2id,
+		const char *, const char *,
+		sha_hex_hash, kh_eq_str)
+
 static void term_length_extract(struct req *req)
 {
-	req->lenv = (size_t *)calloc(req->pfxc, sizeof(size_t));
-	if (!req->lenv)
-		EABORT("lenv = calloc(%d %zu)", req->pfxc, sizeof(size_t));
+	req->lenv = (size_t *)xcalloc(req->pfxc, sizeof(size_t));
 	for (int i = 0; i < req->pfxc; i++) {
 		char *pfx = req->pfxv[i];
 		// extract trailing digits as length:
@@ -101,6 +128,7 @@ struct dump_roots_tmp {
 	void *mm_ptr;
 	char **entries;
 	struct fbuf wbuf;
+	root2id_map *root2id;
 	int root2off_fd;
 };
 
@@ -110,7 +138,8 @@ static void dump_roots_ensure(void *ptr)
 	struct dump_roots_tmp *drt = (struct dump_roots_tmp *)ptr;
 	if (drt->root2off_fd >= 0)
 		xclose(drt->root2off_fd);
-	hdestroy(); // idempotent
+	if (drt->root2id)
+		root2id_cm_destroy(drt->root2id);
 	size_t size = off2size(drt->sb.st_size);
 	if (drt->mm_ptr && munmap(drt->mm_ptr, size))
 		EABORT("BUG: munmap(%p, %zu)", drt->mm_ptr, size);
@@ -118,23 +147,21 @@ static void dump_roots_ensure(void *ptr)
 	fbuf_ensure(&drt->wbuf);
 }
 
-static bool root2offs_str(struct fbuf *root_offs, Xapian::Document *doc)
+static bool root2offs_str(struct dump_roots_tmp *drt,
+			struct fbuf *root_offs, Xapian::Document *doc)
 {
 	Xapian::TermIterator cur = doc->termlist_begin();
 	Xapian::TermIterator end = doc->termlist_end();
-	ENTRY e, *ep;
 	fbuf_init(root_offs);
 	for (cur.skip_to("G"); cur != end; cur++) {
 		std::string tn = *cur;
 		if (!starts_with(&tn, "G", 1)) break;
-		union { const char *in; char *out; } u;
-		u.in = tn.c_str() + 1;
-		e.key = u.out;
-		ep = hsearch(e, FIND);
-		if (!ep) ABORT("hsearch miss `%s'", e.key);
-		// ep->data is a NUL-terminated string matching /[0-9]+/
+		khint_t i = root2id_get(drt->root2id, tn.c_str() + 1);
+		if (i >= kh_end(drt->root2id))
+			ABORT("kh get miss `%s'", tn.c_str() + 1);
 		fputc(' ', root_offs->fp);
-		fputs((const char *)ep->data, root_offs->fp);
+		// kh_val(...) is a NUL-terminated string matching /[0-9]+/
+		fputs(kh_val(drt->root2id, i), root_offs->fp);
 	}
 	fputc('\n', root_offs->fp);
 	ERR_CLOSE(root_offs->fp, EXIT_FAILURE); // ENOMEM
@@ -198,7 +225,7 @@ static enum exc_iter dump_roots_iter(struct req *req,
 	CLEANUP_FBUF struct fbuf root_offs = {}; // " $ID0 $ID1 $IDx..\n"
 	try {
 		Xapian::Document doc = i->get_document();
-		if (!root2offs_str(&root_offs, &doc))
+		if (!root2offs_str(drt, &root_offs, &doc))
 			return ITER_ABORT; // bad request, abort
 		for (int p = 0; p < req->pfxc; p++)
 			dump_roots_term(req, p, drt, &root_offs, &doc);
@@ -226,8 +253,7 @@ static bool cmd_dump_roots(struct req *req)
 	if (fstat(drt.root2off_fd, &drt.sb)) // ENOMEM?
 		err(EXIT_FAILURE, "fstat(%s)", root2off_file);
 	// each entry is at least 43 bytes ({OIDHEX}\0{INT}\0),
-	// so /32 overestimates the number of expected entries by
-	// ~%25 (as recommended by Linux hcreate(3) manpage)
+	// so /32 overestimates the number of expected entries
 	size_t size = off2size(drt.sb.st_size);
 	size_t est = (size / 32) + 1; //+1 for "\0" termination
 	drt.mm_ptr = mmap(NULL, size, PROT_READ,
@@ -236,20 +262,19 @@ static bool cmd_dump_roots(struct req *req)
 		err(EXIT_FAILURE, "mmap(%zu, %s)", size, root2off_file);
 	size_t asize = est * 2;
 	if (asize < est) ABORT("too many entries: %zu", est);
-	drt.entries = (char **)calloc(asize, sizeof(char *));
-	if (!drt.entries)
-		err(EXIT_FAILURE, "calloc(%zu * 2, %zu)", est, sizeof(char *));
+	drt.entries = (char **)xcalloc(asize, sizeof(char *));
 	size_t tot = split2argv(drt.entries, (char *)drt.mm_ptr, size, asize);
 	if (tot <= 0) return false; // split2argv already warned on error
-	if (!hcreate(est))
-		err(EXIT_FAILURE, "hcreate(%zu)", est);
+	drt.root2id = root2id_init();
+	root2id_cm_resize(drt.root2id, est);
 	for (size_t i = 0; i < tot; ) {
-		ENTRY e;
-		e.key = hsearch_enter_key(drt.entries[i++]); // dies on ENOMEM
-		e.data = drt.entries[i++];
-		if (!hsearch(e, ENTER))
-			err(EXIT_FAILURE, "hsearch(%s => %s, ENTER)", e.key,
-					(const char *)e.data);
+		int absent;
+		const char *key = drt.entries[i++];
+		khint_t k = root2id_put(drt.root2id, key, &absent);
+		if (!absent)
+			err(EXIT_FAILURE, "put(%s => %s, ENTER)",
+				key, drt.entries[i]);
+		kh_val(drt.root2id, k) = drt.entries[i++];
 	}
 	req->asc = true;
 	req->sort_col = -1;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 08/10] xap_helper.h: use xcalloc to simplify error checking
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
                   ` (5 preceding siblings ...)
  2024-05-16  0:12 ` [PATCH 07/10] xap_helper.h: use khashl.h instead of hsearch(3) Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 09/10] xap_helper.h: memoize Xapian handles with khashl Eric Wong
  2024-05-16  0:12 ` [PATCH 10/10] xap_helper: expire DB handles when FD table is near full Eric Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

Since we introduced xcalloc for khashl.h usage, we might as well
use it elsewhere.
---
 lib/PublicInbox/xap_helper.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 92d3d12f..05b3b8c9 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -1100,8 +1100,7 @@ int main(int argc, char *argv[])
 	CHECK(int, 0, sigdelset(&workerset, SIGTERM));
 	CHECK(int, 0, sigdelset(&workerset, SIGCHLD));
 	nworker_hwm = nworker;
-	worker_pids = (pid_t *)calloc(nworker, sizeof(pid_t));
-	if (!worker_pids) err(EXIT_FAILURE, "calloc");
+	worker_pids = (pid_t *)xcalloc(nworker, sizeof(pid_t));
 
 	if (pipe(pipefds)) err(EXIT_FAILURE, "pipe");
 	int fl = fcntl(pipefds[1], F_GETFL);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 09/10] xap_helper.h: memoize Xapian handles with khashl
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
                   ` (6 preceding siblings ...)
  2024-05-16  0:12 ` [PATCH 08/10] xap_helper.h: use xcalloc to simplify error checking Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  2024-05-16  0:12 ` [PATCH 10/10] xap_helper: expire DB handles when FD table is near full Eric Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

Since we're already using khashl in the C++ implementation,
get rid of tsearch(3) and friends as well.  Relying on hash
tables in both the Perl and C(++) implementation reduces
cognitive load for hackers.
---
 lib/PublicInbox/xap_helper.h | 86 ++++++++++++++++++++----------------
 1 file changed, 48 insertions(+), 38 deletions(-)

diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 05b3b8c9..44e0d63e 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -37,7 +37,6 @@
 #include <errno.h>
 #include <fcntl.h>
 #include <limits.h>
-#include <search.h>
 #include <signal.h>
 #include <stddef.h>
 #include <stdint.h>
@@ -116,6 +115,31 @@ static void *xreallocarray(void *ptr, size_t nmemb, size_t size)
 	return ret;
 }
 
+#include "khashl.h"
+
+struct srch {
+	int ckey_len; // int for comparisons
+	unsigned qp_flags;
+	bool qp_extra_done;
+	Xapian::Database *db;
+	Xapian::QueryParser *qp;
+	unsigned char ckey[]; // $shard_path0\0$shard_path1\0...
+};
+
+static khint_t srch_hash(const struct srch *srch)
+{
+	return kh_hash_bytes(srch->ckey_len, srch->ckey);
+}
+
+static int srch_eq(const struct srch *a, const struct srch *b)
+{
+	return a->ckey_len == b->ckey_len ?
+		!memcmp(a->ckey, b->ckey, (size_t)a->ckey_len) : 0;
+}
+
+KHASHL_CSET_INIT(KH_LOCAL, srch_set, srch_set, struct srch *,
+		srch_hash, srch_eq)
+static srch_set *srch_cache;
 // sock_fd is modified in signal handler, yes, it's SOCK_SEQPACKET
 static volatile int sock_fd = STDIN_FILENO;
 static sigset_t fullset, workerset;
@@ -124,7 +148,6 @@ static bool alive = true;
 static FILE *orig_err = stderr;
 #endif
 static int orig_err_fd = -1;
-static void *srch_tree; // tsearch + tdelete + twalk
 static pid_t *worker_pids; // nr => pid
 #define WORKER_MAX USHRT_MAX
 static unsigned long nworker, nworker_hwm;
@@ -144,15 +167,6 @@ enum exc_iter {
 	ITER_ABORT
 };
 
-struct srch {
-	int ckey_len; // int for comparisons
-	unsigned qp_flags;
-	bool qp_extra_done;
-	Xapian::Database *db;
-	Xapian::QueryParser *qp;
-	char ckey[]; // $shard_path0\0$shard_path1\0...
-};
-
 #define MY_ARG_MAX 256
 typedef bool (*cmd)(struct req *);
 
@@ -435,7 +449,6 @@ static bool cmd_test_sleep(struct req *req)
 	for (;;) poll(NULL, 0, 10);
 	return false;
 }
-#include "khashl.h"
 #include "xh_mset.h" // read-only (WWW, IMAP, lei) stuff
 #include "xh_cidx.h" // CodeSearchIdx.pm stuff
 
@@ -526,15 +539,6 @@ again:
 	return false;
 }
 
-static int srch_cmp(const void *pa, const void *pb) // for tfind|tsearch
-{
-	const struct srch *a = (const struct srch *)pa;
-	const struct srch *b = (const struct srch *)pb;
-	int diff = a->ckey_len - b->ckey_len;
-
-	return diff ? diff : memcmp(a->ckey, b->ckey, (size_t)a->ckey_len);
-}
-
 static bool is_chert(const char *dir)
 {
 	char iamchert[PATH_MAX];
@@ -625,9 +629,8 @@ static void srch_init_extra(struct req *req)
 	req->srch->qp_extra_done = true;
 }
 
-static void free_srch(void *p) // tdestroy
+static void srch_free(struct srch *srch)
 {
-	struct srch *srch = (struct srch *)p;
 	delete srch->qp;
 	delete srch->db;
 	free(srch);
@@ -643,7 +646,6 @@ static void dispatch(struct req *req)
 	} kbuf;
 	char *end;
 	FILE *kfp;
-	struct srch **s;
 	req->threadid = ULLONG_MAX;
 	for (c = 0; c < (int)MY_ARRAY_SIZE(cmds); c++) {
 		if (cmds[c].fn_len == size &&
@@ -725,18 +727,19 @@ static void dispatch(struct req *req)
 	kbuf.srch->ckey_len = size - offsetof(struct srch, ckey);
 	if (kbuf.srch->ckey_len <= 0 || !req->dirc)
 		ABORT("no -d args (or too many)");
-	s = (struct srch **)tsearch(kbuf.srch, &srch_tree, srch_cmp);
-	if (!s) err(EXIT_FAILURE, "tsearch"); // likely ENOMEM
-	req->srch = *s;
-	if (req->srch != kbuf.srch) { // reuse existing
-		free_srch(kbuf.srch);
+
+	int absent;
+	khint_t ki = srch_set_put(srch_cache, kbuf.srch, &absent);
+	assert(ki < kh_end(srch_cache));
+	req->srch = kh_key(srch_cache, ki);
+	if (!absent) { // reuse existing
+		assert(req->srch != kbuf.srch);
+		srch_free(kbuf.srch);
 		req->srch->db->reopen();
 	} else if (!srch_init(req)) {
-		assert(kbuf.srch == *((struct srch **)tfind(
-					kbuf.srch, &srch_tree, srch_cmp)));
-		void *del = tdelete(kbuf.srch, &srch_tree, srch_cmp);
-		assert(del);
-		free_srch(kbuf.srch);
+		int gone = srch_set_del(srch_cache, ki);
+		assert(gone);
+		srch_free(kbuf.srch);
 		goto cmd_err; // srch_init already warned
 	}
 	if (req->qpfxc && !req->srch->qp_extra_done)
@@ -895,10 +898,16 @@ static void start_workers(void)
 static void cleanup_all(void)
 {
 	cleanup_pids();
-#ifdef __GLIBC__
-	tdestroy(srch_tree, free_srch);
-	srch_tree = NULL;
-#endif
+	if (!srch_cache)
+		return;
+
+	khint_t k;
+	for (k = kh_begin(srch_cache); k != kh_end(srch_cache); k++) {
+		if (kh_exist(srch_cache, k))
+			srch_free(kh_key(srch_cache, k));
+	}
+	srch_set_destroy(srch_cache);
+	srch_cache = NULL;
 }
 
 static void parent_reopen_logs(void)
@@ -1040,6 +1049,7 @@ int main(int argc, char *argv[])
 
 	mail_nrp_init();
 	code_nrp_init();
+	srch_cache = srch_set_init();
 	atexit(cleanup_all);
 
 	if (!STDERR_ASSIGNABLE) {

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 10/10] xap_helper: expire DB handles when FD table is near full
  2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
                   ` (7 preceding siblings ...)
  2024-05-16  0:12 ` [PATCH 09/10] xap_helper.h: memoize Xapian handles with khashl Eric Wong
@ 2024-05-16  0:12 ` Eric Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2024-05-16  0:12 UTC (permalink / raw)
  To: spew

For long-lived daemons across config reloads, we shouldn't keep
Xapian DBs open forever under FD pressure.  So estimate the
number of FDs we need per-shard and start clearing some out
if we have too many open.

While we're at it, hoist out our ulimit_n helper and share it
across extindex and the Perl XapHelper implementation.
---
 lib/PublicInbox/ExtSearchIdx.pm |  8 +----
 lib/PublicInbox/Search.pm       | 16 +++++++++
 lib/PublicInbox/XapHelper.pm    | 18 +++++++---
 lib/PublicInbox/XapHelperCxx.pm |  1 +
 lib/PublicInbox/xap_helper.h    | 58 +++++++++++++++++++++++++--------
 t/xap_helper.t                  | 23 +++++++++++++
 6 files changed, 100 insertions(+), 24 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 883dbea3..934197c0 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -543,13 +543,7 @@ sub _ibx_for ($$$) {
 sub _fd_constrained ($) {
 	my ($self) = @_;
 	$self->{-fd_constrained} //= do {
-		my $soft;
-		if (eval { require BSD::Resource; 1 }) {
-			my $NOFILE = BSD::Resource::RLIMIT_NOFILE();
-			($soft, undef) = BSD::Resource::getrlimit($NOFILE);
-		} else {
-			chomp($soft = `sh -c 'ulimit -n'`);
-		}
+		my $soft = PublicInbox::Search::ulimit_n;
 		if (defined($soft)) {
 			# $want is an estimate
 			my $want = scalar(@{$self->{ibx_active}}) + 64;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index ae696735..cdf25172 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -54,6 +54,9 @@ use constant {
 	#
 	#      v1.6.0 adds BYTES, UID and THREADID values
 	SCHEMA_VERSION => 15,
+
+	# we may have up to 8 FDs per shard (depends on Xapian *shrug*)
+	SHARD_COST => 8,
 };
 
 use PublicInbox::Smsg;
@@ -748,4 +751,17 @@ sub get_doc ($$) {
 	}
 }
 
+# not sure where best to put this...
+sub ulimit_n () {
+	my $n;
+	if (eval { require BSD::Resource; 1 }) {
+		my $NOFILE = BSD::Resource::RLIMIT_NOFILE();
+		($n, undef) = BSD::Resource::getrlimit($NOFILE);
+	} else {
+		require PublicInbox::Spawn;
+		$n = PublicInbox::Spawn::run_qx([qw(/bin/sh -c), 'ulimit -n']);
+	}
+	$n;
+}
+
 1;
diff --git a/lib/PublicInbox/XapHelper.pm b/lib/PublicInbox/XapHelper.pm
index f1311bd4..db9e99ae 100644
--- a/lib/PublicInbox/XapHelper.pm
+++ b/lib/PublicInbox/XapHelper.pm
@@ -18,7 +18,7 @@ use POSIX qw(:signal_h);
 use Fcntl qw(LOCK_UN LOCK_EX);
 use Carp qw(croak);
 my $X = \%PublicInbox::Search::X;
-our (%SRCH, %WORKERS, $nworker, $workerset, $in);
+our (%SRCH, %WORKERS, $nworker, $workerset, $in, $SHARD_NFD, $MY_FD_MAX);
 our $stderr = \*STDERR;
 
 sub cmd_test_inspect {
@@ -193,8 +193,14 @@ sub dispatch {
 	my $key = "-d\0".join("\0-d\0", @$dirs);
 	$key .= "\0".join("\0", map { ('-Q', $_) } @{$req->{Q}}) if $req->{Q};
 	my $new;
-	$req->{srch} = $SRCH{$key} //= do {
+	$req->{srch} = $SRCH{$key} // do {
 		$new = { qp_flags => $PublicInbox::Search::QP_FLAGS };
+		my $nfd = scalar(@$dirs) * PublicInbox::Search::SHARD_COST;
+		$SHARD_NFD += $nfd;
+		if ($SHARD_NFD > $MY_FD_MAX) {
+			$SHARD_NFD = $nfd;
+			%SRCH = ();
+		}
 		my $first = shift @$dirs;
 		my $slow_phrase = -f "$first/iamchert";
 		$new->{xdb} = $X->{Database}->new($first);
@@ -207,7 +213,7 @@ sub dispatch {
 		bless $new, $req->{c} ? 'PublicInbox::CodeSearch' :
 					'PublicInbox::Search';
 		$new->{qp} = $new->qparse_new;
-		$new;
+		$SRCH{$key} = $new;
 	};
 	$req->{srch}->{xdb}->reopen unless $new;
 	$req->{Q} && !$req->{srch}->{qp_extra_done} and
@@ -305,7 +311,7 @@ sub start (@) {
 	my $c = getsockopt(local $in = \*STDIN, SOL_SOCKET, SO_TYPE);
 	unpack('i', $c) == SOCK_SEQPACKET or die 'stdin is not SOCK_SEQPACKET';
 
-	local (%SRCH, %WORKERS);
+	local (%SRCH, %WORKERS, $SHARD_NFD, $MY_FD_MAX);
 	PublicInbox::Search::load_xapian();
 	$GLP->getoptionsfromarray(\@argv, my $opt = { j => 1 }, 'j=i') or
 		die 'bad args';
@@ -314,6 +320,10 @@ sub start (@) {
 	for (@PublicInbox::DS::UNBLOCKABLE, POSIX::SIGUSR1) {
 		$workerset->delset($_) or die "delset($_): $!";
 	}
+	$MY_FD_MAX = PublicInbox::Search::ulimit_n //
+		die "E: unable to get RLIMIT_NOFILE: $!";
+	warn "W: RLIMIT_NOFILE=$MY_FD_MAX too low\n" if $MY_FD_MAX < 72;
+	$MY_FD_MAX -= 64;
 
 	local $nworker = $opt->{j};
 	return recv_loop() if $nworker == 0;
diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index 74852ad1..922bd583 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -34,6 +34,7 @@ my $ldflags = '-Wl,-O1';
 $ldflags .= ' -Wl,--compress-debug-sections=zlib' if $^O ne 'openbsd';
 my $xflags = ($ENV{CXXFLAGS} // '-Wall -ggdb3 -pipe') . ' ' .
 	' -DTHREADID=' . PublicInbox::Search::THREADID .
+	' -DSHARD_COST=' . PublicInbox::Search::SHARD_COST .
 	' -DXH_SPEC="'.join('',
 		map { s/=.*/:/; $_ } @PublicInbox::Search::XH_SPEC) . '" ' .
 	($ENV{LDFLAGS} // $ldflags);
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 44e0d63e..c71ac06d 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -140,6 +140,7 @@ static int srch_eq(const struct srch *a, const struct srch *b)
 KHASHL_CSET_INIT(KH_LOCAL, srch_set, srch_set, struct srch *,
 		srch_hash, srch_eq)
 static srch_set *srch_cache;
+static long my_fd_max, shard_nfd;
 // sock_fd is modified in signal handler, yes, it's SOCK_SEQPACKET
 static volatile int sock_fd = STDIN_FILENO;
 static sigset_t fullset, workerset;
@@ -552,6 +553,34 @@ static bool is_chert(const char *dir)
 	return false;
 }
 
+static void srch_free(struct srch *srch)
+{
+	delete srch->qp;
+	delete srch->db;
+	free(srch);
+}
+
+static void srch_cache_renew(struct srch *keep)
+{
+	khint_t k;
+
+	// can't delete while iterating, so just free each + clear
+	for (k = kh_begin(srch_cache); k != kh_end(srch_cache); k++) {
+		if (!kh_exist(srch_cache, k)) continue;
+		struct srch *cur = kh_key(srch_cache, k);
+
+		if (cur != keep)
+			srch_free(cur);
+	}
+	srch_set_cs_clear(srch_cache);
+	if (keep) {
+		int absent;
+		k = srch_set_put(srch_cache, keep, &absent);
+		assert(absent);
+		assert(k < kh_end(srch_cache));
+	}
+}
+
 static bool srch_init(struct req *req)
 {
 	int i;
@@ -563,6 +592,13 @@ static bool srch_init(struct req *req)
 			Xapian::QueryParser::FLAG_WILDCARD;
 	if (is_chert(req->dirv[0]))
 		srch->qp_flags &= ~FLAG_PHRASE;
+	long nfd = req->dirc * SHARD_COST;
+
+	shard_nfd += nfd;
+	if (shard_nfd > my_fd_max) {
+		srch_cache_renew(srch);
+		shard_nfd = nfd;
+	}
 	try {
 		srch->db = new Xapian::Database(req->dirv[0]);
 	} catch (...) {
@@ -629,13 +665,6 @@ static void srch_init_extra(struct req *req)
 	req->srch->qp_extra_done = true;
 }
 
-static void srch_free(struct srch *srch)
-{
-	delete srch->qp;
-	delete srch->db;
-	free(srch);
-}
-
 static void dispatch(struct req *req)
 {
 	int c;
@@ -900,12 +929,7 @@ static void cleanup_all(void)
 	cleanup_pids();
 	if (!srch_cache)
 		return;
-
-	khint_t k;
-	for (k = kh_begin(srch_cache); k != kh_end(srch_cache); k++) {
-		if (kh_exist(srch_cache, k))
-			srch_free(kh_key(srch_cache, k));
-	}
+	srch_cache_renew(NULL);
 	srch_set_destroy(srch_cache);
 	srch_cache = NULL;
 }
@@ -1041,12 +1065,20 @@ int main(int argc, char *argv[])
 	socklen_t slen = (socklen_t)sizeof(c);
 	stdout_path = getenv("STDOUT_PATH");
 	stderr_path = getenv("STDERR_PATH");
+	struct rlimit rl;
 
 	if (getsockopt(sock_fd, SOL_SOCKET, SO_TYPE, &c, &slen))
 		err(EXIT_FAILURE, "getsockopt");
 	if (c != SOCK_SEQPACKET)
 		errx(EXIT_FAILURE, "stdin is not SOCK_SEQPACKET");
 
+	if (getrlimit(RLIMIT_NOFILE, &rl))
+		err(EXIT_FAILURE, "getrlimit");
+	my_fd_max = rl.rlim_cur;
+	if (my_fd_max < 72)
+		warnx("W: RLIMIT_NOFILE=%ld too low\n", my_fd_max);
+	my_fd_max -= 64;
+
 	mail_nrp_init();
 	code_nrp_init();
 	srch_cache = srch_set_init();
diff --git a/t/xap_helper.t b/t/xap_helper.t
index 78be8539..b0fa75a2 100644
--- a/t/xap_helper.t
+++ b/t/xap_helper.t
@@ -284,4 +284,27 @@ for my $n (@NO_CXX) {
 	}
 }
 
+SKIP: {
+	my $nr = $ENV{TEST_XH_FDMAX} or
+		skip 'TEST_XH_FDMAX unset', 1;
+	my @xhc = map {
+		local $ENV{PI_NO_CXX} = $_;
+		PublicInbox::XapClient::start_helper('-j0');
+	} @NO_CXX;
+	my $n = 1;
+	my $exp;
+	for (0..(PublicInbox::Search::ulimit_n() * $nr)) {
+		for my $xhc (@xhc) {
+			my $r = $xhc->mkreq([], qw(mset -Q), "tst$n=XTST$n",
+					@ibx_shard_args, qw(rt:0..));
+			chomp(my @res = readline($r));
+			$exp //= $res[0];
+			$exp eq $res[0] or
+				is $exp, $res[0], "mset mismatch on n=$n";
+			++$n;
+		}
+	}
+	ok $exp, "got expected entries ($n)";
+}
+
 done_testing;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-05-16  0:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-16  0:12 [PATCH 01/10] search: support per-inbox indexheader directive Eric Wong
2024-05-16  0:12 ` [PATCH 02/10] indexheader: deduplicate common values Eric Wong
2024-05-16  0:12 ` [PATCH 03/10] search: help: avoid ':' in user prefixes Eric Wong
2024-05-16  0:12 ` [PATCH 04/10] config: dedupe ibx->{newsgroup} Eric Wong
2024-05-16  0:12 ` [PATCH 05/10] search: move QueryParser mappings to xh_args Eric Wong
2024-05-16  0:12 ` [PATCH 06/10] xap_helper: key search instances by -Q params, too Eric Wong
2024-05-16  0:12 ` [PATCH 07/10] xap_helper.h: use khashl.h instead of hsearch(3) Eric Wong
2024-05-16  0:12 ` [PATCH 08/10] xap_helper.h: use xcalloc to simplify error checking Eric Wong
2024-05-16  0:12 ` [PATCH 09/10] xap_helper.h: memoize Xapian handles with khashl Eric Wong
2024-05-16  0:12 ` [PATCH 10/10] xap_helper: expire DB handles when FD table is near full Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).