public-inbox.git - an "archives first" approach to mailing lists

author	Eric Wong <e@80x24.org>	2023-08-24 01:22:34 +0000
committer	Eric Wong <e@80x24.org>	2023-08-24 07:47:52 +0000
commit	1c8430a7fa407e476ef70a6a199983faf071d7a5 (patch)
tree	610e37eb535d08b932f16ed02b82d03a01461bed /MANIFEST
parent	b18ecb7707e83cb8cb38c3736aecd984999ca0a7 (diff)
download	public-inbox-1c8430a7fa407e476ef70a6a199983faf071d7a5.tar.gz

cindex: fix sorting and uniqueness

We can't rely on combining the `-u' and `-k1,1' switches of POSIX
sort(1) to do what we want.  So only rely on `sort -k1,1' while
introducing a small Perl helper to fold identical prefixes into
one line.  In other words, input such as:

  deadbeef 0
  deadbeef 1
  deadbeef 2

Was getting deduplicated into a single line:

  deadbeef 0

... with `sort -u -k1,1'.  This change puts the output into a
more optimal form for eventual (not-fully-implemented-yet)
parsing:

  deadbeef 0,1,2

ORS is currently the comma (`,') for inbox IDs, but it'll be a
space (` ') for coderepo root IDs.  This implementation also
combines identical IDs in the 2nd column.  Thus:

  deadbeef 0
  deadbeef 0

Becomes a single `deadbeef 0' line thanks to the use of
XS List::Util::uniq (which beats a pure Perl hash).
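
The folding described above can be sketched roughly as follows.
This is an illustrative Python version, not the Perl helper from
this commit; the function name `fold_sorted' and its `ors'
parameter are hypothetical.  It assumes the input is already sorted
on the first column (as `sort -k1,1' would leave it) and that every
line has exactly two whitespace-separated columns.

```python
from itertools import groupby


def fold_sorted(lines, ors=","):
    """Fold sorted "PREFIX ID" lines that share a prefix into one line.

    Hypothetical helper for illustration; assumes two columns per line
    and input already sorted on the first column.
    """
    pairs = (line.split(None, 1) for line in lines if line.strip())
    for prefix, group in groupby(pairs, key=lambda p: p[0]):
        # dict.fromkeys drops repeated IDs while keeping first-seen order
        ids = dict.fromkeys(p[1].strip() for p in group)
        yield f"{prefix} {ors.join(ids)}"


if __name__ == "__main__":
    for out in fold_sorted(["deadbeef 0", "deadbeef 1", "deadbeef 2"]):
        print(out)
```

Passing `ors=" "' would give the space-separated form mentioned for
coderepo root IDs.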

I attempted to implement this in awk but Perl is close enough to
gawk in performance while being shorter and easier-to-understand
due to List::Util::uniq.  mawk was faster, but still not enough
to matter as the bottleneck is from iterating through Xapian
MSets.

Diffstat (limited to 'MANIFEST')

0 files changed, 0 insertions, 0 deletions

