path: root/lib/PublicInbox/LeiDedupe.pm
Date Commit message
2023-06-15 lei: make --dedupe=content always account for Message-IDs
The content dedupe logic was originally designed for v2 public inboxes as a fallback for when the importer sees identical Message-IDs. Thus it did not account for Message-ID(s) in the message itself.
This change doesn't affect saved searches (the default when writing to a pathname or IMAP). It affects --no-save and output to stdout (even if stdout is redirected to a file).
Prior to this change, lei reused the v2 logic as-is without accounting for Message-IDs anywhere with `--dedupe=content' (the default). This could cause messages to be skipped when the content matches despite Message-IDs being different.
So with this change, `lei q --dedupe=content' will hash the Message-ID(s) in the message to ensure messages with different Message-IDs are NOT deduplicated.
Whether this change is a bug fix or introduces a regression is actually debatable. In my mind, it is better to err on the side of showing too many messages rather than too few, even if the actual contents of the message are identical. Making saved searches deduplicate without accounting for Message-IDs would be more difficult, too.
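The effect of folding Message-IDs into the content hash can be sketched as follows. This is a minimal Python illustration, not the ContentHash code in public-inbox: the function name `content_key`, the `include_mid` switch, and the choice of hashed headers (Subject, From, Date) are assumptions made only for this example.

```python
import hashlib
from email import message_from_string

# Hypothetical header selection; the real content hash may use other fields.
HASHED_HEADERS = ("Subject", "From", "Date")

def content_key(raw: str, include_mid: bool = True) -> str:
    """Return a hex digest identifying a message for deduplication."""
    msg = message_from_string(raw)
    h = hashlib.sha256()
    if include_mid:
        # Hash every Message-ID so differing IDs never collapse together.
        for mid in msg.get_all("Message-ID", []):
            h.update(b"mid\0" + mid.strip().encode() + b"\0")
    for name in HASHED_HEADERS:
        for value in msg.get_all(name, []):
            h.update(name.lower().encode() + b"\0" + value.strip().encode() + b"\0")
    # Single-part text messages only: get_payload() returns the body as str.
    h.update(b"body\0" + msg.get_payload().encode())
    return h.hexdigest()

def make_msg(mid: str) -> str:
    """Build a tiny message whose content is identical apart from its ID."""
    return f"Message-ID: {mid}\nSubject: hello\nFrom: a@example.com\n\nsame body\n"

A = make_msg("<1@example.com>")
B = make_msg("<2@example.com>")
```

With `include_mid=False` the two messages share one key and the second would be skipped. With the default they get distinct keys, so both are shown.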
2023-03-12 lei_dedupe: simplify smsg_hash sub
We can just use the sha256() sub instead of dealing with the OO interface for a small string.
2023-01-30 use Net::SSLeay (OpenSSL) for SHA-(1|256) if installed
On my x86-64 machine, OpenSSL SHA-256 is nearly twice as fast as the Digest::SHA implementation from Perl, most likely due to an optimized assembly implementation. SHA-1 is a few percent faster, too.
2021-07-25 lei: avoid SQLite COUNT() for dedupe
SQLite COUNT() is a slow operation that does a full table scan with no conditions. There's no need for it, since lei dedupe only needs to know if it's empty or not to decide between new/ and cur/ for Maildir outputs.
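The difference between counting rows and merely testing for emptiness can be sketched in Python with the standard `sqlite3` module. The table name `dedupe` and both function names are hypothetical, chosen only for this illustration.

```python
import sqlite3

def is_empty(conn: sqlite3.Connection, table: str = "dedupe") -> bool:
    """Stop at the first row found instead of scanning the whole table."""
    # `table` must be a trusted identifier; it is interpolated, not bound.
    return conn.execute(f"SELECT 1 FROM {table} LIMIT 1").fetchone() is None

def is_empty_slow(conn: sqlite3.Connection, table: str = "dedupe") -> bool:
    """Same answer via COUNT(*), which has to visit every row."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0] == 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dedupe (k BLOB PRIMARY KEY)")
empty_before = is_empty(conn)
conn.execute("INSERT INTO dedupe (k) VALUES (?)", (b"\x01",))
empty_after = is_empty(conn)
```

Both functions return the same boolean, so the cheaper query is enough when the only question is whether the output already held messages.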
2021-05-23 lei <q|up>: set \Recent on non-empty mbox and Maildir
Despite JMAP not supporting the equivalent of the IMAP \Recent flag, it is useful for "lei q --augment", and "lei up" users to be able to distinguish new results from old-but-unread messages in an mbox or Maildir.
For mbox family messages, we'll drop the "O" status flag when appending to mboxes, and we'll write to the "new" subdirectory of Maildirs.
Behavior when writing to initially empty Maildirs and mboxes remains unchanged since there's no need to distinguish between new and old results in the initial case. Having users wait for a rename(2) storm or complete mbox rewrite hurts UX.
With IMAP mailboxes, \Recent is already enforced by the IMAP server and IMAP clients have no way of changing it(*)
(*) mutt uses the "Old" IMAP flag which isn't part of RFC 3501, other MUAs may do similar things.
2021-04-13 lei_dedupe: adjust to prepare for saved searches
LeiSavedSearch will use a LeiDedupe-like internal API, so we won't have to make as many changes to callsites between saved and unsaved searches.
2021-03-21 lei q: fix warning on remote imports
This will let us tie keywords from remote externals to those which only exist in local externals.
2021-02-18 lei convert: mail format conversion sub-command
This will make testing IMAP support for other commands easier, as it doesn't write to lei/store at all. Like the pager and MUA, "git credential" is always spawned by script/lei (and not lei-daemon) so it has a controlling terminal for password prompts.
v2: fix missing requires, correct test ordering
v3: ensure config exists for IMAP auth
2021-02-04 tests: guard against missing DBD::SQLite
The features we use for SharedKV could probably be implemented with GDBM_File or SDBM_File, but that doesn't seem worth it at the moment since we depend on SQLite elsewhere.
2021-02-01 sharedkv: lock and explicitly disconnect {dbh}
It may be possible for updates or changes to be uncommitted until disconnect, so we'll use flock() as we do elsewhere to avoid the polling retry behavior of SQLite. We also need to clear CachedKids before disconnecting to avoid warnings like:
  ->disconnect invalidates 1 active statement handle (either destroy statement handles or call finish on them before disconnecting)
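The lock-then-disconnect pattern can be sketched in Python as below. It is only an analogy: the helper name `locked_disconnect` and the `.flock` lock-file suffix are hypothetical, and Python's `sqlite3` has no counterpart to DBI's CachedKids, so that part of the commit is not modelled. `fcntl.flock` is available on Unix-like systems only.

```python
import fcntl
import os
import sqlite3
import tempfile

def locked_disconnect(conn: sqlite3.Connection, lock_path: str) -> None:
    """Commit and close while holding an exclusive advisory lock."""
    with open(lock_path, "a") as lock_fh:
        # Blocks in the kernel until the lock is free; no polling or retries.
        fcntl.flock(lock_fh, fcntl.LOCK_EX)
        try:
            conn.commit()   # flush anything still pending
            conn.close()    # explicit disconnect under the lock
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)

tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "kv.sqlite3")
lock_path = db_path + ".flock"   # hypothetical lock-file name

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('a', 'b')")
locked_disconnect(conn, lock_path)

# A fresh connection sees the row, so nothing was left uncommitted.
rows = sqlite3.connect(db_path).execute("SELECT k, v FROM kv").fetchall()
```

Every writer has to take the same lock file for this to help. flock() is advisory and does nothing against processes that skip it.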
2021-02-01 lei_dedupe: use Digest::SHA
While it's loaded by ContentHash, we use Digest::SHA directly in this package for smsg and OID-only deduplication.
2021-02-01 lei: more consistent dedupe and ovv_buf init
This fixes "--dedupe none" with Maildir where we don't create the object at all.
2021-01-18 lei: q: results output to Maildir and mbox* working
All the augment and deduplication stuff seems to be working based on unit tests. OpPipe is a nice general addition that will probably make future state machines easier.
2021-01-14 lei_dedupe+shared_kv: ensure round-tripping serialization
We'll be passing these objects via PublicInbox::IPC which uses Storable (or Sereal), so ensure they're safe to use after serialization.
2021-01-12 lei q: deduplicate smsg
We don't want duplicate messages in results overviews, either.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01 lei_to_mail: support Maildir, fix+test --augment
Maildir should be plenty fine for short-lived output folders.
2021-01-01 lei: implement various deduplication strategies
For writing mboxes and Maildirs, users may wish to use stricter or looser deduplication strategies. This gives them more control.
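How stricter and looser strategies differ can be sketched in Python. Only the strategy names that appear in the commit messages on this page ("content", "oid", "none") are used. The class `Dedupe`, the helper names, and especially `body_key` are hypothetical: `body_key` is a deliberately simplified stand-in that ignores all headers, not the content hash lei actually uses.

```python
import hashlib

def oid_key(raw: bytes) -> str:
    """git blob OID (SHA-1): strictest, any byte difference changes the key."""
    return hashlib.sha1(b"blob %d\0" % len(raw) + raw).hexdigest()

def body_key(raw: bytes) -> str:
    """Simplified stand-in for a content hash: body only, headers ignored."""
    _headers, _sep, body = raw.partition(b"\n\n")
    return hashlib.sha256(body).hexdigest()

# Strategy name -> key function; None disables deduplication entirely.
KEY_FUNCS = {"oid": oid_key, "content": body_key, "none": None}

class Dedupe:
    def __init__(self, strategy: str = "content"):
        self.key_func = KEY_FUNCS[strategy]
        self.seen = set()

    def is_dup(self, raw: bytes) -> bool:
        """Return True if an equivalent message was already seen."""
        if self.key_func is None:
            return False
        key = self.key_func(raw)
        if key in self.seen:
            return True
        self.seen.add(key)
        return False

def count_kept(strategy: str, messages) -> int:
    """Number of messages that survive deduplication."""
    dd = Dedupe(strategy)
    return sum(1 for m in messages if not dd.is_dup(m))

# Two copies of one mail that differ only in a trace header.
M1 = b"Received: by mx1\nSubject: hi\n\nbody\n"
M2 = b"Received: by mx2\nSubject: hi\n\nbody\n"
```

For the input `[M1, M2, M1]`, the "oid" strategy keeps both variants and drops only the exact repeat, the body-only stand-in keeps one message, and "none" keeps all three.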