path: root/lib/PublicInbox/ContentHash.pm
Date Commit message
2023-06-15 lei: make --dedupe=content always account for Message-IDs
The content dedupe logic was originally designed for v2 public inboxes as a fallback for when the importer sees identical Message-IDs. Thus it did not account for Message-ID(s) in the message itself. This change doesn't affect saved searches (the default when writing to a pathname or IMAP). It affects --no-save and output to stdout (even if stdout is redirected to a file). Prior to this change, lei reused the v2 logic as-is, without accounting for Message-IDs anywhere, with `--dedupe=content' (the default). This could cause messages to be skipped when the content matched despite the Message-IDs being different. So with this change, `lei q --dedupe=content' will hash the Message-ID(s) in the message to ensure messages with different Message-IDs are NOT deduplicated. Whether this change is a bug fix or introduces a regression is actually debatable. In my mind, it is better to err on the side of showing too many messages rather than too few, even if the actual contents of the messages are identical. Making saved searches deduplicate without accounting for Message-IDs would be more difficult, too.
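The effect described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the ContentHash.pm implementation: the function name `content_hash`, the `include_mids` flag, the header list, and the field separators are all hypothetical, and Python is used here purely for illustration.

```python
import hashlib
from email import message_from_bytes

def content_hash(raw: bytes, include_mids: bool = True) -> str:
    """Illustrative only: digest a few headers and the body of one message."""
    msg = message_from_bytes(raw)
    dig = hashlib.sha256()
    # Hypothetical header selection; the real module hashes its own set.
    names = ["Subject", "From", "To", "Cc"]
    if include_mids:
        # Hashing Message-ID(s) keeps messages with different IDs distinct.
        names.insert(0, "Message-ID")
    for name in names:
        for val in msg.get_all(name, []):
            octets = str(val).strip().encode("utf-8", "replace")
            dig.update(name.encode() + b"\0" + octets + b"\0")
    body = msg.get_payload(decode=True) or b""  # b"" for multipart messages
    dig.update(b"body\0" + body)
    return dig.hexdigest()

a = b"Message-ID: <a@example>\nSubject: hi\nFrom: x@example.com\n\nbody\n"
b = b"Message-ID: <b@example>\nSubject: hi\nFrom: x@example.com\n\nbody\n"
# With Message-IDs hashed, the two messages are no longer treated as duplicates.
print(content_hash(a) != content_hash(b))
# Without them, identical content collapses to a single digest.
print(content_hash(a, include_mids=False) == content_hash(b, include_mids=False))
```

Passing `include_mids=False` in this sketch corresponds to the older behavior, where identical content with different Message-IDs would be skipped as a duplicate.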
2023-04-25 mail_diff: match ContentHash EOL and EOM behavior more closely
ContentHash currently doesn't convert CRCRLF to LF. Perhaps it should, but for now, have diff behavior match the actual comparison behavior used for dedupe and omit all trailing whitespace for diff.
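A rough sketch of the normalization described above, assuming it amounts to converting CRLF to LF and dropping trailing whitespace at the end of the message. The helper name `normalize_body` is hypothetical and the exact rules in ContentHash.pm may differ.

```python
def normalize_body(body: bytes) -> bytes:
    """Illustrative only: CRLF -> LF, then drop trailing whitespace at EOM."""
    # Only the CRLF pair is rewritten, so CRCRLF becomes CRLF, not LF.
    return body.replace(b"\r\n", b"\n").rstrip()

print(normalize_body(b"a\r\nb\r\n\r\n"))  # trailing blank lines are dropped
print(normalize_body(b"a\r\r\nb"))        # one CR survives the conversion
```

The second call shows the CRCRLF case mentioned in the commit message: a single pass of CRLF replacement leaves a stray CR behind.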
2023-04-25 mid+contenthash: eliminate needless local variable captures
It's possible in theory that Perl could be smarter and free memory a tad sooner this way. Regardless, fewer lines of code are easier to navigate and read, and can save optree size and reduce parsing times.
2023-01-30 use Net::SSLeay (OpenSSL) for SHA-(1|256) if installed
On my x86-64 machine, OpenSSL SHA-256 is nearly twice as fast as the Digest::SHA implementation from Perl, most likely due to an optimized assembly implementation. SHA-1 is a few percent faster, too.
2022-11-27 content_hash: handle References as octets
The alsa-devel archives on lore have some UTF-8 References: headers, so we need to treat them as octets again; otherwise (re)indexing triggers cascading failures.
Fixes: 5198c976ce8b "eml: header_raw converts octets to Perl UTF-8"
2021-10-02 content_hash: normalize whitespace before hashing addresses
This should prevent some false duplicates. I only noticed this once I implemented the ContentDigestDbg wrapper while working on "lei mail-diff".
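As a minimal sketch of the idea, assuming the normalization is a simple collapse of whitespace runs (including header folding) to single spaces: the helper name `normalize_addr_header` is hypothetical and the rule actually used in ContentHash.pm may be stricter or looser.

```python
import hashlib
import re

def normalize_addr_header(value: str) -> str:
    """Illustrative only: collapse whitespace runs so folding doesn't matter."""
    return re.sub(r"\s+", " ", value).strip()

folded = "A  <a@example.com>,\n\tB <b@example.com>"
flat = "A <a@example.com>, B <b@example.com>"
# Both spellings of the same address list now feed identical octets to the digest.
same = (hashlib.sha256(normalize_addr_header(folded).encode()).digest()
        == hashlib.sha256(normalize_addr_header(flat).encode()).digest())
print(same)
```

Without a step like this, two copies of a message that differ only in how a relay folded the To: or Cc: line would hash differently.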
2021-10-02 lei mail-diff: diagnostic command to diff mail contents
This is useful in finding the cause of deduplication bugs, and possibly the cause of missing threads reported by Konstantin in <20211001130527.z7eivotlgqbgetzz@meerkat.local>
usage:
	u=https://yhbt.net/lore/all/87czop5j33.fsf@tynnyri.adurom.net/raw
	lei mail-diff $u
2021-04-30 content_hash: git_sha: allow unblessed SCALAR refs
This will be convenient to avoid the overhead of PublicInbox::Eml for verifying synchronization in lei.
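For readers unfamiliar with the digest involved: a git blob ID is the hash of the header `blob <length>\0` followed by the raw octets, so it can be computed straight from a buffer without parsing the mail. The sketch below assumes git_sha computes such a blob ID; the function name `git_blob_sha` and its signature are hypothetical and only mirror the idea of accepting either raw octets or a parsed message.

```python
import hashlib
from email.message import Message
from typing import Union

def git_blob_sha(raw: Union[bytes, Message], algo: str = "sha1") -> str:
    """Illustrative only: git blob object ID of raw octets or a parsed message."""
    # Raw bytes are used as-is, skipping the cost of a message object.
    data = raw if isinstance(raw, bytes) else raw.as_bytes()
    dig = hashlib.new(algo)
    dig.update(b"blob %d\0" % len(data))  # git's blob object header
    dig.update(data)
    return dig.hexdigest()

# The well-known ID of the empty blob in SHA-1 repositories:
print(git_blob_sha(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```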
2021-03-21 lei q: fix warning on remote imports
This will let us tie keywords from remote externals to those which only exist in local externals.
2021-01-31 content_hash: skip Sender for cross posted messages
This fixes a regression introduced long ago and matches the behavior originally specified in the comments. It makes a noticeable improvement with search results using -extindex ("all") and lei results with multiple inboxes. Update some style bits at the top of the test case while we're at it.
Fixes: f0ef0a56a8957d6f ("v2: improve deduplication checks")
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-05-12 rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.