lei: make --dedupe=content always account for Message-IDs

The content dedupe logic was originally designed for v2 public inboxes as a fallback for when the importer sees identical Message-IDs. Thus it did not account for Message-ID(s) in the message itself.

This change doesn't affect saved searches (the default when writing to a pathname or IMAP). It affects --no-save and output to stdout (even if stdout is redirected to a file).

Prior to this change, lei reused the v2 logic as-is with `--dedupe=content' (the default), without accounting for Message-IDs anywhere. This could cause messages to be skipped when the content matches despite the Message-IDs being different.

So with this change, `lei q --dedupe=content' will hash the Message-ID(s) in the message to ensure messages with different Message-IDs are NOT deduplicated.

Whether this change is a bug fix or introduces a regression is actually debatable. In my mind, it is better to err on the side of showing too many messages rather than too few, even if the actual contents of the messages are identical. Making saved searches deduplicate without accounting for Message-IDs would be more difficult, too.
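
The effect described above can be illustrated with a minimal sketch. It is not the real PublicInbox::ContentHash code: `sketch_digest' is a hypothetical helper, and the message is assumed to be a plain hashref (`mids', `body') rather than a PublicInbox::Eml object. Only the "mid\0...\0" framing is borrowed from the patch.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical stand-in for content_digest(): $msg is a plain hashref
# ({ mids => [...], body => '...' }), not a PublicInbox::Eml object.
sub sketch_digest {
	my ($msg, $hash_mids) = @_;
	my $dig = Digest::SHA->new(256);
	if ($hash_mids) {
		# same "mid\0...\0" framing the patch uses
		$dig->add("mid\0$_\0") for @{$msg->{mids}};
	}
	# assumption: the body alone stands in for all other hashed content
	$dig->add("body\0$msg->{body}\0");
	return $dig->hexdigest;
}

# Two messages with identical content but different Message-IDs
my $x = { mids => ['a@example.com'], body => "hello\n" };
my $y = { mids => ['b@example.com'], body => "hello\n" };

# Message-IDs ignored: the digests collide, so one message would be skipped
print "mids ignored, same digest: ",
	(sketch_digest($x) eq sketch_digest($y) ? "yes" : "no"), "\n";

# Message-IDs hashed: the digests differ, so both messages are shown
print "mids hashed, same digest: ",
	(sketch_digest($x, 1) eq sketch_digest($y, 1) ? "yes" : "no"), "\n";
```

With the flag unset, the two messages are indistinguishable to the deduplicator; with it set, they are treated as distinct, which matches the "too many rather than too few" preference stated above.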
author: Eric Wong <e@80x24.org> 2023-06-15 09:50:53 +0000
committer: Eric Wong <e@80x24.org> 2023-06-15 19:40:57 +0000
commit: dc5fe01a85c943b24b7b2ae0929b4fccaf81235f (patch)
tree: d2093c5d8521b340135c2301ae8b8c0f494a0e29 /lib/PublicInbox/ContentHash.pm
parent: dd1c202fdec09f47dc2d17b715b2c5b78f506ac8 (diff)
download: public-inbox-dc5fe01a85c943b24b7b2ae0929b4fccaf81235f.tar.gz
1 files changed, 11 insertions, 4 deletions
diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index fc94257c..95ca2929 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -54,16 +54,23 @@ sub content_dig_i {
          $dig->add($s);
  }
  
-sub content_digest ($;$) {
-        my ($eml, $dig) = @_;
+sub content_digest ($;$$) {
+        my ($eml, $dig, $hash_mids) = @_;
          $dig //= Digest::SHA->new(256);
  
          # References: and In-Reply-To: get used interchangeably
          # in some "duplicates" in LKML.  We treat them the same
          # in SearchIdx, so treat them the same for this:
          # do NOT consider the Message-ID as part of the content_hash
-        # if we got here, we've already got Message-ID reuse
-        my %seen = map { $_ => 1 } @{mids($eml)};
+        # if we got here, we've already got Message-ID reuse for v2.
+        #
+        # However, `lei q --dedupe=content' does use $hash_mids since
+        # it doesn't have any other dedupe
+        my $mids = mids($eml);
+        if ($hash_mids) {
+                $dig->add("mid\0$_\0") for @$mids;
+        }
+        my %seen = map { $_ => 1 } @$mids;
          for (grep { !$seen{$_}++ } @{references($eml)}) {
                  utf8::encode($_);
                  $dig->add("ref\0$_\0");