Date: Sun, 21 Jan 2018 23:49:11 +0000
From: Eric Wong
To: Dimid Duchovny
Cc: msgthr-public@80x24.org
Subject: Re: Feature Request: thread grouping
Message-ID: <20180121234911.GA29238@whir>

Dimid Duchovny wrote:
> However, I realized that the last step (walking) is redundant,
> since that could be done by the library itself in the threading or
> ordering stages.

I think what you want is best done in the storage/indexing stage,
whereas msgthr is intended for displaying/rendering results that
were retrieved from some sort of search engine.

At least that's how notmuch does it, and I stole the logic for
public-inbox(*) as they both use Xapian.  I think mairix does
something similar, too; but it's been a while...

> E.g. keeping track of each container's thread,
> and when adding a message A as a child of message B, to point A's
> thread to B's one.
> We could use an array with a single element,
> or some other solution to have pass-by-reference semantics.
> Finally, all top-level containers should have their own msg_id as the thread,
> and all their descendants will point to it as well.

One advantage of doing this in the storage phase is that the info
is persistent, so you don't need to calculate it every time.
This is great when you're dealing with more message skeletons
than can fit in memory.
git@vger has over 300k messages, LKML will have several million
messages, and they both use String Message-IDs (being email), so
it'll be many hundreds of MB just in containers and Message-IDs.

Another huge advantage of doing this in the indexing phase is that
you can search for something in a single message and then pull
every message from the thread it belongs to with a boolean
thread_id search.  I also find the "-t" switch of mairix useful
for my private mail.

I can help you understand how public-inbox does this in
SearchIdx.pm (indexer) and Search.pm (read-only queries) if you're
not familiar with Perl5, but for now you can grab the code and try
understanding it on your own:

  git clone https://public-inbox.org/public-inbox
  http://repo.or.cz/public-inbox.git/blob/4f2f0eb94739edf:/lib/PublicInbox/SearchIdx.pm
  http://repo.or.cz/public-inbox.git/blob/4f2f0eb94739edf:/lib/PublicInbox/Search.pm

I'll be happy to answer questions on meta@public-inbox.org about it :)

> Would you consider adding such a feature? If so, I'll be happy to work
> out the details and submit a patch.

I'm not sure if it makes sense to add this without a stable
storage backend (Xapian or some other search indexer/DB).

Another potential problem with adding this to msgthr is that msgthr
is GPL-2+ (since it's a port of Mail::Thread from CPAN), but the
notmuch algorithm is GPL-3+, so I'm not allowed to put it into a
GPL-2+ project (AGPL-3+ is OK).  Maybe you can cite prior art from
mairix (GPL-2+), but I haven't looked at that code in many years
and don't remember it.
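
To make the indexing-stage idea above a bit more concrete, here is a
minimal Python sketch of assigning a thread_id when each message is
stored, then pulling a whole thread from any one message.  It is only
an illustration under assumptions: the class and method names
(ThreadIndex, add, thread) are hypothetical, plain dicts stand in for
a persistent index such as Xapian, and this is not the notmuch,
public-inbox, or mairix code.

```python
class ThreadIndex:
    """Toy in-memory stand-in for a persistent index (hypothetical API)."""

    def __init__(self):
        self.thread_of = {}   # Message-ID -> thread_id (includes unseen parents)
        self.members = {}     # thread_id -> list of Message-IDs
        self.seen = set()     # Message-IDs that were actually indexed
        self._next_id = 1

    def add(self, msg_id, references=()):
        """Index one message and return the thread_id it was assigned."""
        ids = [msg_id, *references]
        # Thread ids already known for this message or anything it references.
        known = {self.thread_of[i] for i in ids if i in self.thread_of}
        if known:
            tid = min(known)
            # This message links several threads: merge them into one.
            for other in known - {tid}:
                for m in self.members.pop(other):
                    self.thread_of[m] = tid
                    self.members[tid].append(m)
        else:
            tid = self._next_id
            self._next_id += 1
            self.members[tid] = []
        # Record the message and any not-yet-seen parents under this thread,
        # so a parent arriving later lands in the same thread.
        for i in ids:
            if i not in self.thread_of:
                self.thread_of[i] = tid
                self.members[tid].append(i)
        self.seen.add(msg_id)
        return tid

    def thread(self, msg_id):
        """Return every indexed message sharing msg_id's thread_id."""
        tid = self.thread_of[msg_id]
        return [m for m in self.members[tid] if m in self.seen]
```

With a real backend, the thread_id would be stored alongside each
document at indexing time, so thread() becomes a single boolean
lookup instead of a walk over containers in memory.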