msgthr user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed
* Feature Request: thread grouping
@ 2018-01-21  9:40 Dimid Duchovny
  2018-01-21 23:49 ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dimid Duchovny @ 2018-01-21  9:40 UTC (permalink / raw)
  To: msgthr-public

Hello Eric,

When using the library, I'd like to eventually know which mail belongs
to each thread
(The motivation is to use a Union-Find data structure).
Thus, I'm doing the following:
* adding messages
* threading
* ordering
* walking

I.e. something like this:
all_comms.each do |comm|
        [...]
        if in_reply_to && !in_reply_to.empty?
                refs = [in_reply_to]
        else
                refs = []
        end
        msgthr.add msg_id, refs, comm
end
msgthr.thread!
msgthr.order!{ |_| }
threads = {}

# We want to account for cases where the head of the thread is
missing, but still referenced.
last_thread = nil
msgthr.walk_thread do |level, container, index|
        msg = container.msg
        mid = container.mid
        if 0 == level
                threads[mid] = []
                last_thread = mid
        else
                threads[last_thread] << mid
        end
end

However, I realized that the last step (walking) is redundant,
since that could be done by the library itself in the threading or
ordering stages.
E.g. keeping track of each container's thread,
and when adding a message A as a child of message B, to point A's
thread to B's one.
We could use an array with a single element,
or some other solution to have pass-by-reference semantics.
Finally, all top-level containers should have their own msg_id as the thread,
and all their descendants will point to it as well.

Would you consider adding such a feature? If so, I'll be happy to work
out the details and submit a patch.

Thanks, and have a nice weekend.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-21  9:40 Feature Request: thread grouping Dimid Duchovny
@ 2018-01-21 23:49 ` Eric Wong
  2018-01-23 21:04   ` Dimid Duchovny
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-01-21 23:49 UTC (permalink / raw)
  To: Dimid Duchovny; +Cc: msgthr-public

Dimid Duchovny <dimidd@gmail.com> wrote:
> However, I realized that the last step (walking) is redundant,
> since that could be done by the library itself in the threading or
> ordering stages.

I think you want is best done in the storage/indexing stage;
whereas msgthr is intended for display/rendering results that
were retrieved from some sort of search engine.

At least thats how notmuch does it, and I stole the logic for
public-inbox(*) as they both use Xapian.  I think mairix does
something similar, too; but it's been a while...

> E.g. keeping track of each container's thread,
> and when adding a message A as a child of message B, to point A's
> thread to B's one.
> We could use an array with a single element,
> or some other solution to have pass-by-reference semantics.
> Finally, all top-level containers should have their own msg_id as the thread,
> and all their descendants will point to it as well.

One advantage to doing this in the storage phase is this info is
persistent and you don't need to calculate it every time.  This
is great when you're dealing with more message skeletons than
can fit in memory.  git@vger has over 300k messages, LKML will
have several million messages, and they both use String
Message-IDs (being email), so it'll be many hundreds of MB just
in containers and Message-IDs.

Another huge advantage in doing this when indexing a message
phase is you can easily search for something in a single
message and then easily pull every message from the thread it
belongs to based on a boolean thread_id search.  I also find
the "-t" switch of mairix being useful for my private mail.

I can help you understand how public-inbox does this in
SearchIdx.pm (indexer) and Search.pm (read-only queries) if
you're not familiar with Perl5, but for now you can grab the
code and try understanding it on your own:

	git clone https://public-inbox.org/public-inbox

http://repo.or.cz/public-inbox.git/blob/4f2f0eb94739edf:/lib/PublicInbox/SearchIdx.pm
http://repo.or.cz/public-inbox.git/blob/4f2f0eb94739edf:/lib/PublicInbox/Search.pm

I'll be happy to answer questions on meta@public-inbox.org
about it :)

> Would you consider adding such a feature? If so, I'll be happy to work
> out the details and submit a patch.

I'm not sure if it makes sense to add this without a stable
storage backend (Xapian or some other search indexer/DB).

Another potential problem is adding this to msgthr is msgthr is
GPL-2+ (since it's a port of Mail::Thread from CPAN); but the
notmuch algorithm is GPL-3+, so I'm not allowed to put it into
a GPL-2+ project (APGL-3+ is OK).

Maybe you can cite prior art from mairix (GPL-2+), but I haven't
looked at that code in many years and don't remember it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-21 23:49 ` Eric Wong
@ 2018-01-23 21:04   ` Dimid Duchovny
  2018-01-23 21:12     ` Dimid Duchovny
  0 siblings, 1 reply; 12+ messages in thread
From: Dimid Duchovny @ 2018-01-23 21:04 UTC (permalink / raw)
  To: Eric Wong; +Cc: msgthr-public

2018-01-22 1:49 GMT+02:00 Eric Wong <e@80x24.org>:
> Dimid Duchovny <dimidd@gmail.com> wrote:
>> However, I realized that the last step (walking) is redundant,
>> since that could be done by the library itself in the threading or
>> ordering stages.
>
> I think you want is best done in the storage/indexing stage;
> whereas msgthr is intended for display/rendering results that
> were retrieved from some sort of search engine.
>

You're right. In my case the flow was: read emails from storage ->
group to threads -> add thread field to storage.
However, I guess it's an edge-case.
On second thought, maybe it'd be better to have a more general solution.
E.g. let the client run an arbitrary callback after adding a child.

Here's a quick POC:
https://github.com/dimidd/msgthr/commit/1c701717d10879d492d8b55fb8ca2f1c53d7e13f

P.S. I hope you don't mind I uploaded my fork to github.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-23 21:04   ` Dimid Duchovny
@ 2018-01-23 21:12     ` Dimid Duchovny
  2018-01-23 22:03       ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dimid Duchovny @ 2018-01-23 21:12 UTC (permalink / raw)
  To: Eric Wong; +Cc: msgthr-public

Sorry, the link should be:
https://github.com/dimidd/msgthr/commit/1c701717d10879d492d8b55fb8ca2f1c53d7e13f.diff

2018-01-23 23:04 GMT+02:00 Dimid Duchovny <dimidd@gmail.com>:
> 2018-01-22 1:49 GMT+02:00 Eric Wong <e@80x24.org>:
>> Dimid Duchovny <dimidd@gmail.com> wrote:
>>> However, I realized that the last step (walking) is redundant,
>>> since that could be done by the library itself in the threading or
>>> ordering stages.
>>
>> I think you want is best done in the storage/indexing stage;
>> whereas msgthr is intended for display/rendering results that
>> were retrieved from some sort of search engine.
>>
>
> You're right. In my case the flow was: read emails from storage ->
> group to threads -> add thread field to storage.
> However, I guess it's an edge-case.
> On second thought, maybe it'd be better to have a more general solution.
> E.g. let the client run an arbitrary callback after adding a child.
>
> Here's a quick POC:
> https://github.com/dimidd/msgthr/commit/1c701717d10879d492d8b55fb8ca2f1c53d7e13f
>
> P.S. I hope you don't mind I uploaded my fork to github.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-23 21:12     ` Dimid Duchovny
@ 2018-01-23 22:03       ` Eric Wong
  2018-01-24 10:28         ` Dimid Duchovny
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-01-23 22:03 UTC (permalink / raw)
  To: Dimid Duchovny; +Cc: msgthr-public

Dimid Duchovny <dimidd@gmail.com> wrote:
> > You're right. In my case the flow was: read emails from storage ->
> > group to threads -> add thread field to storage.
> > However, I guess it's an edge-case.
> > On second thought, maybe it'd be better to have a more general solution.
> > E.g. let the client run an arbitrary callback after adding a child.

OK, I guess you managed to fit skeletons of all your messages in memory?

> > Here's a quick POC:
> > https://github.com/dimidd/msgthr/commit/1c701717d10879d492d8b55fb8ca2f1c53d7e13f

(truncated output of "git show 1c701717d10879d492d8b55fb8ca2f1c53d7e13f"

>     add callback to Msgthr#add
>     
>     The motivation is to allow the client to have a custom code executed,
>         whenever a child is added.
> 
> --- a/lib/msgthr.rb
> +++ b/lib/msgthr.rb
> @@ -166,12 +166,16 @@ class Msgthr
>        # but do not change existing links or loop
>        if prev && !cont.parent && !cont.has_descendent(prev)
>          prev.add_child(cont)
> +        yield(prev, cont) if block_given?
>        end
>        prev = cont
>      end
>  
>      # set parent of this message to be the last element in refs
> -    prev.add_child(cur) if prev
> +    if prev
> +      prev.add_child(cur)
> +      yield(prev, cur) if block_given?
> +    end
>    end
>  end

OK, that seems generic enough and we can probably support it
long-term, so I'm somewhat inclined to accept it...

However, APIs encouraging/supporting folks to load their entire
collection(*) of messages (even skeletons) into memory feels
wrong to me.

Can you come up with a use case where this is useful for
a subset of messages?


(*) I work with millions of emails

> > P.S. I hope you don't mind I uploaded my fork to github.

That's fine, I just add a new remote(*) to my .git/config, fetch
and show.

What I won't accept about GitHub is having it as a centralized
and proprietary messaging system which forces participants to
accept their ToS.  I can't accept that; no single entity
controls email, so that's what I stick with.


(*) added this to my .git/config
==> .git/config <==
[remote "dimidd"]
	url = https://github.com/dimidd/msgthr
	fetch = refs/heads/*:refs/remotes/dimidd/*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-23 22:03       ` Eric Wong
@ 2018-01-24 10:28         ` Dimid Duchovny
  2018-01-24 19:18           ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dimid Duchovny @ 2018-01-24 10:28 UTC (permalink / raw)
  To: Eric Wong; +Cc: msgthr-public

2018-01-24 0:03 GMT+02:00 Eric Wong <e@80x24.org>:
> Dimid Duchovny <dimidd@gmail.com> wrote:
>> > You're right. In my case the flow was: read emails from storage ->
>> > group to threads -> add thread field to storage.
>> > However, I guess it's an edge-case.
>> > On second thought, maybe it'd be better to have a more general solution.
>> > E.g. let the client run an arbitrary callback after adding a child.
>
> OK, I guess you managed to fit skeletons of all your messages in memory?
>
>> > Here's a quick POC:
>> > https://github.com/dimidd/msgthr/commit/1c701717d10879d492d8b55fb8ca2f1c53d7e13f
>
> (truncated output of "git show 1c701717d10879d492d8b55fb8ca2f1c53d7e13f"
>
>>     add callback to Msgthr#add
>>
>>     The motivation is to allow the client to have a custom code executed,
>>         whenever a child is added.
>>
>> --- a/lib/msgthr.rb
>> +++ b/lib/msgthr.rb
>> @@ -166,12 +166,16 @@ class Msgthr
>>        # but do not change existing links or loop
>>        if prev && !cont.parent && !cont.has_descendent(prev)
>>          prev.add_child(cont)
>> +        yield(prev, cont) if block_given?
>>        end
>>        prev = cont
>>      end
>>
>>      # set parent of this message to be the last element in refs
>> -    prev.add_child(cur) if prev
>> +    if prev
>> +      prev.add_child(cur)
>> +      yield(prev, cur) if block_given?
>> +    end
>>    end
>>  end
>
> OK, that seems generic enough and we can probably support it
> long-term, so I'm somewhat inclined to accept it...
>
> However, APIs encouraging/supporting folks to load their entire
> collection(*) of messages (even skeletons) into memory feels
> wrong to me.
>
> Can you come up with a use case where this is useful for
> a subset of messages?
>

Well, in my specific case there weren't many messages, so memory
wasn't an issue.
In general, I think the question of adding the add_child callback is
orthogonal to the
question of using the entire collection or parts of.
I.e. one could use Msgthr as it is, with millions of emails, and one
could use the callback with only a few messages.
Consider this flow:
1. querying the storage backend according to some criteria (e.g. a
time range, a particular sender, etc.)
2. grouping the messages in the response to threads

I'd rather show than tell, so here's a more elaborated example:
https://github.com/dimidd/msgthr/commit/3e38a4910e7a3c17c07f47c4f1b9d556a4a951fd.patch

BTW, note how we only needed one pointer per message and one string
*per thread*,
by using an array with a single element and saving the actual message
only in the top level (the rootset).


>
> (*) I work with millions of emails
>
>> > P.S. I hope you don't mind I uploaded my fork to github.
>
> That's fine, I just add a new remote(*) to my .git/config, fetch
> and show.
>
> What I won't accept about GitHub is having it as a centralized
> and proprietary messaging system which forces participants to
> accept their ToS.  I can't accept that; no single entity
> controls email, so that's what I stick with.
>
>
> (*) added this to my .git/config
> ==> .git/config <==
> [remote "dimidd"]
>         url = https://github.com/dimidd/msgthr
>         fetch = refs/heads/*:refs/remotes/dimidd/*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-24 10:28         ` Dimid Duchovny
@ 2018-01-24 19:18           ` Eric Wong
  2018-01-24 21:14             ` Dimid Duchovny
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-01-24 19:18 UTC (permalink / raw)
  To: Dimid Duchovny; +Cc: msgthr-public

Dimid Duchovny <dimidd@gmail.com> wrote:
> Consider this flow:
> 1. querying the storage backend according to some criteria (e.g. a
> time range, a particular sender, etc.)
> 2. grouping the messages in the response to threads
> 
> I'd rather show than tell, so here's a more elaborated example:
> https://github.com/dimidd/msgthr/commit/3e38a4910e7a3c17c07f47c4f1b9d556a4a951fd.patch

OK, thank you; I see potential for this change, now.

Can you also add documentation for this feature so I can make a
release?  Thanks again.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-24 19:18           ` Eric Wong
@ 2018-01-24 21:14             ` Dimid Duchovny
  2018-01-24 22:49               ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dimid Duchovny @ 2018-01-24 21:14 UTC (permalink / raw)
  To: Eric Wong; +Cc: msgthr-public

> Can you also add documentation for this feature so I can make a
> release?  Thanks again.

Sure, let me know if it's clear enough:
https://github.com/dimidd/msgthr/commit/6a1d06abc6c42e504a83dca4420c1d5ef39f09d2.patch

I've tested with olddoc.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-24 21:14             ` Dimid Duchovny
@ 2018-01-24 22:49               ` Eric Wong
  2018-01-25  8:16                 ` Dimid Duchovny
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-01-24 22:49 UTC (permalink / raw)
  To: Dimid Duchovny; +Cc: msgthr-public

Dimid Duchovny <dimidd@gmail.com> wrote:
> > Can you also add documentation for this feature so I can make a
> > release?  Thanks again.
> 
> Sure, let me know if it's clear enough:
> https://github.com/dimidd/msgthr/commit/6a1d06abc6c42e504a83dca4420c1d5ef39f09d2.patch

Thanks, that's fine and I've pushed to 80x24.org/msgthr.git
I noticed you're also working on a new lca branch, should we
wait for that, or are you good with things as-is?

Also, if you want, you can also write some text for release
announcement which will be read by ruby-talk (just reply here
with the text).  Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-24 22:49               ` Eric Wong
@ 2018-01-25  8:16                 ` Dimid Duchovny
  2018-01-25  8:38                   ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dimid Duchovny @ 2018-01-25  8:16 UTC (permalink / raw)
  To: Eric Wong; +Cc: msgthr-public

2018-01-25 0:49 GMT+02:00 Eric Wong <e@80x24.org>:
> Dimid Duchovny <dimidd@gmail.com> wrote:
>> > Can you also add documentation for this feature so I can make a
>> > release?  Thanks again.
>>
>> Sure, let me know if it's clear enough:
>> https://github.com/dimidd/msgthr/commit/6a1d06abc6c42e504a83dca4420c1d5ef39f09d2.patch
>
> Thanks, that's fine and I've pushed to 80x24.org/msgthr.git
> I noticed you're also working on a new lca branch, should we
> wait for that, or are you good with things as-is?

Thanks, I'm  good. The lca branch is intended for a POC of my actual use case,
Lowest Common Ancestor in email threads.
It has another test for the add_child callback. It's a rather specific
and complex test, so I've put it in a separate branch.
That said, feel free to merge it as you see fit.

> Also, if you want, you can also write some text for release
> announcement which will be read by ruby-talk (just reply here
> with the text).  Thanks.

Hmmm, I wasn't thinking about an announcement for such a minor change.
Could you write one?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-25  8:16                 ` Dimid Duchovny
@ 2018-01-25  8:38                   ` Eric Wong
  2018-02-08 13:06                     ` Dimid Duchovny
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-01-25  8:38 UTC (permalink / raw)
  To: Dimid Duchovny; +Cc: msgthr-public

Dimid Duchovny <dimidd@gmail.com> wrote:
> 2018-01-25 0:49 GMT+02:00 Eric Wong <e@80x24.org>:
> > I noticed you're also working on a new lca branch, should we
> > wait for that, or are you good with things as-is?
> 
> Thanks, I'm  good. The lca branch is intended for a POC of my actual use case,
> Lowest Common Ancestor in email threads.
> It has another test for the add_child callback. It's a rather specific
> and complex test, so I've put it in a separate branch.

OK, I will wait until it's more proven.  I'm pretty conservative
about accepting new features because of support and performance
concerns.

> > Also, if you want, you can also write some text for release
> > announcement which will be read by ruby-talk (just reply here
> > with the text).  Thanks.
> 
> Hmmm, I wasn't thinking about an announcement for such a minor change.
> Could you write one?

Sure, but I was just thinking you could do a better job
introducing it to other people since you wrote it :)
I announce all releases of software I intend to support to ruby-talk.
I'm sleepy so my English has deteriorated, will wait until tomorrow
before releasing + announcement.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Feature Request: thread grouping
  2018-01-25  8:38                   ` Eric Wong
@ 2018-02-08 13:06                     ` Dimid Duchovny
  0 siblings, 0 replies; 12+ messages in thread
From: Dimid Duchovny @ 2018-02-08 13:06 UTC (permalink / raw)
  To: Eric Wong; +Cc: msgthr-public

> Sure, but I was just thinking you could do a better job
> introducing it to other people since you wrote it :)
> I announce all releases of software I intend to support to ruby-talk.
> I'm sleepy so my English has deteriorated, will wait until tomorrow
> before releasing + announcement.


Hi Eric,

Thanks for writing the announcement. I just realized that solely for
thread grouping purposes,
It's not necessary to call `thread!`, since it doesn't create any new edges.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-02-08 13:07 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-21  9:40 Feature Request: thread grouping Dimid Duchovny
2018-01-21 23:49 ` Eric Wong
2018-01-23 21:04   ` Dimid Duchovny
2018-01-23 21:12     ` Dimid Duchovny
2018-01-23 22:03       ` Eric Wong
2018-01-24 10:28         ` Dimid Duchovny
2018-01-24 19:18           ` Eric Wong
2018-01-24 21:14             ` Dimid Duchovny
2018-01-24 22:49               ` Eric Wong
2018-01-25  8:16                 ` Dimid Duchovny
2018-01-25  8:38                   ` Eric Wong
2018-02-08 13:06                     ` Dimid Duchovny

Code repositories for project(s) associated with this public inbox

	https://80x24.org/msgthr.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).