Welcome to this talk, "Raining over high-volume Debian emails", by Pablo Ariel Duboue. Go ahead.

Hello. So we are here in Nicaragua, where it rains, and we're talking about what to do when you have a waterfall of emails. This is a very, very short introduction before we go a little bit around the room and talk about how different people have different ways to deal with the influx of emails in Debian.

The motivation is, of course, that as we all know the Debian community depends heavily on email. Not only for personal communication: there are key tools, part of the infrastructure, that are completely mail-based. I have seen that over time most Debian contributors have developed their own way of dealing with that. On the other hand, at my, let's call it, daytime job I work on machine learning and natural language processing, and for many, many years there have been tons of papers coming out saying you can use this technique or that technique to help a person deal with a lot of information. Interestingly, that type of technology is still not deployed in the field. Something I have learned from the free software world is this concept that you work much better when you actually work on something you want to use yourself. That's seldom the case in science, and I want to put it to the test with this.

So here I want to share a tool I developed for my own use and another tool that I wrote for community use, and then brainstorm with people whether the tools I wrote could be useful for other people, what techniques people use, and how we can improve them with some machine learning.
So the tool I wrote has the name, so far, of Smart Mailing List Reader, but the name is subject to change when we come up with a better one. It is a fairly simple tool. The emails coming in go to a regular mail archive in mbox format. Then you have a back end that uses Perl, with mailbox and threading modules and a classifier, dbacl, and populates an SQLite database, and then a front end written in Scala and a framework from NextApp. These were chosen because I really wanted to use the tool and I didn't have much time, so I just went for the technology I was more familiar with and that was faster. It took me about a week to put this together, and I've been using it for a while now. I'm pretty happy with it, although it has a number of shortcomings for other people.

Let's take a look at what the tool looks like. Here you have the debconf-discuss mailing list, filtered at a particular threshold, minus 10. That means you see absolutely every email that came in during the last week on debconf-discuss. That's three pages of emails, 140 messages, and here you have the different scores assigned by the classifier. For example, I don't play Assassins, so even though those are not particularly bad emails, I don't really care about them so much. And it seems I'm a really bad person, because I don't care about lost cameras. If I don't have much time to read emails and I filter at score 1.0, then that gets reduced to 29 emails. And, well, I'm very interested in this open block stuff, or maybe in buying a Debian jacket.

So how does the classifier work? When you click to see an email, you have these buttons over there that say "this is an email I really like" or "this is an email I really don't like", and all of that gets stored in the SQLite database and a classifier is trained on it. This is definitely not rocket science.
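The threshold view and the like/dislike buttons described here can be illustrated with a minimal sketch. It is not the tool's actual schema: the `messages` table, its columns, and the function names are assumptions made up for this example, and the scores are simply given rather than produced by a classifier.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a tiny database with one hypothetical table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               subject TEXT NOT NULL,
               score REAL NOT NULL,      -- score assigned by the classifier
               feedback INTEGER          -- NULL until the reader clicks a button
           )"""
    )
    return db


def add_message(db, subject, score):
    """Store one scored message and return its row id."""
    cur = db.execute(
        "INSERT INTO messages (subject, score) VALUES (?, ?)", (subject, score)
    )
    return cur.lastrowid


def visible(db, threshold):
    """Subjects of the messages at or above the threshold, best first."""
    rows = db.execute(
        "SELECT subject FROM messages WHERE score >= ? ORDER BY score DESC, id",
        (threshold,),
    )
    return [row[0] for row in rows]


def record_feedback(db, message_id, value):
    """Store a button click; the -2..+2 scale is an assumption of this sketch."""
    if value not in (-2, -1, 1, 2):
        raise ValueError("feedback must be one of -2, -1, 1, 2")
    db.execute("UPDATE messages SET feedback = ? WHERE id = ?", (value, message_id))
```

With a threshold of -10 every stored message is listed; raising it to 1.0 keeps only the messages the classifier scored at least that high.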
The only real work is behind the database, the schema: a few tables with messages and threads and how each message got predicted and things like that. Most of my work on building this tool went into putting together the database schema.

As for the classifier itself, I'm using a classifier that is packaged in Debian, which I have used for other projects and highly recommend. It is very fast, it handles emails natively, and it's more than just a naive Bayes classifier: it's a maximum entropy classifier. The author has actually used it to train a program to play chess, so it's definitely not your run-of-the-mill spam detector. The way I use the categories is that messages I have marked plus two or minus two are always used as positive or negative training examples, and the plus one and minus one are sampled. Basically, plus two says "I always want things like this", while plus one is "well, things like this I like". It also takes into account emails whose title you saw but didn't choose to open. Those are like mild negatives.

Besides the classifier, I also have a little rule engine where you can write actual executable Perl code, because I don't hang out with Luciano enough to learn about security and web apps. You can write rules saying "if somebody says my name, I want this to get a plus 10", or, like one of the rules I have, things that come from debian.org addresses on the debian-devel mailing list should get a little push up. That's because I don't know enough Debian developers yet.

And the UI is a really, really funny UI. If you know the very, very old Java UI framework Swing: somebody actually wrote something where you write code that looks like Swing and it renders an Ajax application. It's really, really ugly. It's not the way to do it, but it is really fast.
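The training-example selection and the rule engine described above could look roughly like the following. This is a sketch in Python rather than the Perl the tool uses, and several details are assumptions: the sampling rate, the treatment of unopened messages as sampled negatives, the size of the "little push up" (1.0 here), and the message dictionary layout.

```python
import random


def select_training(examples, sample_rate=0.5, rng=None):
    """Split marked messages into positive and negative training sets.

    `examples` is a list of (message_id, mark) pairs, where mark is +2, +1,
    -1, -2, or None for a message whose title was seen but never opened.
    Strong marks are always kept; weak marks and unopened messages are
    sampled (the rate is an assumption of this sketch).
    """
    rng = rng or random.Random(0)
    positives, negatives = [], []
    for message_id, mark in examples:
        strong = mark in (2, -2)
        if not strong and rng.random() >= sample_rate:
            continue  # weak example not sampled this time
        if mark is not None and mark > 0:
            positives.append(message_id)
        else:
            negatives.append(message_id)  # includes unopened "mild negatives"
    return positives, negatives


# Each rule is (predicate, bonus). The +10 comes from the talk; the 1.0 is assumed.
RULES = [
    (lambda m: "pablo" in m["body"].lower(), 10.0),
    (lambda m: m["list"] == "debian-devel" and m["from"].endswith("@debian.org"), 1.0),
]


def apply_rules(message, base_score, rules=RULES):
    """Add the bonus of every matching rule to the classifier's score."""
    return base_score + sum(bonus for matches, bonus in rules if matches(message))
```

Keeping the rules as data (a list of predicate and bonus pairs) avoids evaluating arbitrary code typed into a web form, which is the security concern the speaker jokes about.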
The whole UI is 500 lines of code. OK, so this is what I have, and it makes me happy.

The UI still has issues. For example, I compute the whole thread structure, but I don't expose it in the UI. The main point, and we were just talking about it before, is to make it use IMAP rather than mbox files, so you can glue it together with another tool. The other thing is that the tool as it is right now is really set up for my machines: I run the UI in a VM and the back end on a different machine, and so on. So it's not really installable, and I want to work on getting a version that people can install.

So that's one side. The other tool I wrote is 150 lines of Perl: a bridge between the debian-user-spanish mailing list and Twitter. If you subscribe to @debian_es, then for every discussion on the mailing list that has at least three emails from at least two different authors (that helps with spam detection) you get a tweet, and you only get one per thread. So you can know what's going on on the mailing list. Over eight months it has generated around a thousand tweets. It's not particularly silent, but it's definitely much less volume than being subscribed to the mailing list and reading it every day. So far it has got a hundred followers, which for me justifies keeping the bot running. The bot is public domain. You can download it.
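The announcement rule of the mailing-list-to-Twitter bridge (at least three emails, at least two different authors, one tweet per thread) can be sketched as below. The real bot is written in Perl; the function name, the `(thread_id, author)` input format, and the `already_tweeted` set are assumptions for illustration, and posting the tweet itself is left out.

```python
from collections import defaultdict


def threads_to_announce(messages, already_tweeted, min_messages=3, min_authors=2):
    """Return thread ids that have just become worth one announcement.

    `messages` is an iterable of (thread_id, author) pairs in arrival order.
    `already_tweeted` is a set that is updated in place, so each thread is
    announced at most once even across repeated calls.
    """
    counts = defaultdict(int)
    authors = defaultdict(set)
    announce = []
    for thread_id, author in messages:
        counts[thread_id] += 1
        authors[thread_id].add(author)
        if (
            thread_id not in already_tweeted
            and counts[thread_id] >= min_messages
            and len(authors[thread_id]) >= min_authors
        ):
            already_tweeted.add(thread_id)
            announce.append(thread_id)
    return announce
```

Requiring two distinct authors means that a single sender posting repeatedly, as spam often does, never triggers a tweet.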
The link for downloading is on the Twitter account.

On a more serious note, I have a personal project of trying to reproduce the Kernel Traffic summarization of the Linux kernel mailing list. Somebody, for five years, summarized by hand every week all the emails from the Linux kernel mailing list. All those summaries are available under the GPL, and we can use them to actually build a system that can summarize debian-devel. I published a paper on that a few months ago, and what I'm working on now is aligning the original emails with the summaries and trying to derive rules for when you want to include a certain part. That's a very well-studied problem, but here we have very good data to work with. This would be very, very nice: you could have people who are interested in Debian but are not developers read a weekly summary of what's going on on debian-devel. If we get to that point, I will be a very happy Pablo.

When I started the smart mailing list reader, people pointed me to this blog, and I would really like it if more people wrote blog posts like that. Joey analyzed what in the thread structure makes a thread interesting to read. For example, if you have a thread with only two people going back and forth, that's what he called "take it to private email". If you have one person posting something and immediately replying to himself, that's "think before you post". Or somebody posts something and then you don't have any discussion, you just have a ton of responses to the same message.
That's the "blindingly obvious answer". So I would like to add this type of identification.

Yeah, we got a comment from IRC, from Kevin Mark. He says it would be cool to have something that matches the Debian Weekly News. Yep, indeed. It would be even cooler if Kevin was here, but OK.

So that's what I have. Now I would like to know how you deal with the ridiculous amount of email we receive in Debian, and your own personal strategies.

In Python Argentina, the mailing list has a karma button, so you can say that a mail has positive or negative karma, and it's included as a header in future mails. I mean, the machine learns and then automatically tags the mails with that karma, so you can filter by it. So if you used to post silly things, you get bad karma and probably nobody is going to read you in the future. Are you familiar with that system?

I was thinking along similar lines, so yes. But to assign the karma, you have to go to a website and click a button for that particular email? How does it work, a link to say so?

Very nice. How would people feel about that? Joey is less than nice on his blog; he called them kooks. He says, well, if this thread has too high a kook-to-good-posters ratio, I don't read it. But that's the same as bad karma.

Wouldn't it be nice to have a karma system for emails that's just based on the headers? Two thoughts about that. You could include a karma header in your reply. And that sounds like, "oh wait, but I don't necessarily want to reply to something". Then perhaps you shouldn't be judging, if it's not so important that you want to reply.

So does the karma go to the person, or is it automatic text classification on the content of the email?
Yeah. He said that it is on the person, but it could be re-implemented based on the text.

Well, one thing that strikes me here is that, yes, I understand that for this kind of feedback the most usual interface would be web-based. However, there's this cultural issue in our community that we tend to avoid web-based interfaces whenever mail-based interfaces are available. So having a mail-based interface to rate mails could be strange, I don't know, but it might work better with this group.

For this type of thing we already have spam-collecting addresses that you can bounce spam to, so we could have positive and negative karma bounce addresses. From whatever list on Debian, you just bounce: that's good, that's bad, that's very good.

Yeah, then you could have the lists put in some karma as an experimental thing. One thing I wanted to add is that, in my experience, when you read many, many mails from mailing lists, a good thing is to revisit the subscriptions you have from time to time and sometimes decide that a list is not worth the time. Refraining from subscribing to every list is also a good way to avoid having too many mails.

So I'm starting to get a very clear picture, and it involves two mechanisms. One is replying with karma, and that just gives karma to the thread, not to any specific person. So some kind of weighted average of the karma, with decay. The other mechanism would be rating people, and that would involve addresses where you send somebody's email address to that special address.

He was talking about bouncing, so there you bounce the whole email, and we can use the email text to do that or we can use the sender. The question is whether to go text-based, in which case, as you were saying, it will be for the whole thread, and then you can detect flame wars and things like that, or to go for people.
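The "weighted average of the karma, with decay" proposed for threads is not specified further in the discussion. One possible reading, offered only as a sketch, is an exponentially decayed average in which a vote loses half its weight every `half_life_days` days; the function name, the half-life parameter, and the vote format are all assumptions.

```python
def thread_karma(votes, now, half_life_days=7.0):
    """Exponentially decayed weighted average of karma votes for one thread.

    `votes` is a list of (time_in_days, value) pairs, for example value = +1
    or -1 taken from a hypothetical karma header in a reply. `now` is the
    current time in the same unit. A thread with no votes gets 0.0.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for when, value in votes:
        weight = 0.5 ** ((now - when) / half_life_days)  # older votes count less
        weighted_sum += weight * value
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

For instance, a +1 vote cast a week ago and a -1 vote cast today get weights 0.5 and 1, so the thread's karma is (0.5 - 1) / 1.5, about -0.33: the recent vote dominates without erasing the older one.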
That will most likely involve very nice flame wars.

But I was sort of riffing on his idea. I think it makes more sense to keep the karma regarding specific mails or threads in the thread, but the special address, I think, would be very nice for giving karma to people.

But the reason I have this smart mailing list reader is that actually it's not a matter of karma. I'm interested in contributing natural language processing and machine learning work to a bunch of projects. I'm not interested in everything those projects discuss; I'm only interested if they say something related to my field. So I'm subscribed to the LibreOffice mailing list, which has huge traffic, and most likely I'm only interested in five emails a month at most. It's not that they are saying things that are bad. It's that I don't care.

There's a danger with the bounce sort of approach: people who are not currently involved can't get involved, because their mails will not get read, because they don't have karma. So I think this sort of thing is really personal. You should do it only on your personal machine. You shouldn't be sharing the results with other people.

Great. I mean, I was really thinking about it as something personal. The other part is that I'm tired of Google knowing everything I do, and the nice thing about this is that you keep what you read and what you didn't read private.

So I'm currently using Gmail for reading email. I would like to switch to this; it sounds really useful, but the interface maybe needs a bit of work, in terms of the threading particularly.

If I can get more people to help... I'm not a UI guy at any point in time, as you can see.

What language did you say it was written in?

It is written in Scala, but in reality all you have is an SQLite database that you want to serve. You can do it in PHP.
You can do it in any language you want. So you could theoretically write UIs in different languages and keep the back end and everything you already have. And you can also write back ends in different languages, as long as they populate the same database.

So I was just thinking about the "no karma for unknown people" problem. You could certainly start unknown people at some positive value. I mean, be an optimist. It seems like a relatively simple problem to solve. Of course, I have probably missed the hard part.

I feel the karma issue is more something that projects should discuss than a technology matter, so I'm fine with whatever people want to do. What I'm more interested in is which type of technology can help. If we want to implement a karma system based on text, I can definitely help. A karma system based on the sender, as some people suggested, doesn't need any NLP or anything like that. It's just counting.

Maybe one thing to add is that we already have some sort of karma, just for spam detection. I mean, we already rank some mails below some threshold based on their content. So maybe we could extend this spam detection. Basically it's the same tools.

Yeah, but community-based karma really leads you to groupthink and to voting, and to, well, not efficiently, but effectively voting on the people who mail, because you have more people bouncing the same mail to either the positive or the negative address.

Yes, I feel I have something to say right there. There was discussion along these lines in a talk in 2010.

Flame war detection.
Yeah, exactly. Obviously nothing has been done about it since, but it seemed like there were some quite good ideas that popped up around that time: you'd mechanically spot flame wars, tag the message that gave rise to the flame war as being flame bait, and then use machine learning to spot similar messages in the future. And if you provoked the filter, it could put the flame bait on hold for 12 hours and send a message to the sender saying: "You do realize you just triggered the flame bait detector? Do you want to get the bad reputation that goes with that, or do you want to click this link to get rid of the mail and pretend it was never sent?"

One of the nice things is that if you get people to adapt and write emails that bypass the flame war filter, there will be no flame war anyway, because the email will be too complicated for anyone to understand that they are just flaming.

And maybe another thing to add is that we also have a mechanism to prevent flame wars in Debian, and it is named listmasters. Any reader of debian-devel can report anyone's behavior on the list to the listmasters, and they quite often react by banning for a certain amount of time. I mean, you have to make a case and give some emails to justify that this person should be banned for some time from the mailing list because their contributions are not positive to the discussion.

How often does that happen, though?

Ask them. Like one person per year, I think.

My feeling is that it's more than that, probably something like around one per month.

Wow.

On all lists. But I don't know.

So Phil, how do you handle emails? Just to pick a random person.
Well, at the moment I'm not reading email, because my laptop drive crashed two days before DebCamp and I've been trying to repair my notmuch database ever since. So I'm having a very, very relaxed DebCamp and DebConf, because I just haven't looked at email at all while I've been here. I didn't know what the day trip was going to be, and it's just great. It's like a magical mystery tour.

Normally I use notmuch, which is just really fantastic. It's not much of a mailer; it's just a thing for indexing every mail you've got with Xapian. The UI that's best developed for it is based in Emacs, which will piss a lot of people off, but some lunatic has done a Vim version as well, so you can do all your email within some mad Vim mode. You can plumb it into mutt as well. It would be quite nice if the natural language stuff you're talking about could interact with that.

OK, so the way that it works, with Emacs at least, is that you can apply tags to any message, so you could have extra tags for how spammy something is. Then you can do searches across the whole text of all the emails that you've got, and you get the first 20 or so emails in about two seconds, even if you've specified quite a complicated search with multiple tags and multiple search terms.

Do you actually assign tags yourself?
You are the mythical person who assigns tags?

Yes. First, you have a personal database with all the tags on all the emails that you've got.

But you actually tag emails yourself, saying "this email gets this tag"?

Well, what you normally do is this: when you first get a new email, you have a lot of rules. Instead of procmail rules, you have rules that say "if it's to me, then add this to-me tag", and "if it's to one of these lists, then add a list tag", and then "under these circumstances, remove the inbox tag", so I never have to see it unless I'm searching for it. That's the way I try to make as few things as possible land in my inbox, even though that's actually not a folder that exists. Those are the mails that I actually try to read. For other mails, if someone refers to something, you can search for it really quickly and find the thread. And sometimes, when a thread keeps coming back from the dead, like, you know, the tmpfs thread or the systemd thread, I tag one of the messages in the thread as killed and then I never see that thread again. Which is rather nice. Very relaxing.

Anyone else?

Just a quick follow-up. Thanks, Phil, I'm glad you like notmuch. Also, for people who think that reading email in an editor is fundamentally a stupid idea, there is a Python-based curses interface called alot. I know, puns are one of the great trials of the notmuch community. It should be uploaded to Debian pretty soon, so it's nearing production readiness, I guess. The tagging script I use is called afew.

Yeah, so I've read about notmuch a couple of times and it sounds really good, but I've never quite worked out what it was that I had to do to get it in use. So if you're the notmuch guy, that's a really useful piece of information.

So what do you use to handle email yourself?
I just use mutt.

OK.

And all my email lives on a server on the net, so I get the same IMAP view of everything everywhere. It doesn't matter which machine I'm on. Except at work, where of course they've got Exchange, so work mail just kind of goes in a bucket that I mostly ignore. I look at it once a week and discover I've missed a meeting. They wouldn't allow IMAP until last year. We had to bitch and bitch and bitch, so now we can actually read mail. Real mail.

So how much time a day do you spend reading mailing lists?

Most of it. I don't do much apart from reading and writing mail.

I spent some time recently looking at the use of IMAP keywords for mail instead of folders. At the moment I just automatically filter everything into folders with procmail. I use a variety of mail readers, like on my phone, Thunderbird on a Windows machine, plus a web interface with SquirrelMail. So I started looking at whether I could transition from folders to keywords, and I found two problems. One was simply that not all the clients support keywords in the same way, and that was going to be a pain. The other was synchronizing the keywords: each client that I'm currently aware of needs to have a local list of the keywords, and you need to enter them into the client manually. So if you have a desktop and a laptop, you have to set up your list of keywords, all your custom keywords, on both of them, and I couldn't find any simple way to synchronize that between all my clients. So that's as far as I got, and it just seemed like I didn't have time to go into it and actually find a solution. But the thing I liked about keywords is that you can have multiple keywords and you can declare, basically, views instead of folders. That can provide a lot of sophistication for managing email, and also for managing it at different times of the day. Like, you could say
during business hours my main view might be focused on my business email, and outside business hours you might want to exclude business email, whether it's in a folder or on your private email address or something else. You can tune that a little bit. But there are those big stumbling blocks that I mentioned before I could get to that point.

And in general, how do you deal with so much email for Debian, in the sense of your off-work hours? How do you manage to go through it? Do you just go and look at all the threads and choose to participate in some?

At the moment I deal with it very selectively. I only look at things if I'm very actively interested in a subject. When I join a list, I set up a folder for that list and a rule in procmail to capture everything that I receive on that list, and I won't even look in that folder unless I'm participating in some thread. One consequence of that is that people often have to CC me privately to wake me up to look at something, especially if it's a few days after the thread has gone quiet.

Yeah, I do something very similar to most of that. I actually use exim rather than procmail to do the filtering, and I used to filter all my Debian mail into a Debian bucket. The problem was that it was very hard to get bug mail to end up in my inbox and not in the Debian bucket, because the Debian bucket didn't get looked in very often. I kept failing to write a rule, so important bugs would moulder for three months before I noticed that something was broken, which is kind of embarrassing.

So you wanted personal emails to your Debian address to go to your inbox, not to your...

Exactly. At the moment various things broke on the server.
So everything is just going to my inbox, which means I notice things much more and actually reply, which is good, but you get more mail in your inbox.

That was something we were discussing with Zumbi at breakfast: a classifier that detects email that is, in a sense, addressed to you. It's quite important, and it's not as trivial as it sounds, because if you're having a conversation on a mailing list, the messages are just addressed to the mailing list. So you have to know the previous history: that you replied, or that you started the thread.

Maybe to add to the things that can be done to read mail effectively: it is also not really technical, but rather about how you manage your time. I have found several times that if you start just reading your email in the background, you end up doing nothing but refreshing your email to see if by any chance something happened on some Debian mailing list. Oh, it didn't. Five minutes later you try again and, oops, it didn't, and then you spend the whole day refreshing folders. Nowadays I have the opportunity to commute by train, so I have 40 minutes offline. I usually sync before boarding the train and assign those 40 minutes to this, and when I get out, it's done or not, but it's over. It's mostly a matter of assigning time.

And just to detail my technical solution: I have several accounts that I sync using offlineimap, and when the mails enter Dovecot they are re-posted into Dovecot and filtered using the Sieve extension, and then they are sorted into different folders. The advantage of this is that if you don't have your computer, you can go to the webmail of the source account and the mails are already filtered. So it's quite a nice solution. It has some drawbacks, and sometimes offlineimap just crashes for one of the folders.
So I just log in to restart it from time to time.

And is Sieve a program or a protocol?

Sieve is actually a protocol to define filters.

Which Sieve program do you use?

I use Dovecot. Dovecot has something to do this. It's quite tricky, because you have to re-send the mails to Dovecot when they arrive through offlineimap, which is quite ugly, but it works.

Do you have a blog post describing your setup? Because it sounds very...

Not yet.

Rhonda, do you want to share your email life?

Well, over the last years I more or less started to not follow email too much. I have picked up the habit of following IRC instead, and that has worked out pretty well. People throw pointers around on IRC anyway when there's something interesting going on in the mail, so that works for me most of the time. I'm still a plain mutt user and have some procmail filters, but the thing that has worked best for me is coloring emails depending on who is sending them, or for other reasons.

Yeah, but with what you were mentioning, we could use a bot on #debian-devel or some other channels to acquire interesting training data for email categorization. Every time somebody posts a link to some email in the archive there, you say, OK, this is really newsworthy, people are talking about it.

No. About half the time those links are posted as an example of the pits of human behavior. People shaking their heads, like, "Oh my goodness, I can't believe what so-and-so just posted."

Which you could also use for a bot, but you'd have to differentiate somehow.

Pabs, come on, you keep track of all the debian-mentors emails. How do you do that?

I think I said before that I'm using Google Mail. Just for the lists.

But do you use their importance training system?

I skim or read every thread.

How much time do you spend on that?

Way too much.
Well, again, another thing one can do besides unsubscribing from the mailing list is to blacklist a thread. Once you have spent too much time on it, you can just decide that any further contribution to it is too much and you don't care, and then every mail in it will just be marked as read by your mail client. It's quite a good way to avoid noise, and avoiding noise is good, because what you want is information, not noise.

Just a couple of other things that come to mind. My system of organizing things into folders by list is not so sophisticated, because you could end up with something important in any one of those folders and not know it's there. And I think that email as it stands doesn't really give the sender a lot of options to emphasize what their message is like. Some messages, like an invitation to a party, are probably a lot more important before the party than after the party. So it would be a vital clue if the email reader had some way of knowing that after a certain date this message is meaningless. I think there are a lot of opportunities like that that could be explored. One option is to send calendar invites instead of regular emails, and that might fulfill some of that, but maybe there are more specific things that could be done for the Debian project, particularly relating to bugs or other things, to help the recipient prioritize. Because otherwise you just have a long list of things and you go through it from start to finish.

That's something that could be done. For example, if there's a discussion about a bug and the bug has been closed, then we could have something that marks that whole thing as no longer relevant. So that's a very good point. And at least for DebConf, for example, there are a lot of emails that are time-dependent, and it can make a big difference.
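The idea that a mail reader could know when a message stops being relevant can be sketched with Python's standard `email` package. Nothing like this is agreed on in the discussion: the header name `X-Relevant-Until` is purely hypothetical, as is the choice to treat a missing or unparsable header as "never stale".

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical header a sender could add; not an existing convention.
RELEVANCE_HEADER = "X-Relevant-Until"


def is_stale(raw_message, now):
    """True if the message carries a relevance deadline that has passed.

    `raw_message` is the full message text; `now` is a timezone-aware
    datetime. Messages without the header, or with a date that cannot be
    parsed, are never considered stale.
    """
    message = message_from_string(raw_message)
    value = message.get(RELEVANCE_HEADER)
    if value is None:
        return False
    try:
        deadline = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return False
    if deadline.tzinfo is None:
        # A "-0000" offset yields a naive datetime; assume UTC in this sketch.
        deadline = deadline.replace(tzinfo=timezone.utc)
    return now > deadline
```

A reader could use such a check to mark a party invitation as read, or push it down the list, once the party is over.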
I mean, just to continue that suggestion, one idea would be this: if I'm a maintainer responding to a bug report and I want to say to the bug reporter, "Look, I'm not going to close your bug report without consulting you. I'm giving you a week to check whether I've fixed your bug, but if you don't close the bug yourself and confirm you're satisfied within a week, it's going to close automatically", it would be nice for me to be able to emphasize with that email that they have to act on it. Then their email program would somehow emphasize that this is an action they need to take rather than just an email for them to read. They can still ignore it. I mean, they're not forced to read my emails, but if it could give them an extra clue, that would be really helpful and it might be more productive.

Yeah, that's something that could definitely be added to debbugs. You could send an email with something like a "delayed close" command.

OK, so do people feel they would like to help with this? Do you have some friends who are good at making UIs?

I just wanted to mention one more notmuch-related item, which is that we've been experimenting with sharing tags for the purposes of patch tracking. It's a really simple scheme and it's not tied to notmuch. It just uses a shared git repo that we push to, and then we grab those tags and push them into the notmuch database. Something along those lines, maybe with a different UI, could be useful for some of these schemes as well, where you work in a distributed manner, people make some determination about messages, and then you do conflict resolution via some version control system. Now, that's a pretty techie system, so it depends how techie a system you're willing to tolerate. But it seems to work OK for us.

Sounds good, particularly since some of those tags can also be assigned automatically.

OK, so I guess we're finishing a little early.
Thank you so much for coming in spite of the rain, and we look forward to DebConf13 with some better solutions, hopefully all put together.