All right, welcome to our July 16th, 2020 DevSync meeting. What should we talk about today? Let's just jump into the Jira sprints and see what we've got. Everybody see my Jira screen? All right, since Charlie is not here: Derek, do you want to give an update on the rollover prototypes?

Yes, so we're basically continuing to print parts, and as I get them done I bring them to Charlie and he does the post-processing. Right now we're about 75% done with the prints; they should be fully done by the end of the weekend. Charlie has been working through building those audio chambers and cleaning up all the prints I give him, so basically everything is still in progress. We should have everything printed by the end of the weekend, and then the goal is to get Charlie in, hopefully tomorrow but at the latest on Monday, and we can start cranking these out. Josh asked us to ship five, and it's a hard deadline; I think we can do that a week from today. I think that's pretty realistic. So that's where we're at there.

Okay, well, I think it's still in the right place then. All right, over to the first prototype.
So, the sprint blocking update. As outlined, we've got the review of the repo and the blog post, and it's been pretty well received; there's been a lot of good back and forth on that. We talked about consolidating the repo, so we got that all consolidated into one Mark II repo, and within that we will have all the Mark II related projects. That includes the off-the-shelf design, which is what we're retroactively calling the design we've been working with, the one that uses the ReSpeaker mic array; that's abbreviated as OTS in the repo. Moving forward, we'll call the new design we've been working on with Kevin the Mark II RPi, for the Raspberry Pi dev kit. That's the project we're working on right now, and it's all under hardware-mycroft-mark2. We're working to be a little better about keeping everything together.

As far as day-to-day work, I'm still working on the industrial design. I really only started that a couple of days ago, because we'd been going back and forth on little tweaks here and there, moving buttons around to accommodate a different, cheaper connector. Now that it's 95% solidified (we might still get some good suggestions from the community that change things a little), I can make some more progress on it. We also told the community we want to share a good update soon, so it would be great to make some real headway there. That's where I'm at.

Kevin is basically in a holding pattern: if we get feedback from the community, say some engineers take a look and tell us we need to change a few things, we'll make those changes, but until then he's waiting on that kind of input. All right.
Oh, the other thing is we've got an FDM printer on the way. We've been without an FDM printer here for a little while, and a couple of community members have been working on FDM versions of the off-the-shelf design. We want to be able to give some support there on the new design as well, so being able to do that with the community is going to be beneficial. So we've got that on order.

Did you see the new design from the community? On the printing front.

I have not.

It's on the forums and Reddit; someone created a whole new design. It was created by a community member, and it's basically a flat-sided version of the original off-the-shelf design. This one is more Echo-ish in appearance. I do have an action item on that: they asked me a couple of questions that I can't answer. They wanted to do a domed top, and if you remember, the original version of the Mark II was curved on top and we got rid of that because of something to do with noise cancellation and channel shape. They wanted an answer as to why that is, and they had one other question I couldn't answer because it was an industrial design thing. Can you reach in and answer their questions on that forum thread?

Yeah, I'll take a look. Well, it's not terribly domed, it's got a flattish top, but there are some things: it looks like they might be trying to do an upward-facing speaker, which would probably not be the most beneficial, since it's firing right at the microphone. But yeah, I'll get in there. It looks cool, though. I think that's about it for me.
All right, moving on to Fix Existing. I want to go through the Done column here, because I know I've completed a few of these.

Updating the Mark II image to the latest mycroft-core: I did find out what that issue was, Chris, as far as why we weren't getting past the first screen. That is actually done and working, so I closed that one. The cursor-visible issue was a one-line fix, pretty easy to do, so I got that one fixed too. The skills-loading progress bar going off screen: I made a small code change to the GUI and that has gone away, so that is good. I've rebooted my device about a hundred times in the last couple of days; it was kind of an intermittent thing, so I did a whole bunch of reboots to make sure it didn't recur, and I haven't seen it.

Hey, I've got a request for a feature. Sorry, I do want to get it in there while I'm thinking about it. Can we add a cron job to that image that cycles it? Let's start with every seven days: cycle it on Sundays at midnight local. At Wicked I have a lot of experience running equipment that's got to be highly available, and what we found with Linux was that we could keep equipment up and running, in some cases for a decade, if we just cycled it every week. It just stops the lockups; equipment simply needs that. As much as I would love to have something so elegantly designed that it never needs to be rebooted, especially with this COTS version...
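The weekly cycle being requested could be sketched roughly like this. Everything concrete here is an assumption, not something that exists in the image today: the script path, the config file location, and the idea of a JSON toggle for configurability.

```python
#!/usr/bin/env python3
"""Weekly-cycle sketch. A crontab entry such as

    0 0 * * 0 /usr/local/bin/weekly_cycle.py

would run it at midnight local time every Sunday."""
import json
import subprocess
from pathlib import Path

# Hypothetical config file, so the behavior stays configurable as discussed.
CONFIG = Path("/etc/mycroft/weekly_cycle.json")


def should_cycle(config: Path = CONFIG) -> bool:
    """Default to cycling; an operator can opt out via {"enabled": false}."""
    if not config.exists():
        return True
    return bool(json.loads(config.read_text()).get("enabled", True))


def main() -> None:
    if should_cycle():
        # On the device this would actually reboot the box.
        subprocess.run(["/sbin/shutdown", "-r", "now"], check=False)
```

The crontab line in the docstring matches the requested schedule: minute 0, hour 0, any day of month, any month, day-of-week 0 (Sunday).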
We just, as Michael pointed out, have drivers and all this other stuff. There's just enough that we're not controlling, or don't fully understand, that we're going to run into periodic issues. And with the unit that's been sitting on my counter forever, whenever I encountered problems with it, a reboot usually fixed it. So we may be able to put a lot of the potential bugs we're going to be facing to bed by simply adding a cron job where we cycle the damn thing once every seven days.

Okay. I mean, it's not a hard thing to do; I can create a ticket for that pretty quickly and put it into the image.

Yeah, I think it's a good feature to put into the backlog, but we should discuss it more before actually implementing it; there are some other repercussions. And we should probably make it configurable, right?

Yeah, that would be one of the issues.

All right, I'll add it to the backlog then. So right now I'm also looking at this freeze, the loading bar being stuck when we connect to the internet. I think that was also fixed when I did the progress bar fix, and I haven't seen it since either, so I'm probably going to go ahead and close that issue. I'll keep rebooting it often, and if I ever see it I'll take a look. This is MC-108.

Have you tried having internet problems? Like, for example, using the wrong password or something?

I'm not... yeah, but I think that's a bigger problem, because if we have the wrong password...
There's not really a way that I know of. I don't think we've coded the Mark II right now to identify that it can't connect to the internet and then try to reconnect; the Mark I does that by default if you change your password.

Okay, so we're not trying to solve this problem yet, because the whole boot-up and connect-to-internet-services flow is still outstanding?

Yeah, that can be included in that particular effort, but it's something we should consider.

Gotcha. We also closed a couple of other issues, right? I know I did some PR reviews this week too.

Yeah, I merged a few things, a few skills: fixes to the installer skill, the timer skill, the joke skill, which had a random bug in it, and the news and singing skills.

I still haven't had any breakthroughs on the one where the stop message is timing out in the integration tests, so I'm still stuck on that. The skill-date-time one, that's the one where I just put in that random 10-second sleep before actually kicking off all the tests. The thing happens so infrequently that it's hard to know whether that is definitively fixing it or not, but it has never happened with that 10-second sleep in there. I've been running that fork periodically to try to get it to trigger, and it has never failed. So I'm thinking maybe we just merge that, and if the failure ever reappears we can pull it out again. At the moment that seems like the best option.

So is that a hack to core, or a hack to the test suite?
A hack to the test suite. Essentially, it sleeps for 10 seconds after it thinks it's ready to start, and then does the first test.

Okay, and that's because if it starts too quickly, the very first skill-date-time test sometimes just fails?

It's because the very first date-time test sometimes fails, and I haven't been able to figure out why. My assumption is that something isn't ready in time, but I haven't figured out what that is. The fact that it was the first test in the entire test suite sort of sent me in that direction, and I haven't been able to get it to re-trigger with the sleep in there. That is the entirety of the information I'm going on.

Gotcha. So do we have a procedure, or a default way, of marking things that are hacks rather than actual fixes? Just in the comment. Do you put HACK in capital letters or something? How do you do it here?

Yeah, ideally something we can search the code base for, exactly. And you could reference the Jira ticket too. Most of the time we can genuinely search for comments that include the next major release tag, like "what do we need to do for 20.08?", because they'll generally be things like "pull this out at the next major release, once everything else is in place." We can also just do a hash-TODO.

Okay, classic.

Yeah, and reference the ticket number there. And I wouldn't close that ticket until we've actually solved the problem, because you're probably right.
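The workaround being described, a single grace-period sleep before the first test, tagged so it can be grepped out later, might look like this sketch. The ticket reference is a placeholder, not a real Jira key, and `run_suite` is an illustrative stand-in for however the suite actually launches.

```python
import time

# HACK (see the Jira ticket): the very first date-time test in the suite
# fails intermittently, apparently because some service isn't ready when
# the suite starts. Remove this grace period once the actual readiness
# gap is found; the ticket stays open until then.
STARTUP_GRACE_SECONDS = 10


def run_suite(tests, grace: float = STARTUP_GRACE_SECONDS):
    """Run each test callable in order, sleeping once before the first."""
    time.sleep(grace)
    return [test() for test in tests]
```

Because the marker is a greppable `HACK` comment plus a ticket reference, a pre-release sweep can find it before the next major release goes out.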
It probably is some service not being ready in time, and that is definitely something we'll want to fix for production software.

Yeah.

But if you're just wasting time beating your head against it for now, and you've got something that seems to fix it...

Yeah, I think this is just the easy way to validate whether this fixes it or not, because it will then get run many, many times per day. If we never see it again for the next two weeks, then I can come back and dig into what the actual problem is, now that we've isolated where it exists.

Okay, cool. Put a TODO in, leave the ticket open, and come back to it in two weeks.

I didn't get into it a lot more; I've spent a lot of time this week on the Mark II repo and comms and stuff.

Okay. Something else I wanted to talk about before Ken gives his Precise update is SEL-38. Gez and I have had a little back and forth on this one. When you try to create an account for an email address that already exists in our system, you get a 500 error, and you also get a message on the screen that says an account could not be created; it's an error message. Initially we thought this was a good thing.
I mean, the 500 may not be the best return code for it, but we thought it was a good thing because we wanted the error message not to say something like "this email address already has an account," in case somebody was phishing for accounts. Gez pointed out what is probably a good solution to this problem, which is to be better about sending emails to people when something happens on their account. This would be one such case: someone tries to create an account using your email, and either it's a warning to you that something bad is happening, or a warning that hey, you already have an account, so maybe change your password.

So my question is: do we want to close this as works-as-designed and have another ticket that deals with what happens when you try to sign up with an existing email address? Or do we want to repurpose this ticket?

Well, if it works as designed, then let's just close it, mark it, and make a new ticket to design a better system.

Is there an error code we can put out there? I'm just wondering: if there were a genuine issue in the system, it would be hard to know whether a given 500 error is a genuine 500 or an engineered one.

Why don't we look at how other companies, ones with larger security budgets, are dealing with this question? We could go to a few different sites, Reddit or whatever, make this same attempt, and see how they handle it. And if we agree with their thinking, handle it that way.
I would argue this is a rate-limit thing, but the easy way to handle it is to throw the message that says "you already have an account," then, as Chris says, send that account an email saying somebody tried to access it, and restrict the number of attempts anybody can make on that page, say one every five seconds or something. But even then, if we're providing "you already have an account" feedback, somebody could set up something low and slow that makes an attempt every five seconds and starts cycling through popular user names. So, to summarize: let's go see how other folks are doing it, and if we like their ideas, let's steal them.

Yeah, that's why I was suggesting that, from the front-end UI, the email process be exactly the same for the person trying to sign up, whether the email address exists or not. That way there's no way someone can script an automated check for whether an email address exists, because to them it just looks like "oh yeah, I'm getting my verification email." But in terms of the different error code, I agree: I think this needs to be its own ticket and its own task.
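The enumeration-safe flow being proposed, an identical response either way with the difference routed into email, could be sketched like this. The function name, arguments, and message text are illustrative, not Selene's real API.

```python
def handle_signup(email: str, accounts: set, outbox: list) -> dict:
    """Respond identically whether or not the address already has an account.

    accounts: the set of registered addresses (stand-in for the real store).
    outbox:   collects (recipient, message) pairs the mailer would send.
    """
    if email in accounts:
        # Warn the existing owner instead of tipping off the requester.
        outbox.append((email, "Someone tried to sign up with your address. "
                              "If this wasn't you, consider changing your password."))
    else:
        accounts.add(email)
        outbox.append((email, "Verify your address to finish creating your account."))
    # Same status and body on both paths, so nothing leaks to a scraper.
    return {"status": 200, "body": "Check your email to continue."}
```

Rate limiting the endpoint, as suggested above, would layer on top of this without changing the response shape.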
But I just wondered if we could quickly switch that error code so we know whether or not it's a real error.

The suggestion I had in the ticket is a 400 Bad Request. Usually that means a malformed request, but at some point you might be able to put an error message into even the 500 response that says why it failed; or not why it failed exactly, but that it wasn't really a problem. In general, most people aren't going to dig past the pop-up on the window that says they couldn't create their account. We have a lot of savvy users now who can get to those 500 errors and try to dig into what happened, but generally people aren't going to see the 500; they'll see the error that comes up on the screen. So I could certainly look into changing it as a stopgap, just to make it not a 500.

To be clear, this is an issue with our current GUI, the Selene backend GUI, right? It's not the API or anything like that?

It's the API that's returning the 500, which is an internal server error, and then the UI translates that and puts up a little snack bar that says an error occurred creating your account. So most users will see the snack bar and not look any further, except when they call in to support. But the person who reported this issue knew enough to go into the browser JavaScript console and see what happened; they saw the 500 error there.

But I think the point Gez was getting at is: how do you differentiate that from the backend API actually being broken?
Right, because this comes up when you're responding to queries from people. Sometimes people are very adamant that they have never signed up for Mycroft and that it must be a real, genuine error. You go back and forth a bunch more times, and then they realize that, oh wait, they finally did follow the instructions I sent them the first time, tried to reset the password, and everything's fine now. So it's more about how we shortcut that. If there were a different error code, I could be very confident that if you're getting a 400 error or something, it's because you already have an account, or someone else is creating an account using your email address.

So basically it really is works-as-designed, right? It is designed such that people get a cryptic error message, they reach out to customer support, and then Gez gets a personal email. That's the system Chris is describing.

Yes.

So it sounds like Gez does not like this system and wants a new one. He doesn't like how it was intended to work.

Yeah, the problem with doing a 400 is what else we'd have to call it. We could call it a 400 "account already in use," but then if somebody looks in the console they'll see that, and they'll know. The whole reason it is the way it is now is that we don't want people knowing they've found an account that exists.

Well, all right.

Yeah, but that's half the battle. Half the internet has extensive password reuse, right?
So they may have a strong password, but they use the same password across 9,000 different sites. Once a hacker has identified that an email address has an account on our system, they can load one of those multi-gigabyte credential dumps hacked from Yahoo and whoever else, find that email address, and say "oh, his password at Yahoo was, whatever, 12345-bang." Then they try that on our system, and now they're in. The piece of information they needed to close that loop was whether or not the account even existed. So the solution is to not communicate to a random hacker, when they try an account, that the account already exists in the system. Or, as I was suggesting, we rate-limit it so you're only allowed a certain number of attempts, and, as Chris pointed out, we send the user an email saying "hey, somebody just tried to log in with your credentials."

Yeah. Okay, so we're back to the original answer, which is works-as-designed, and we need a new ticket to create a new system.

Correct.

All right, I'll close this one then, take Chris's comment, and use it to start another one. Gez, I don't know how much of a pain this is for you, but... yeah, okay, we've already spent too much time on it. Moving on.

I was going to say we've probably spent too much time on it already. Apologies to everyone listening after the fact. We'll get better.

All right, so I'm probably going to start working on these two volume-settings issues next. I found some information on GitHub about the old Mark II Pi image and how they built it, and there were some commands in there. I'm going to try those; hopefully they will address these two issues.
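The exact commands from the old image build aren't quoted here, but if they're ALSA-based, a thin wrapper like this sketch might be the shape of the fix. The "Master" control name is an assumption about the Mark II's sound card, and `amixer` must exist on the device.

```python
import subprocess


def volume_command(percent: int) -> list:
    """Build an amixer invocation, clamping the level to 0-100%."""
    clamped = max(0, min(100, percent))
    return ["amixer", "sset", "Master", f"{clamped}%"]


def set_volume(percent: int) -> None:
    # Shells out on the device; assumes an ALSA control named "Master".
    subprocess.run(volume_command(percent), check=True)
```

Keeping command construction separate from execution makes the clamping logic testable off-device.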
So yeah, I'm going to work on that next and then move down the list, I guess.

And now the Precise update. Is there a way for me to share a file with everybody? I tried dragging and dropping and it doesn't seem to work. How do I share a file, an image?

Well, you can definitely share your screen.

I was trying to move away from that. Okay, I'll just share my screen.

Jump in anytime, but yeah, Google Chat doesn't let you drop a file there.

I've obscured the recording until you can guarantee you're not showing your password on the screen.

All right. Yeah, I don't know if this is as great as I had hoped. Can you see that at all?

Yeah, I can see it.

I can shut off one of these monitors; I think that's part of the problem.

Yeah, let's do that.

Okay, so we have before you a set of test results from some models I built, the "kentest" models. The hyperparameter, for lack of a better term, being modified here is the epoch count. Kentest-60 ran for 60 epochs, kentest-600 for 600, 3k for 3,000, and 6k for 6,000, and then here are the resulting accuracy levels on two or three different test data sets. Does that make sense to everybody?

Yeah. Looks like 6k is the winner.

Yeah, but what's interesting is to look at the contribs numbers at 3k. It's almost like you reach a point of diminishing returns somewhere around there; you get a little bit better performance if you go to 6k. That's just to show you some of the experimentation. The model I would probably hand over initially would be kentest-6k. Does that make sense?

Yeah, I'd say that's fair. I'm going to stop presenting, by the way. Yeah, I think that's fair.
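The 3k-versus-6k observation suggests a simple selection rule: take the cheapest epoch count whose accuracy lands within a tolerance of the best. A sketch, using toy numbers rather than the actual results shown on screen:

```python
def pick_epochs(results: dict, tolerance: float = 0.005) -> int:
    """results maps epoch count -> accuracy.

    Prefer the fewest epochs within `tolerance` of the best accuracy,
    reflecting the diminishing-returns point seen in the sweep.
    """
    best = max(results.values())
    return min(e for e, acc in results.items() if acc >= best - tolerance)


# Toy numbers only, to show the shape of the decision:
toy = {60: 0.80, 600: 0.84, 3000: 0.850, 6000: 0.852}
```

With these toy values, a 0.005 tolerance picks 3,000 epochs; a zero tolerance insists on the absolute best and picks 6,000.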
I've been getting a little more knowledgeable about the internals of Precise than I had expected. I've been looking at trying to identify the hyperparameters, which is not a trivial issue; they're spread out across a lot of different places. I'm trying to figure out which ones will give the biggest bang for the buck, along with some choices of functions being used for loss and gradient and so on. So I'm not done on hyperparameters yet. I do know that an exhaustive grid approach is probably not going to be the best approach, and there are some sensible things that can be done, like keeping batch sizes to powers of two. The bottom line is I'm working on that in the background. I'll probably build some models this weekend using TensorFlow; I built the other one using scikit-learn, and I want to get a better handle on TensorFlow in general. Some of the obvious hyperparameters are the LSTM units and the dropout rate, but I just haven't gotten there yet.

I was building what I just showed you because... I dropped a name in the chat, which is TensorBoard. TensorBoard will let you build visualizations of how these hyperparameters behave, graph performance over time, and it provides a lot of visuals to help you suss out the information you're looking for. I'd strongly encourage you to figure out how to use it and put it to use.

Yeah, I haven't gotten there yet. Let me figure out how to build a TensorFlow model before I start figuring out how to use its measurement tools. That's where I'm going with this, but right now I was more focused on getting something a little more concrete, a better model for them. And then if they wanted to do a custom wake word model,
I was prepared for that. Anyway, that's what I've been working on. I also spent a day or two understanding and documenting our data pipeline. Gez, I updated that PR-56 ticket with a link to the actual wiki page for the data pipeline, which contains most of the knowledge I've transferred there, and I'll explain it briefly here.

We have a system in place where, whenever a device, or our code running somewhere (I don't want to say "device," because my laptop could be considered a device), believes it has detected a wake word, the sample is shipped to the cloud. Keep in mind that detection is, at best, as we saw from that chart, 85% accurate, so the data is carrying a 15% false-positive bias already. That data finds its way into one big old directory on a network-attached storage device, under mount nas, slash wake words. And thank you, Chris, for spending the time with me to help me get onto these servers. That in itself was not a trivial issue, because the NAS is, I guess, located in the Lawrence data center and not part of the core Mycroft machines.

That being said, one thing that would be helpful would be to share that mount, if possible, with the Lambda 2 server. I don't know how difficult that would be to configure, but the rationale is that the NAS has 18 terabytes of storage available, of which one terabyte is used, whereas Lambda 2, which is where the models are actually trained and produced, is down to a little under a terabyte of storage left. I could easily use a terabyte or two here and there, because our initial data set is a terabyte now.

Now let me get into the pipeline as it exists, explain where we're at, and then explain what's got to change. Right now that data goes to a server, which is running fine.
It's a Flask app. It accepts the audio and stores the WAV file, with a structured file name, as you pointed out, Josh, into that directory. The fields in that file name are questionable at best. It carries a model identifier, which is a big honking string that I don't think ever changes, and it carries a session ID, which I don't think is meaningful to anybody. The meaningful information it does contain is a user SHA of some sort, which I'm assuming Chris could tie back to users. So the data gets dumped, in its raw WAV format with a structured file name, into this big old wake-words directory.

There currently exist over 1.6 million samples in that directory. So if ls on a directory of 115,000 files hangs, this hangs a couple of orders of magnitude worse. Now, at one point in time the pipeline was also taking that data and storing it in a MySQL database in a somewhat structured manner, and that behavior was discontinued around August of 2019. The fallout is that there are 1.6 million files in the directory and 1.1 million rows in the database. I touched base with Matt on that; I didn't really have to, but I verified some things. Somebody was manually running a sync script periodically, that person is no longer around, and the sync script hasn't been run since August. Which is not that big a deal, and I'll explain why in a minute.
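Since the manual sync stopped, roughly half a million files on disk have no database row. A first pass at quantifying that gap could simply diff the two sets of sample identifiers; how IDs are derived from the structured file names is an assumption here, not the pipeline's actual scheme.

```python
def sync_gap(ids_on_disk, ids_in_db):
    """Return sample IDs present on the NAS but missing from MySQL, sorted.

    Both inputs are iterables of opaque identifiers (e.g. derived from
    the structured file names). Set difference keeps this O(n).
    """
    return sorted(set(ids_on_disk) - set(ids_in_db))
```

Running this against the real directory listing and a `SELECT` of stored IDs would give the exact backlog a revived sync job needs to process.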
Anyway, the point is that we have a lot of data. However, it's not of much value as-is, since it's carrying at least a 15 or 20 percent error rate, so it certainly can't be used for training or testing until it's been manually validated. There was a process in place for that, which I assume is no longer going on. I'm assuming that process, because this is what Matt communicated to me, was responsible for producing the 114,000 samples we have in the tgz that Gez shared with me. Those 114,000 samples have, to the best of my knowledge, been manually validated: yes, this contains the wake word, or no, this sample doesn't. That's the layer of classification that was put on top of the raw data. According to Matt it was done by two people; if they were in agreement, the label stood, and if they weren't, the sample was put back out so a third person could break the tie. That was the process.

So I feel pretty confident that those 114,000 samples are, let's call it, good data. The 1.5 or 1.6 million beyond that are of questionable value until they are manually validated, or until such time as I can build some sort of classifier with a high enough confidence level that we could allow that data into a training set.

What I haven't done yet, because it's not clear to me how much data I need, and Michael, we alluded to this last meeting, is determine whether we've reached the point of diminishing returns on sample size. That's some of the experimentation I'll be doing this weekend.
I'm using, I don't know, 50 or 60 thousand out of the 114 thousand samples, the ones I was able to classify. What if I were to go back and reduce that even further, to an even higher confidence level: keep the high- and low-pitched voices, drop out a bunch, and maybe only have 20,000? Would that be better? I suspect not. My gut instinct tells me we're still at the point where more epochs is good and more data is good. Obviously you can reach a point where it becomes detrimental; I don't know that we've reached it yet, and I'll have to do some testing and look at the loss curves over the weekend to see if we're there. That applies to epochs as well.

That said, if we decide this is something we'd like to continue to pursue, namely the accumulation of more data (and it's not clear to me that we need more than 1.6 million samples), I'd like to fix the process, so that if I become a splotch on the side of the road tomorrow, Mycroft doesn't suffer, and somebody no longer manually running a sync task doesn't cause disruption. So I will basically go ahead and start pulling those files and moving them into subdirectories that represent sanity, sanity being somewhat less than 50,000 files per directory. If there are fewer than 50,000 files in a directory, you can ls it, you can cat it, find works as advertised, and it doesn't take hours to come back from an ls.
So I'm going to break them up into subdirectories of no more than 50,000 files each, and modify the current process with a cron that runs every night and moves that day's samples into the latest directory, until that directory reaches 50,000, at which point it creates a new directory and starts sticking them there. Not a major problem, but something I think might be worth it if we decide to continue to collect samples.

When you write that ticket, can you do me two additional favors? One: du the disk as part of this nightly process. I've had trouble in the past with people not pruning databases and the like until they consume whole machines, so as part of this process you should du the disk, and if it's getting too close to full, the job should stop ingesting, or delete the oldest data, or something. It should not just keep filling the disk until the computer locks up. And number two: if you could look at FLAC or some similar lossless compression mechanism and crunch those WAVs down into something that's easier to store, we may be able to save quite a lot of storage space. That 17 terabytes of storage is actually eight 10-terabyte disks; it's both mirrored and RAIDed across two separate NASes, so every bit that you write is actually four bits, and it's all the storage we've got for the time being. So if there are places where we can intelligently use compression, we should go ahead and do that. That said, if the wake word data needs to be processed as WAV files, let's not set ourselves up for a process where we have to decompress a terabyte of data before we run training. Balance the need to store it compressed with the need to train on it. But those should both be part of whatever nightly process you're running.

Yeah, that all makes sense.
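The nightly housekeeping being agreed on here could be sketched as follows: cap each archive sub-directory at 50,000 files and refuse to ingest when the disk is nearly full. The paths, shard naming, and 10% free-space threshold are all illustrative, and the FLAC step (for example, shelling out to `flac --best` per file) would slot in after the move rather than being shown here.

```python
import shutil
from pathlib import Path

MAX_PER_DIR = 50_000   # keep directories small enough that ls/cat/find stay usable
MIN_FREE = 0.10        # refuse to ingest below 10% free space


def disk_too_full(root: Path, min_free: float = MIN_FREE) -> bool:
    usage = shutil.disk_usage(root)
    return usage.free / usage.total < min_free


def shard_samples(incoming: Path, archive: Path, cap: int = MAX_PER_DIR) -> int:
    """Move WAVs from incoming into numbered sub-dirs of at most `cap` files."""
    archive.mkdir(parents=True, exist_ok=True)
    shards = sorted(d for d in archive.iterdir() if d.is_dir())
    if shards:
        idx = len(shards) - 1
        count = sum(1 for _ in shards[-1].iterdir())
    else:
        idx, count = 0, 0
        (archive / f"shard-{idx:04d}").mkdir()
    moved = 0
    for wav in sorted(incoming.glob("*.wav")):
        if count >= cap:
            idx, count = idx + 1, 0
            (archive / f"shard-{idx:04d}").mkdir()
        shutil.move(str(wav), str(archive / f"shard-{idx:04d}" / wav.name))
        count += 1
        moved += 1
    return moved
```

A nightly cron would call `disk_too_full` first and skip `shard_samples` (and alert someone) when space is low, rather than filling the disk until the box locks up.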
What are you running on those, RAID 50? Well, 50 would be mirrored and then RAID 5... Yeah, they're mirrored and striped, and then there's two NASes running as identical mirrors of each other. So there's two separate NASes; each one is mirrored and striped. Right, when you say mirrored and striped, I just want to make sure, because RAID... there's a lot of different... I actually wrote a lot of RAID firmware. So my question is, when you say mirrored and striped: striped with redundancy, so you can nuke any disk and it'll stand it back up? Yeah. So you have three drives, right, associated with each sector? You have the parity drive, so you can recover any one of the three if they go down, from the other two, correct? It's four drives, but yeah, same basic principle. You can nuke any one of them and it'll keep working. Right, because you're also mirroring. Okay, cool. So you're running RAID 50. That's awesome.

What I was going to recommend: I worked at a company in Silicon Valley when I made my Mecca out there, which every programmer needs to do before he passes on. I worked for a company called Clockcast that processes about 50 petabytes of data a night, and they were leading developers of high-performance file systems for large data sets like this. They have the Clockcast file system. But more importantly, they were using Apache Spark, I believe, for large data sets, and that's kind of a buzzword that investors might want to hear; it's typically associated with machine learning and large data sets. Do you want me to investigate migrating that data storage to a Spark system, or no? Not at this time. We need to focus on making the Mark II actually work before we get into anything like that, but I think it's a good long-term ticket. Okay, good. All right. Yeah, so just to be clear: I'll build a simple cron that will handle the overflow, and I will check the storage each day to make sure it's not getting too large.
Although, I warn you, I would probably shut off gathering of data today, since that drive is 90% full. If we could mount the large terabyte NAS volume that we have into Lambda 2, that would make it a lot easier, because Lambda 2 is the one that's running out of disk space, number one; and number two, then we wouldn't have to transfer large amounts of data to build up training data sets. Kind of the whole concept of the index files I built was that the data just stays static wherever it was stored, and then you pass around index files to reference it. So, you know, if we could get that mount onto Lambda 2, that would be unbelievably awesome. But yeah, I'll look at what it takes to get that done, and you and I can have an offline discussion about whether or not that makes sense. It's going to depend on how you're accessing the data. In other words, if we have to move the entire data set over the network every time you want to train, then it may be quicker to just slap another disk on the box. No, what I'm saying is your NAS is network-attached storage; you just need to allow Lambda 2 to access it. Yeah, but they're in buildings that are two miles apart. Hmm, I don't think the electrons will care about that. All right, I'll look into it.

Anyway, yeah, so that's where I'm at: managing the data pipeline. There's actually a wiki page that I cut for that, which I referenced in that ticket, the one that you did, PR 56 or whatever, and you can read up on what I found there. It has most of the information about the Lambda 2 server and the data storage server and process, and where I'm going with what it should become versus what it is. I'll continue to update that document as I do that. And yes, Josh, I just made a note: create a ticket for the cron and stuff like that.
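The index-file idea described above (the samples stay put; only small lists of paths travel between machines) can be sketched like this. The plain one-path-per-line format and the function names are assumptions for illustration, not the actual pipeline code:

```python
def write_index(index_path, sample_paths):
    """Write one sample path per line. The audio samples themselves
    never move; only this small text file is passed around to
    describe a training set."""
    with open(index_path, "w") as f:
        for p in sample_paths:
            f.write(p + "\n")

def read_index(index_path):
    """Load the list of sample paths referenced by an index file,
    skipping blank lines."""
    with open(index_path) as f:
        return [line.strip() for line in f if line.strip()]
```

With the NAS mounted on Lambda 2, a training job would read the index and open each referenced file directly over the mount, instead of copying terabytes first.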
So, okay. So, yeah, with that I can give them another model, which may be better. I can also speak with the developer and help him fine-tune it when he gets it installed for their environment, and then we can take it from there. As far as a longer-term, better model goes, I think that's going to come with more analysis of the model and the hyperparameters and figuring out how to get control of that, and that's a little bit longer-term out there for right now, I guess, although I'm working on it. Okay, thanks, Ken. Yeah, that's a good update. Sounds like you've got that under control. Yeah, let's make sure that we are not approaching the limit of usefulness in terms of adding new samples to get a better system, before we go ahead and try to find a way to utilize those 1.6 million samples. Yeah, I mean, the only way we're able to utilize them is to go back to a business process flow where they're manually being validated. Yeah, exactly right. So, you know, we're talking about standing up tools to do that anyway, because we want to be able to generate new wake words, but if it's not necessary, we've got lots of other things to work on.

Now, Michael, just so you know, and I understand the concerns here: having them create their own custom wake words on their own and letting them experiment with that might serve two purposes. I mean, they certainly can do it; it's not that difficult, and I can teach them how to do it in like an hour or two. It might keep them busy, and then they develop a better appreciation for the model we ship them. But really, the only constraint on them training their own wake words would be their data samples, and that's entirely up to them. I mean, I can even offer that they ship me up, you know, a tar file of samples, and I'll train it for them and give them a model back. It doesn't take that long; it's not that big a deal. Yeah, that's not really the hurdle.
I think it's collecting the samples, that's the hurdle. And, you know, I mean, you can explain to them what the process is and how many they need. Yeah, I understand; we're reading between the lines, and I won't go there. I'm just saying, just between me and the fence post, it's not that tough. Yeah, no, I know it's not that tough, it's just the data collection. But, you know, I mean, there are companies that have formed a whole business model around building these wake words, right? Correct, correct. All right.

All right, anybody have anything else on the bug fixing sprint? Can I just make a final point on that data stuff? Anything that we do to the raw data, I want to be very careful that we're not doing something that can't be reverted, that we don't have a backup of the original data for. Like, we don't want to do any compression or processing that we don't have a backup of, I think. Yeah, that brings up a good point, by the way. Yes, I should have mentioned this. So I guess the bottom line, Josh, regarding data integrity and privacy, is: up until last August, if somebody opts out, and if Chris can tie that person to the person identifier in the database, I can get rid of all their samples. Right now I couldn't, until all the data is moved out of that big directory, because rm fails miserably on directories that large, as does ls. Once I have the data moved out into manageable subdirectories, then I could get rid of it even in the absence of it being in the database. Does that make sense? In other words, I don't need the database, since I know the structure of the file names, and I could manually write code to go through and say, ah, this file belongs to this guy, this file belongs to this guy. It would obviously be preferable to say select file names where user is blah, but, you know, again, that's only going to cover up until last August. So once I get the data moved out into the little subdirectories...
I could certainly do that. Today, I could not guarantee you I could whack somebody's data if they asked us to. Yeah, you just string together a bunch of Linux commands; it's very doable. I can never remember the exact syntax, but I just put a link up in the chat. You just search Stack Overflow; I put my search in there as well. Yeah, so it's available; it's just that you can't use the out-of-the-box tools, you kind of have to work around it. You know, you could use the file names as a guide regarding how to organize the files, and then make directories out of some of the file name parts. No, no, I understand. What I would recommend is, Gez and Josh, whatever you guys send me, just go try it on your 1.6-million-file directory and see what it actually does. Yeah, I can use head and stuff like that, and I've definitely... I mean, I actually wrote a Python script; that's how I can tell you there's 1.6 million files in there. But if you're going to try to use sed and awk and grep and all that stuff on that directory, good luck.

Okay, moving on. Oh, well, it does sound like there is a missing piece of our process here that's been highlighted, since things are no longer being added to our database. We supposedly have a system whereby we delete people's samples from the database if they turn off their opt-in, right? Have we been doing it? Do we have a system like that that is actually executing right now? Not automatically, but as far as I know nobody has flipped that bit, so we could go back and check. Right. Okay. Well, yeah, we should create a ticket to actually implement that system if it's not currently running. And we're not sharing this data set with anybody outside. Should it be... if someone turns off opt-in?
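The "whack somebody's data" pass over a directory too large for ls or rm globbing is what `os.scandir` handles well, since it streams entries instead of listing them all at once. A minimal sketch; the file name layout (account ID as the leading dash-separated field) is an assumption based on the naming scheme described later in the meeting:

```python
import os

def delete_user_samples(directory, account_id, dry_run=True):
    """Stream directory entries and remove every sample whose name
    starts with the given account identifier. Assumed name layout:
    '<account>-<session>-<time>-<model>.wav'. With dry_run=True the
    function only reports what it would delete."""
    removed = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.split("-")[0] == account_id:
            removed.append(entry.name)
            if not dry_run:
                os.remove(entry.path)
    return removed
```

Running it with `dry_run=True` first and eyeballing the returned list is the quick safety check before actually nuking anything.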
Does that mean that they stop opting in from now on, or does that mean that we delete all of their data that they ever contributed? It means, from a business rules perspective, we want to nuke everything we've ever touched from that end user. So if you hit opt out, your data is gone, right? As quickly as we can do it. Yeah. So, Ken, when you're building those subdirectories, consider building them based on account ID, because that way it would be very easy to find an account, instead of having to traverse 1.6 million files and subdirectories; we could just find the subdirectory that has that account ID and blow that away. Well, yeah, I'm not sure how I would want to partition the data yet; that was something I was considering late last night before I went to sleep. You know, but for right now, I was just going to start moving 50,000 at a time into the directories. All I was getting at was, if somebody said today, whack all my data, I couldn't. Once I get it broken out into directories of less than 50,000, I could. But today I could whack all their data up until last August; that's all I'm saying. In other words, I'll be prepared to do it should somebody decide they want to opt out. Okay, thanks. Yeah.

Okay, Mycroft Sprint 12, which is kind of our bucket of things that don't really fit anywhere else. Selene 68: I have not seen any problems with the Selene release that's in test, so my plan is to move that to production tonight, and then, once that's been running for a little bit, to add these new metrics into our new metrics website. So this should be done soon. And while I'm doing things tonight, I'm also planning on doing this WordPress production droplet upgrade as well, since I'll be doing things after hours. So that was my plan for tonight; then those two will get closed. "Upgrade Mattermost" seems to be a new one. Is that... Gez, you put that one there?
Yeah. So we use the extended support releases, or something like that, of Mattermost, and the current one that we're on is at end of life. So there should be a new one out, as of yesterday; I haven't checked if that's been released, but that should bring us through to mid-2021. Yeah, but then the other thing that we will need to look at pretty quickly is, from my read, our license only extends to 5,000 users, which we're about to hit. So I need to do some more reading about what's going to happen when we hit that 5,000, whether, you know, deactivated accounts count towards it or not. Yeah, I think Mattermost is free for open source communities, if I remember correctly. Yeah, we have to go to them periodically and get a new license key; because it's free, you have to jump through some hoops. Okay, but yeah, I know I've been on the phone with their CEO several times, so we can just reach out and be like, hey, we need more licenses. I think that's the way to handle it, but if it's not, then we'll cross that bridge when we come to it. Cool.

But yeah, in terms of the upgrade, Chris, I wondered if that's better sitting in your bucket; sorry to do that to you. I remember it just being a royal pain last time we did it. It was a world of pain, you say? Yeah, this was not a fun upgrade last time I did it, but, yeah, I'll start keeping it on my radar. Yeah, I don't think we need to prioritize it; if it doesn't expire until October, then let's do it in September. No, no, no, the one that we're on at the moment expired two days ago. Well, yesterday, sorry. I see. Yeah, there are two options; there's two ESR options. Okay, yeah, I'm just really sorry. Yeah, yeah. Well, I mean, I can look into this since it's already expired. I mean, it's no longer supported.
Does that mean it's not going to work, or is it just... Yeah, if some security issue came up tomorrow, then they wouldn't push a fix for it. Okay. I think we can probably safely delay this. We don't transmit any secure info over Mattermost, right? Not intentionally. All right, let's not spend any time on this until we've got somebody who, you know, is dedicated to infrastructure stuff, because, yeah, it would only matter if there was an actual vulnerability that let people in; it would have to be pretty bad to really be an issue. And I'm sure it would be reported still. I think it's safer if we move off it, you know, when we can, but I don't think that needs to happen today, for sure. All right. Anything else? Yeah, that's it for the sprint.

Okay, so normally that would be the end of the meeting, but because I like really long meetings, I was going to drag it out a little bit. Sounds like fun. Is anybody still listening to this? Actually, what I want to do is discuss how we might make these meetings a little bit more efficient. So the DevSync is, you know, something we started doing, I guess you guys started doing it last year, to just stay in touch, and I think it's a good idea. But I think that there's a fair number of issues that come up that aren't really pertinent to the whole team, like when we're talking about Mattermost upgrades, or anything having to do with standing up a new server, or something like that. So I'm wondering if those are things that can just be communicated one-on-one directly and resolved that way, and we can kind of keep this to issues that involve the whole team. So, like the Precise stuff: I know not everyone's working on Precise, but it's critical.
I think everybody should know what's going on there. So I know that eats up a lot of our time on these meetings, but, like I said, I think it's important; I think everybody needs to know what's happening. And, you know, likewise with the Mark II: I want to keep everybody up to date on the hardware side of things. But that's my opinion. What do you guys think?

Before we get into that, I need to throw a wrench in. If we think we are in possession of data for people who have opted out, we do need to move that right up to the top of the priority list. So I just sent you an email that said, basically: one, we need a process so that when they flip the bit, we nuke it all. Number two, we need an audit process, because we're not keeping the historic state of that bit; in other words, we don't know if somebody used to opt in and is now opted out. We need an audit script in case the process fails at the time they hit the button. So if they hit the opt-out button, and for whatever reason our systems are down, or they don't communicate, or we have a bug or whatever, and their stuff doesn't get nuked, we need an audit system to come back through every 24 or 48 hours, verify the states of those bits, and then, if there is something in the directory that belongs to a user who has now opted out, reach out and nuke it. And then, item three: we need to run the script I just talked about in item two basically right now, because what we just heard is we very well might be in possession of data for people who have opted out. So we need to reach into the system and take care of that. So, Gez, can you and Bayer and Ken get together and figure out how to provide Ken with a list of those unique hashes? And then we can, you know, hand-jam this the first time out. We can just run a script that basically says: any of those hashes that are opted out, if we've got them anywhere in that directory, we're going
to nuke that file. I mean, something quick and dirty. But like I said, we just found out we may be in possession of data we're not explicitly authorized to have, so we do need to move that to the top of the priority list. Yeah, just give me a minute to take that in. Yeah, this is Selene issue 83, which I logged a few minutes ago. So, yeah, I agree with Josh; we should get on this. Okay. Chris, I'll get with you first thing in the morning and give you some user SHAs, and you see if you can reverse-determine who they're associated with so that we can then do that mapping. Okay. Yeah. We keep it in the database... I'm sorry? We'll figure something out, because if we can't back-trace whose data it is, that's a big problem. Yeah, I'm thinking we can. I'm thinking we can; it's the account ID that's coming from mycroft-core. Yeah, if I wrote it, there's a good chance that two things happened: I grabbed the account ID, and then I wrote a salt onto it, so it's non-reversible, and then I hashed it. So you need to find the code where I salt that and figure out what the salt string is in order to figure out who it is. Okay. If I wrote it, it's almost certainly salted. Well, that's going to make it a little more challenging, Chris. Yeah, find the code; the salt string is probably static, so once you know what it is, it'll be easy. Otherwise, it's easy to run collision attacks against it, right, if you don't salt it. Yeah, I'll double check. I do not think it's being salted. I think it's just plain: hey, this is the account ID, there's the session ID, there's the time, this is the model identifier; bam, concatenate them, .wav. So, Chris, I'll get with you in the morning on that and give you some samples, and we'll see if we can go. Okay, sounds good. All right, then I guess that concludes our meeting for today. Did you decide we're going to shorten this meeting, or do you... Oh, you're right.
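The salted-hash reverse-mapping just discussed amounts to hashing every known account ID (with a candidate salt) and comparing against the hash in the file name. A sketch of that check; the choice of SHA-256 and the salt-then-ID concatenation order are assumptions, since the actual code and salt string are what Chris still needs to dig up:

```python
import hashlib

def hash_id(account_id, salt=""):
    """Hash an account ID, optionally with a static salt prepended.
    Without the salt string, the hash cannot be mapped back to an
    account except by brute force."""
    return hashlib.sha256((salt + account_id).encode()).hexdigest()

def find_account(file_hash, known_account_ids, salt=""):
    """Reverse-map a file-name hash to an account by trying every
    known account ID. Returns None if nothing matches, suggesting a
    different salt (or hash scheme) was used."""
    for acct in known_account_ids:
        if hash_id(acct, salt) == file_hash:
            return acct
    return None
```

Running this over a few sample file names with `salt=""` would quickly confirm or refute the "I don't think it's being salted" hunch.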
We didn't... Well, I was going to shorten it. So, I think there's a lot of input there, and part of this meeting is to keep you, Michael, up to date with what's going on. So I think we could go over fewer things, but I would be worried that something, you know, you wanted to know might get glossed over. So, you know, as long as we had decent parameters on, you know, what we can handle on the down-low, and what really requires your knowledge and attention, any time, that would probably help. Okay, well, documentation is good for that, you know; emails, it's easy to CC somebody on those; you know, discussions in Mattermost, I can see those.

How about a radical idea? I'm down with radical ideas. What do you think: during this meeting, why don't you spend no more than five minutes a person; tell us what you did yesterday, what you're planning on doing today, what you're blocked on, and if there's anything really worth sharing with the rest of the team that you've learned. And then we can keep these down around a half an hour. That's a good idea. That's the typical stand-up meeting philosophy. I think part of the problem we're having now is that we kind of get off on tangents, right? You know, a ticket gets opened and we kind of just go off on this thing for 15 minutes, and, you know, hopefully get back around to what we were talking about. So probably the first thing we should do is identify our tangents early. Yeah, but I think that's an artifact of the fact that we just kind of open up the project board and go: there's all the projects.
Let's talk about them, versus: this is what I specifically worked on. So I just think we open ourselves up to this. That's my personal opinion. I mean, I feel like the project board should actually reflect what we're working on, if we're using it properly, you know, and I think that's part of the problem. This is probably also a push of: let's use the tools that we say we're using, to the best that we can, and, you know, keep discussion about those tickets in the tickets. And that will help that process of getting Michael informed as well, you know, if the actual discussion about that thing is in the tickets, and you can go in there at any time and have a look at what's going on. And we can flag it quickly in this meeting, like, you know, the discussion's there if you want to take a look, but we don't have to go into the detail of it necessarily.

The other thing I've been considering is moving from these Scrum-type boards to more of a Kanban-type board. I've been reading a little bit about this, and the fact that, you know, the Scrum procedures don't really fit how we're doing things, you know, as far as a sprint and all that kind of stuff. We're really doing things more in a Kanban way, where, you know, these are the prioritized tasks and you just pick them off one by one. So maybe, you know, just reorganizing a little bit how we do this. We're spread really thin, you know; right now you've got five or six things going on. If we were a little more focused, you know, we're doing bug fixes here, we're doing
one or two things there, I think that would probably help the meeting go a little... I'm sorry, a little faster as well. So, I've done Kanban extensively, actually, and I don't think it fits what we have. Kanban is a pull technology where you go up to the board, you have fungible assets, and the next fungible asset pulls off the next task, tasks that all fungible assets could accomplish. We are specialized somewhat here; I couldn't pick up Selene tomorrow. So it's not really a good fit for us at this point in time. Now, that being said, I mean, Atlassian has Kanban boards built in, and it's pretty cool, and I certainly like the bottom-up approach to Kanban; I just don't think we fit that model yet. Well, I mean, you could pick off the next ticket that you could do, right? I mean, I understand the idea behind it, that anybody could pick off any ticket, but in our reality, you know, you pick off the next ticket that's the highest priority that you... Yeah, that you could do. And I think there would have to be some, you know, communication about what's expected of each person. Yeah, I don't really see how that's much different, or any different, from what we're doing now, because we're treating it sort of like a Kanban: we've got the list of open bugs and you're just picking off the next one that you can do. But we've also divided it into projects, because some of these projects are discrete. Well, you can have multiple Kanban boards; you can have, you know, one or two, you know, per project. Sure, but we're using a Scrum tool for it right now. Yeah. The concept for Kanban is that changes come from the bottom up, right?
The canonical example is you're in a car factory. Your job, as an assembly comes down the assembly line, is to take a door from a stack of doors behind you, bolt it onto the car, and down it goes. And you turn around and you notice you're out of car doors, so now you have to go and pick them up. And maybe one of the recommendations you would make in the Kanban meeting is that it would be better if there was a blue ticket for when doors run low, and when somebody saw that as they were making their rounds, they would magically replenish my door supply, so I don't have to keep taking time off going over, picking them up, and forklifting them over, using up resources. But it's a ground-up recommendation for how to improve the process flow, and I just don't know how that applies here. I mean, I'm not saying it in a negative manner, but we're being driven from the top down. Interesting. Yeah, but Scrum doesn't really fit us either, because we're not really doing sprints. Yeah, well, the whole purpose of a scrum is just a daily, hey, five minutes: what are you working on, let everybody know; if you're stuck on something, let us know; did you learn anything? Great. No? Good, great, move on. Usually the board speaks for itself, and if upper management, like Michael or Josh, have any questions, they can pull the individuals associated with the project aside and ask how they're going. This doesn't necessarily have to be the forum for that. Yeah, I just want to say, thinking about this, you know, you're right there, Chris. I think the point of these meetings is hopefully not to communicate facts, because that should all be captured in the ticket system or whatnot, right?
The point of this meeting should be to facilitate communication where we have questions or need, you know, some sort of interaction, and to highlight that stuff. So maybe I'll go back through the recording here and just see what kind of interactions we're having, and see, you know, where we're just relaying information that doesn't need to be relayed in an all-hands meeting like this. And then maybe we can focus the discussion on just the things that are, you know, causing people problems, or might be new information. Yeah, I agree. One thing I've also found difficult is to tell what is done with some of these. Like, if you look at the "fix existing bugs" sprint, I mean, this list is huge, and knowing what got added to it since we talked last... Yeah, well, we haven't been rolling it over every week, right? So we could close it out every week and make a new sprint, like we had been doing with the generic Sprint 11, Sprint 12 things. So I think, if that's not too much trouble to do, then let's just do that. Yeah, like every week we could have a different "fix existing bugs" sprint. Yeah, exactly. Okay, maybe we'll do that then, because this is getting a little unwieldy. Yeah, I agree with that. Okay. Should we then, if we want to double down on the sprint kind of process, should we have a, you know, future "fix existing bugs" sprint with a backlog, and, you know, what we're doing in this week, or in this sprint? At the moment... Yeah, the concept of sprints is kind of, I hate to say it, kind of misplaced on us.
Well, because the whole purpose of it is to get the person that owns the project to give you feedback at the end of the sprint and give you direction for what the next sprint should be doing. You know, you may have thought that for the next three sprints you're going to be working on blah, but after one sprint, you know, the owner, the stakeholder, says, well, now that I see where we're at, do this instead of that, right? So it's to be agile, so you can swap and change on the fly what you're working on, to meet the needs of the customer. We don't really get that feedback, right? We don't have any product owner saying, okay, now that you've done this, I think this is the next thing you should do. So, you know, the concept of sprints is almost lost on this team, to a certain extent. I agree. Yeah, we don't have a customer per se, but I still think it makes sense to have, you know, weekly or bi-weekly, or, you know, regular objectives, right? I mean, I hate to say this, but we don't even have a schedule for when we're going to be done with the Mycroft core software being ready to ship, right? There are just so many features, and we're so understaffed for the amount of work that we have to do, that, you know, even planning that right now is kind of pointless. So, you know, we're just trying to tackle the low-hanging fruit and make as much progress as quickly as we can. Yeah, I mean, it makes sense to be more organized about this when we have a bigger team, and we can prop some of this stuff up a little more when we're in a position to fashion it. I agree with you guys; right now,
it's just kind of this thing I've identified: some of these things take longer than their slot; it's really hard to time-box some of these things, too. One thing I don't like about sprints is that, yeah, you can say this is going to take me a day or two to do this task, and I have these four tasks to do this week, but more times than not, you know, it takes you longer to do something than you think it will, and then it just winds up goofing up the whole schedule if you try to keep to the schedule. Or you wind up, you know, doing bad things at the end of the week to call a task done, when you know it's better to spend an extra day or two on it to get it done the right way, right? So, anyway, I mean, the other point of having regular check-ins like this is just to make sure that people aren't stuck or, you know, wandering off target, right? So, you know, on that front, I think we're doing all right. Yeah, and historically, I don't think we've gone deep enough. What we're discovering as we're digging into Precise is that we were not, you know, at least I was not, nearly as on top of how these things were developing as I should have been, because I was wandering around raising money, and, you know, the parties that were responsible for making sure that this stuff was done properly weren't doing their job. Yeah, it definitely seems that way. So, all right, let's let everybody go, and we'll probably try something a little bit different for our next meeting. All right.