This is a talk inspired by work that was itself inspired by people who have problems with the old CSV approach. When I took over Text::CSV_XS, a module built as a state machine that parses CSV the right way, it already had an API. There is another talk about APIs later by someone else, which I will also attend, because I'd love to know more about APIs.

The API is object-oriented, and CSV is seen by most people as a simple format. We just have to deal with that, but most CSV files are broken one way or another: too many spaces, the wrong quotation, bad separation characters, or whatever. Text::CSV_XS is very good at catching those errors. But newcomers said it's too complicated to start with, and the documentation is 60 pages long: "I want something simple that just works."

One user on PerlMonks said: "My problem is that modules like Text::CSV_XS don't open files for me." Most people there are not native English speakers, so there are a lot of nice typos — "don't open files from me too" — but that's beside the point. So I asked on and on, and she said she might use Text::CSV, which is a wrapper around Text::CSV_XS or Text::CSV_PP: if you use that and Text::CSV_XS is installed, it will use the XS version. Does everybody know what XS is? It's a C layer into the Perl core, so it's much faster than pure Perl.

I had joined PerlMonks back then to see what the open problems are with modules I hadn't seen yet, just to read about them: how do people use them, why do they have problems, what are the bugs they see as bugs? So I started thinking about this, and there's a huge discussion on PerlMonks somewhere where I say: this is the problem, what are your opinions about my thoughts?

This is the object-oriented way of CSV parsing and writing. Can everybody read it, or should I make it bigger? Nice. You use the module, you open a file — and you can add the encoding to the open, whatever.
That's opening a file in Perl. You make a new object, and one of the things you have to do is say that you accept binary data. Something new from about two years ago is auto_diag: when you expect problems, please turn that on, because the moment you hit an error it will show you the error, so you don't have to say "well, it doesn't work" — it tells you why. Then you do a while loop, you getline from the file handle you just opened, and you get a row with the fields. Most of the time this will work. If it doesn't, auto_diag will show you an error — there's an incorrect quote, or an incorrect character, or you forgot something — and you add another attribute to fix that.

As this is almost always what I want — I use Text::CSV_XS a lot — the first things I add are those two. That's silly: they should have been the defaults of the API. But I didn't write the original API, so everything I add later on has to keep the defaults it had 17 years ago. That makes it harder for people to use it from scratch; you have to read the docs and know why.

For writing, you open a file, probably with an encoding layer, you have a new where you say binary and auto_diag, and you also have to specify an end-of-line. Why do you have to specify an end-of-line here and not for parsing? If you don't specify one when parsing, the parser is versatile: it recognizes all valid end-of-line sequences — newline, carriage return plus newline, or carriage return alone, like the old Macs used. So if you don't specify it, all valid line endings are accepted, even mixed: a file with newline, newline, carriage-return-newline, carriage return is still valid. Did you ever get any of those? Yes? Wow — that's one of the reasons I wrote it. For output you do want one specific ending, so you have to specify it; the most versatile is carriage return plus newline, which is accepted everywhere. The defaults for separation and quotation are just the comma and the double quote.

Then you prepare a statement from the database — this is mixing in DBI.
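Reconstructed from the description above, a minimal sketch of the object-oriented API (file names and the UTF-8 layer are illustrative):

```perl
use Text::CSV_XS;

# Parsing: accept binary data (embedded newlines) and report errors loudly.
# No eol given, so all valid line endings are recognized, even mixed.
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf-8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) {
    # $row is an array ref holding the fields of one record
    print $row->[0], "\n";
}
close $fh;

# Writing: here an end-of-line must be chosen; "\r\n" is accepted everywhere
my $out = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\r\n" });
open my $oh, ">:encoding(utf-8)", "out.csv" or die "out.csv: $!";
$out->print ($oh, ["id", "name"]);
close $oh;
```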
So this is the shortest function to dump a database table into CSV. Please do ask questions, because I can skip, skip, skip and be done in three minutes; I've got 40, so I've got lots to show, but also time for answers to questions. You prepare and execute "select * from foo", and while $row = fetchrow_arrayref, you print the row, and you're done. Very simple.

Attributes: like I said, I had to add options — attributes — for all the problems people came with. "I want my empty strings to be quoted." Why? No reason. Well, always_quote means that even when quoting is not needed — like foo, f-o-o, which the CSV rules do not require you to quote — it is quoted anyway. Another nice thing about always_quote is that an undef is still not quoted: always_quote only quotes defined values. That means that if you have a NULL in your database, you get comma-comma in your CSV output, but an empty string becomes two double quotes, so when reading the CSV back you know the difference. You can retain NULLs, which is normally not supported in CSV. You can also switch that off, and then there is no difference between the empty string and NULL — but some applications that read CSV don't accept an empty field.

Whitespace around the separation character is another one: you can allow that. auto_diag you can also set to two. So zero, one, or two: one will report the error, two will die on it. And with auto-dying enabled it is increased by one: set it to zero and it acts as one; set it to one and it acts as two, and it will die. Loose quotes: it is possible that quotes are not correctly escaped, and you want to accept that anyway. All of these are for parsing broken data.
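The shortest table-dump just described, sketched with an assumed already-connected DBI handle ($dbh):

```perl
use DBI;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\r\n" });
my $sth = $dbh->prepare ("select * from foo");   # $dbh: a connected DBI handle
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    $csv->print (*STDOUT, $row);                 # one CSV line per database row
}
```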
blank_is_undef and empty_is_undef: if I've got an empty string — just the quotes — do I want it as an empty string or do I want it as undef? verbatim: never use that one. ("Wait, wait, wait. Now I'm intrigued.") Read the documentation. It is for one specific error that occurs, and there is no other way to get around it than using this; it has to do with embedded carriage returns when the newline is part of the separation. Well, yes: don't.

These two defaults are wrong in my opinion, but those are the defaults, to be backward compatible. So what I want is simply to skip those two. This is not legal code yet; it is just to demonstrate what I want — this is my thinking process towards the one-liner I started with. What do I want to change to make one-liners easy? First, skip what I always type anyway, because those are sane defaults, and for writing, the end-of-line should be what everybody expects. Simple. A lot easier to read. Then I don't want all that object-oriented stuff: I want one function that opens the file and returns an array of arrays, so I can do foreach my $row over the array of arrays. Output is even simpler: csv, out is test.csv, in is this selectall — an array of arrays, which DBI already supports.

But for the parsing: if you just read the whole file and it's a large file, you're going to use a lot of memory. That was the next remark I wanted to make — this is only suitable if all your data fits into memory. ("Or you can tie it.") You could do that, but why would you, if you could also write the loop yourself? That's much faster. ("When you parse, is there a reason I can't give it a list of field names, so I get an array of hashes instead of an array of arrays?") That's not the next sheet, but almost. Sorry — I was presenting my way of thinking towards this.
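What that thinking leads to — the functional interface, sketched here; csv is exported by recent versions of Text::CSV_XS:

```perl
use Text::CSV_XS qw( csv );

# Reading: one call returns a reference to an array of array refs
my $aoa = csv (in => "test.csv");
for my $row (@$aoa) {
    print scalar @$row, " fields\n";
}

# Writing: hand an array of arrays (e.g. from DBI's selectall_arrayref) to out
csv (in => $aoa, out => "test-copy.csv");
```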
So the question asked was whether the one-liner could take a file — simplifying this into a file — and writing is even simpler, and accepting an array of arrays. I've had a lot of discussion about this, also in Amsterdam with the Perl mongers, and I'll show you why. So, this is what it looks like. But now you can combine them, and that answers your question: because the output of a csv call is an array of arrays, and the in of another csv call accepts an array of arrays, I can combine them — the out is a new file, and the in is the result of a csv call.

("How does that answer it, though? If you change the name of the function from csv and give it a file, you could call it csv_file. And then you'd have csv_handle, csv_file, different functions, each taking a single parameter — just an in file name or an in file handle.") in accepts everything — a file name, a file handle, or a reference. csv just returns data, not a file: by default it returns an array of arrays, but it can also return an array of hashes. ("So that means you can still do what I suggested.") Yes, but you can do much more. The in argument accepts a file name and a file handle, and since Perl 5.8-something it also accepts the glob itself instead of a reference to the glob, so you can just pass *STDOUT, or *STDIN. The in is really versatile. So why would I need a csv_file function? in accepts everything and recognizes what it got. And because csv returns an array of arrays, you can pass it to the in of another call. The in also integrates with DBI. So if someone now asks me how to dump a table: this is the way to dump a table. How fucking cool is that?
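Combining the calls as just described — one csv feeding another, and the one-line table dump via DBI (connection details are illustrative):

```perl
use DBI;
use Text::CSV_XS qw( csv );

# The array of arrays returned by one csv call is a valid "in" for the next
csv (in => csv (in => "in.csv"), out => "new.csv");

# Dumping a table in one statement: selectall_arrayref returns an
# array of arrays, which "in" accepts directly
my $dbh = DBI->connect ("dbi:SQLite:dbname=sample.db");
csv (in => $dbh->selectall_arrayref ("select * from foo"), out => "foo.csv");
```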
I'm flattered. This is how to convert a file with semicolon-separated values into a file with comma-separated values. If you've got a Microsoft export that uses semicolons, just use this, and you've got a nicely, correctly formatted CSV file with commas that you can pass on anywhere.

RFC 7111 — anybody heard of it? I have to make this smaller; it's a boring sheet. Who knows Ric, the current Perl 5 pumpking? He tweeted me: well, what's this? So I opened it, and it said: you are referred to in "URI Fragment Identifiers for the text/csv Media Type". I do CSV, so I read it. It's a long, long document: the preamble alone is almost 100 lines, blah, blah, blah, and then you get an introduction. It is about selecting parts of a CSV file or CSV data stream. What's important is that the selection can be cell-based — like: from this stream I want cell (1,2) through (6,2) — or row-based, or column-based, et cetera. Well, that specification doesn't look too hard. Look at the date: this is very new, it's from January 2014, and the tweet was on the 17th.

So I implemented this CSV fragment specification, exactly the way the RFC describes it. I even had Twitter communication with the authors: what do you mean by that — is that uniquely defined? There is still one discussion open, because one of the things you can write is "4-*", which means column four to the end: I don't know how many columns there are, but I want to skip the first three. Or rows one to two, and four, and six to the end — I don't know how many there are. Very powerful, very nice. ("What are some of the remaining discussion points?") The remaining discussion is what happens if you write something like "1-2;*;*-4". ("I think that would probably be... no, no, no, that makes no sense. Everything up to the last four?")
The discussion is that anything that cannot be uniquely identified as valid should be ignored. ("Ignored, instead of erroring out?") Yes. ("And then others will see that as a problem and go: hey, you're supposed to be strict.") Exactly what I'm saying. My module croaks with the message that your fragment is invalid, but according to the RFC you may return anything you want, even nothing. ("So they're considering stuff like *-4?") "*-4" is defined to be invalid, and I croak on it. ("So you won't be able to say: from the first one to four before the last?") The first one is always one, so that's easy. ("Right, but if you don't know how many there are and you don't want the last three?") That's not this RFC. In this RFC you know how big your selection is if you want that part. It would also be very expensive to implement — you could say "*-10000"; implement that.

I'm off track now. I don't have the time to go into it deeply, but I made the star dynamic. Indeed, using a star with 10,000 could be very expensive in checking, but if you see the star and implement it as a flag — end of data, horizontally or vertically — you don't have to check all the records: you just check the known numbers in a bitmap, and with a star, anything over that number means "to the end of the data". So if you have "1-2;4;7-*;12", you just ignore the 12 because it's already covered. So there are some optimizations. I won't show the code. The fragment is a specification of what you want in, not of what goes out. That's the RFC, so it's fixed.

Then the feedback. I use Twitter a lot — Twitter is a nice medium, if you don't know those people. And I said: well, I have now implemented it — look at the date, the 21st of January, so it was only three or four days later — and it is now in the Perl module, released. For the win.
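The RFC 7111 selections being discussed, sketched with the fragment attribute of the csv function:

```perl
use Text::CSV_XS qw( csv );

# Rows 1-2, row 4, and row 6 up to the end of the data
my $rows = csv (in => "file.csv", fragment => "row=1-2;4;6-*");

# Column 4 to the last column: skip the first three, however many there are
my $cols = csv (in => "file.csv", fragment => "col=4-*");

# A cell range in the RFC's cell-based style: cell (1,2) through (6,2)
my $cells = csv (in => "file.csv", fragment => "cell=1,2-6,2");
```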
"That was quick, thanks for supporting it." "Awesome — sounds like other languages are needed too." "This is pretty flipping cool, thanks again." So this also motivates people to use Perl, because it's fast, it's easy, it's simple. ("Because you implemented it fast.") Yeah, and it was very easy to do. ("Good for you.")

I'm missing a statement here — I think there's an HTML problem, because this as shown is not legal; I can fix that. You can just say out, in, and fragment: you have an in file, a csv call with a fragment, and it writes the result out. So with one call you can produce an output fragment of an input.

Callbacks — I'm on time. Say you have a CSV file with the workshops: the date it starts, some identification number, the description, and the city, with the country only optionally filled, because either you don't know where a workshop takes place or nobody ever specified a country. And what I want is: an ID, the start date, and the event name — look, I've also got headers — and the city with the country in parentheses behind it, to show on the interwebs.

The old way: we make a new with binary, auto_diag, and eol. We open the file. We print the new headers, and while we getline those rows — fields 0, 1, 2, 3, 4 — I look at the country in field 4; if it's filled, I append it to field 3. It's not really pretty, but it works, and it's, yes, seven lines. Not hard to write, and this is what comes out. This could also all be on one line; I've stretched it.

So what I want instead is an array of arrays with an in of that file — and I've got callbacks now. Callbacks are new: they are only in there since 1.05, and I will release 1.06 today. There are a few callbacks; I will go through them.
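The "old way" seven-liner, reconstructed as a sketch from the description (field layout as described: date, id, event, city, country in fields 0 to 4):

```perl
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\r\n" });
open my $fh, "<", "workshops.csv" or die "workshops.csv: $!";
$csv->print (*STDOUT, [qw( id start event location )]);    # the new headers
while (my $row = $csv->getline ($fh)) {
    $row->[3] .= " ($row->[4])" if $row->[4];       # append "(country)" to city
    $csv->print (*STDOUT, [ @{$row}[1, 0, 2, 3] ]); # id, date, event, city
}
```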
One of them is after_parse. The moment I have parsed a line — with all the attributes applied, like allow_whitespace, allow_loose_escapes, whatever — I have an array of fields internally, and I can do something with it through a callback. The function that is called as the after_parse callback has two arguments: the current CSV parser, which you can use, and the row reference. So I do the same thing here: if field 4 is filled, append it to field 3. This is called from XS: a Perl function invoked from within XS, at the lowest level, just after parsing, right before the row is returned to my caller.

And I use the record number — I don't know if anybody has seen that; it is also relatively new. CSV can have embedded newlines, so the parser reads per line, but a line doesn't have to be a complete record: with embedded newlines there can be two lines for one record. So the CSV object keeps your record count, and I unshift the record number in front of the row. Then I have a fragment with the first two columns and columns four and five, so I skip this one and that one. That in is all I need to read this file and convert it into the data I want, without the headers. Now I do csv: in is that array of arrays, out is *STDOUT — no need to open a file or take a reference, this is supported — and I want headers, these headers. And that comes out. Pretty intuitive.

("Is the fragment applied before the after_parse callback?") No. after_parse happens right after parsing, and its result is what gets fragmented. And that's documented. ("Otherwise those column numbers four and five... so after_parse should return the row?") That's a good question. So it parses — and because you put a new item at the beginning of the row, you have a new column.
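The pipeline being discussed, as a sketch: after_parse runs on each parsed row before the fragment is applied, and record_number is a method on the parser object:

```perl
use Text::CSV_XS qw( csv );

my $aoa = csv (
    in        => "workshops.csv",
    fragment  => "col=1-2;4-5",        # applied to the row as modified below
    callbacks => {
        after_parse => sub {
            my ($csv, $row) = @_;
            $row->[3] .= " ($row->[4])" if $row->[4];   # city (country)
            unshift @$row, $csv->record_number;         # prepend the record count
        },
    },
);

# Write the result to STDOUT with fresh headers; no file handle to open
csv (in => $aoa, out => *STDOUT, headers => [qw( id start event location )]);
```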
And the fragment is applied afterwards. The callback effectively hands back the row — if you didn't, you'd be throwing your changes away and there would have been no point in putting anything in there. So you have a new row, and your fragment needs to take the new column into account: the fragment works on the modified data. ("Shall we put a comment in then?") Maybe. Maybe. You probably don't even want to go that far, because there's more. ("Are there any additional callbacks?") Additional? Yes, coming up. Don't feed the tables.

Okay, this is an extension on the callback. This is what you get from a transaction logging system: the transaction, the quantity, and the product. Having "product A3" is not nice to present to someone. You want to make an invoice, and the invoice should have the transaction ID, the quantity, the product — which is the black tin beer instead of product A3 — and a price and a total price based on the quantity and the product. So in the after_parse you push and splice. The first argument is the CSV object and the second is the row, so $_[1] is the row, just to keep it short. It doesn't read nicely — I've thought about nicer aliases for it, but no, it's confusing. So: push — and this, from a module, gets the description and the price for the quantity and the product, which I remove from the list — and I push the new fields that come from there: the beer name, the price, and the total price. Which means that you can back your callbacks with a module, do very complicated stuff, and just use it in a callback.

See, I've now also combined all of that into one csv call: the out with the headers there, the in file is itself a csv call — that's the reading with the callback — and the out brings it to standard out. ("Can you do stuff like add lines, not just add rows? Can you play with the lines themselves?")
("Because I've worked with CSVs where they're too lazy to have several rows for IDs: they'll have one row with "ID,ID,ID" as a single value in the CSV itself.") You mean add rows instead of columns? ("Yeah, add rows instead of columns, or reduce rows.") I understand what you mean. ("Oh, it's fucking nuts what people do with CSVs.")

("The header skip — what if the values are there; can you indicate the number of header lines? Sometimes it's three lines or something. What if I want more than one header row skipped? Can you have a callback to decide what the headers are?") Yes and no. This is the example for skip. ("Can you indicate how many lines there are?") No. It's either skip — meaning: I don't care what the original headers are, but there are headers, so just ignore them. In a CSV file the first line can hold a title for each column; it is presented here as a spreadsheet, but it's like transaction, quantity, product: the first line that is no data content. By definition that is only the first line. You can skip it, ignore it, and you can also — but that's probably the next slide. I still have plenty to show, so I need to hurry now.

So you can use this with modules, and it can be nicer. I've now got the same purchase.csv as my input file, with headers => "auto", which means I don't get an array of arrays back but an array of hashes, and every field has a name. In the callback, the first argument is my CSV object and the second argument is a hash reference. So I hand that hash reference to the routine, and it can look at $hr->{quantity} and $hr->{product}, replace $hr->{product} with the product description, and add elements to the hashref — the total price, whatever I want. So this one accepts a hashref and does with the hashref whatever you want.
So: csv with an in from a standard call with only two arguments — a file name and headers => "auto", which means I want hashes. You can also supply headers => "skip", which ignores the header line, and then you get an array of arrays. And you can also pass an anonymous list of names, in which case you supply your own headers on input: if the input has no headers, but you want it to have headers and you want an array of hashes, you pass an anonymous list of header names and each row will be made into a hash.

So this returns an array of hashes, and the in detects an array of hashes. And now there is a sub on "on_in", which is a new callback at a new level. After parsing you can still have the after_parse callback — the lowest level, just after parsing; then it builds your array of hashes or fields; and then, if you pass on_in, it processes that. It means: I'm done with the parsing, I want a callback on that row, whether it's a hash or an anonymous list. So it passes you the hashref. out is still standard out, and my new headers are these. Those headers are just entries in the hashref, so they have to match: if this was "transaction" and you want "TRD", you have to take care in this routine that "transaction" is replaced with "TRD"; an entry that is unknown will just come out empty.
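A sketch of the hash-based variant described here: headers => "auto" keys each record by the header line, on_in receives the parser and the hashref, and ProductDB::lookup stands in for whatever module supplies descriptions and prices (a hypothetical helper):

```perl
use Text::CSV_XS qw( csv );

csv (
    out     => *STDOUT,
    headers => [qw( transaction quantity product price total )],
    in      => csv (
        in      => "purchase.csv",
        headers => "auto",               # first line supplies the hash keys
        on_in   => sub {
            my ($csv, $hr) = @_;
            # ProductDB::lookup is a hypothetical module function
            my ($desc, $price) = ProductDB::lookup ($hr->{product});
            $hr->{product} = $desc;                   # "Black Tin beer", not "A3"
            $hr->{price}   = $price;
            $hr->{total}   = $price * $hr->{quantity};
        },
    ),
);
```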
More questions? Yes? ("Linking with what you just said: how would you rename a column — use a different header? Like you say TRD, but you're using transaction in both.") If you want TRD here, you can just say $_[1]->{TRD} = $_[1]->{transaction}: you copy the key inside the hashref. It's a hashref, so you can do with it whatever you want, and this is at Perl level, not at XS level. ("And then the second headers parameter would say the TRD you just created?") Yes, you just replace this with TRD. Simple.

("What ideas do you have for chunked processing?") For chunked processing — which would otherwise mean the entire file — use a fragment. That will do it, because it only returns you the chunk you want. The fragment still reads line by line, and it skips everything the fragment excludes. So if you've got a huge file and you only want lines 10,455 until five lines later, you can use a fragment and it skips the rest. ("I'm very curious how that works under the hood, because you still have to parse everything.") I can't show that in two minutes. ("Will csv read the whole file into memory first?") No: it reads line by line, and if no error occurs, each line is added to the result. So yes, in the end it holds the complete file in memory, but it does not slurp the file first; it reads line by line until the end and accumulates.

Two other things I didn't show: there are two more callbacks. There's a callback before_print — also at the lowest level, at the moment of output, just before it — where you can still do something with the row: altering the end-of-line sequence, or adding what you want, say putting the ID on the first line and the rest on the next line; that's possible. And there is a new callback which there is no time to talk about here — just read the docs — that's on error. The most frequent question I got in the last half year is: I got that error and I want to ignore it. Yes — I got that error and I want to ignore it, because I
know there is an error and I just want to skip it and go on. ("What do you get about the error?") You can — that's the countdown, so I can see how much time I have left — you can set a callbacks entry for error, and the advantage of the error callback is that there is no slowdown at all, because it only runs when there is an error. What you get in the error callback is the error itself: the line number, the CSV object, the field number, the reason of the error. You can say: if my error is 2023, I know that I expect that error — just reset this error and go on. And that's all in the documentation. Thank you.
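A sketch of the error callback closing the talk, following the pattern in the module's documentation; 2023 is the error code used as the example above, the argument list is an assumption based on that documented pattern, and SetDiag (0) resets the error state:

```perl
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
$csv->callbacks (error => sub {
    my ($err, $msg, $pos, $recno, $fldno) = @_;
    if ($err == 2023) {          # the error I know and expect
        $csv->SetDiag (0);       # reset it and carry on; no slowdown otherwise
        return;
    }
    # anything else is still reported as fatal
    die "record $recno, field $fldno: $err - $msg\n";
});
```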