So, now we will see about UDFs — how to write them and how we can use them. UDF stands for user-defined function. These are separate functions which you can write and which can be invoked from inside Pig. Why do we need UDFs? One reason is that there are certain operations you might want to do not on one field but on quite a few fields together. One typical example: say you have employee data with a start date and an end date of a contract, and you want to find all the people who worked for less than 30 years. How do you find that? You cannot filter it out using the built-in functions alone; you have to take both fields and compute something from them, and that computation can be offloaded to a UDF. So scenarios where you have to operate on more than one field — that is one place where you might use a UDF. The other reason is when you want to do more than filtering and grouping: custom filters like the one with dates where some computation decides the result, or simple things like capitalising all the letters, string manipulation, or stripping some characters from the beginning and end of a column. For those kinds of scenarios too you would use a UDF. And the third reason is reuse: if you already have a legacy system and the business logic is already written there — maybe in that older system or in your enterprise software — then, provided it is modular, there is no point in rewriting it from scratch.
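The "more than one field" example above can be sketched in plain Java — this is the kind of computation you would later put inside a UDF's body. The method name, the ISO date format, and the 30-year threshold are illustrative assumptions, not from any real schema:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class TenureCheck {
    // Hypothetical helper: true if the span between the two date fields
    // is shorter than the given number of years.
    public static boolean workedLessThanYears(String start, String end, long years) {
        LocalDate s = LocalDate.parse(start);   // expects ISO format, e.g. "1990-06-01"
        LocalDate e = LocalDate.parse(end);
        return ChronoUnit.YEARS.between(s, e) < years;
    }

    public static void main(String[] args) {
        System.out.println(workedLessThanYears("1990-06-01", "2010-05-31", 30)); // true
        System.out.println(workedLessThanYears("1980-01-01", "2015-01-01", 30)); // false
    }
}
```

Given two date columns, a UDF wrapping this logic could then be used directly in a FILTER clause.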
There is no point in redoing the same work again and again. The other reason is programmer comfort: you might have been using Java for the last 5–6 years and be more comfortable with that. So, traditionally — before release 0.8 (the current version is later than that) — you could create UDFs only in Java. But now you can actually use languages like Python and other languages as well. I think in 0.8 Python support was added, and in 0.10, I guess, support for dynamic languages — so you can do Ruby, JavaScript and other things. Basically the dynamic languages which can run on the JVM: Python via Jython, Ruby via JRuby, and JavaScript. And there is also streaming, for which you don't need things to be on the JVM at all — you just need some executable which can read from STDIN and write to STDOUT. So, there are multiple types of UDF which you can write. One is the eval function, then you have filter functions, then you have load functions and store functions. What is an eval function? An eval function is something which takes some input and gives an output. The input could be anything: it could be a tuple, it could be a single field, it could be a bag, or it could be any data type. The output could also be the same — a scalar type like a string or an integer, or a tuple, or a bag. You write an expression which takes inputs and basically returns something: that is eval. A filter function is a special case of eval which returns a boolean; since it returns a boolean, you can use it in a FILTER clause or in a comparison, and use it for filtering. And load and store functions are used in cases where you are loading data or where you want to store something in a custom way. So first we will see eval functions. Eval functions are generally used in a FOREACH.
So, in a FOREACH you use GENERATE, and in the GENERATE you may want to pass in a field, or the entire row, and things like that — in the GENERATE you invoke the UDF. This is the most common type of UDF; pretty much most of the UDFs which are written are eval functions. A question came up: "For the earlier discussion, where a boolean mechanism is available — can I embed that in, say, JavaScript or Python, directly here for the eval?" So, when I said inline earlier, I meant embedding Pig inside a host language — that is one thing. The other thing is streaming: streaming is where you spawn a process around an executable and pipe data through it. The third is UDFs. These are three different things. You can write UDFs in JavaScript, but you cannot embed the JavaScript inline in the middle of the Pig script — Pig will not understand that. Passing data to an external program only works through streaming: if you have an executable which can read from STDIN and write to STDOUT, then Pig's STREAM operator can use that executable. So, UDFs can be used like any other function. They can be used on a particular field; they can be used on a particular relation; they can be used during load time or during eval; you do not have to declare in advance where they apply. They can return simple types or complex types. And this is how you generally use one: you say FOREACH some relation GENERATE yourpackage.YourFunction(...), and then whatever you want to pass — a single parameter, multiple parameters, or a whole tuple; you can pass the relation's row itself, and that comes in as a tuple. So, let us see one thing: if I want to reference the entire row — the entire tuple — in the function, how do I reference it?
We just pass in a star: GENERATE YourFunction(*). You can also mix the star with other fields in the invocation. So, there are certain things which you need to do to create a UDF — this part is specific to Java. Pig provides an abstract class called EvalFunc, which is available as part of the Pig jar. It is a generic class, which means that you need to specify the type parameter, and the type parameter here is the return type. Your UDF is going to return something back — whether it is a string, an integer, a tuple, a bag, whatever — and that is the type you give it. And the input always comes as a tuple; you should remember this. Even though you pass only one column — in the previous example we passed only one column — or you are passing two columns, all of that is put into a tuple, and you will get the input as a tuple. So you need to do your own validation there: find out whether the tuple was passed properly, check for empty and null and all that. That you have to do. And then there is a method called exec, which you must implement — it is the abstract method — and that method should return the actual value which you want to give back to Pig. Then there are two more methods, and these two are optional. One is getArgToFuncMapping — a really funny name. What it is used for is to map the input types. The input, as I said, is always a tuple, but the tuple should follow some pattern: say the first field is a string, the second is an integer — some ordering in that tuple. If you want to declare that, you override this method. And the same thing in the other direction: when you are writing something back, if instead of a scalar value you are returning a tuple or a bag or something, then Pig should know what type you are going to return. For that you override outputSchema.
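The exec contract just described — input arrives as one tuple, you validate it defensively, you return the declared type — can be mimicked in plain Java without the Pig jar. In this sketch a `List<?>` stands in for Pig's `Tuple`; it is only a model of the validation pattern, not Pig's actual API:

```java
import java.util.List;

public class UpperSketch {
    // Stand-in for EvalFunc<String>.exec(Tuple input): validate, then compute.
    public static String exec(List<?> input) {
        // Defensive checks Pig leaves to the UDF author: null tuple,
        // wrong arity, null field.
        if (input == null || input.size() != 1 || input.get(0) == null) {
            return null;            // returning null marks "no result" for this record
        }
        try {
            return input.get(0).toString().toUpperCase();
        } catch (Exception e) {
            return null;            // never let an exception escape the UDF
        }
    }

    public static void main(String[] args) {
        System.out.println(exec(java.util.Arrays.asList("hello"))); // HELLO
        System.out.println(exec(null));                             // null
    }
}
```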
These two schema methods are optional if you are just going to take one string and return a string. But if you are going to accept a structured tuple, then you need to override the first; and if you are going to return a tuple or a bag, you need the second. A question: "You said the input is always a tuple, because you can have one argument or whatever?" Yes, it is always a tuple. But what is the structure of the tuple — whether it has three columns, whether the first is a string, the second an int, the third an integer? If you need to give that structure to the UDF machinery, then you override getArgToFuncMapping. So, you will be writing some Java code. The other thing is: after you do this, how do you invoke the UDF? After you create the UDF classes, you package all your classes into a jar file. Then you put the jar file in the class path and register the jar at the top of the Pig script. There is a statement called REGISTER which takes the path of the jar, and you just need to call that. It is like an import — like the hash-include which you used to do in C. You just declare it. And then, if your UDF is going to depend on some other jar files — let's say you are using the Apache Commons string library or something like that — then you need to register those jar files as well. And then you can DEFINE an alias for your UDF; this is like a function declaration, which is optional. So I have a small piece of code — a UDF function which I wrote. The reason we are going to use this function is that if you actually look at the tweets, as I told you, the text has quotes in it. I want to extract only the text of the tweet, but it has the quotes, so I want to strip them off.
So, to use it you just need to build a jar. Which Pig jar do you compile against? Inside the lib directory I have the Pig jar — the one you downloaded. "The withouthadoop jar?" There are two jars now: pig-version.jar and pig-version-withouthadoop.jar. It should work with the withouthadoop jar, but this particular one also works. One thing: you should not package the Pig jar inside your own jar file. Now, a comment about the UDF itself. The name of the class becomes the name of your UDF function — whatever class name you use here is the name you call in the script. It has to extend the EvalFunc class, and since I am returning a String here, I extend EvalFunc with String as the type parameter. And I have overridden two methods: one is the exec method, and the other one is the getArgToFuncMapping method. Technically I don't need to do the second one in this case, because a single-string input would be handled anyway; but I did it so that when you do a DESCRIBE or something like that, to reflect on the schema down the line, it will be easier. Technically you don't need to do it, but it is a good practice. So, the exec method is going to get the Tuple input. As I said, it is up to us to check if the input is null, or if it is empty, or if the size is proper, and all those things — we should do that as a precondition generally. And then I am converting input.get(0) — this is a tuple with only one field — to a String, and I am passing it into the strip function; this is basically the Apache Commons string library. Basically it is going to strip whatever character I give from the front and the end of the string.
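The stripping step itself is simple; commons-lang's `StringUtils.strip(str, stripChars)` behaves roughly like this hand-rolled version (a sketch for illustration, not the library's actual source):

```java
public class StripSketch {
    // Remove every leading and trailing occurrence of the characters in
    // stripChars (here we only need the quote character), similar to
    // org.apache.commons.lang.StringUtils.strip(str, stripChars).
    public static String strip(String str, String stripChars) {
        if (str == null || str.isEmpty()) return str;
        int start = 0, end = str.length();
        while (start < end && stripChars.indexOf(str.charAt(start)) >= 0) start++;
        while (end > start && stripChars.indexOf(str.charAt(end - 1)) >= 0) end--;
        return str.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(strip("\"hello @bob\"", "\"")); // hello @bob
        System.out.println(strip("no quotes", "\""));      // no quotes
    }
}
```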
So the quote character gets stripped off from the front and the end of the string. And then, you need to catch all exceptions. ClassCastException, for instance, is not mandatory to catch, but you should still catch it here. You should always make sure that you don't throw any exception out of the UDF, because if you throw an exception, your Pig job might fail. So make sure you catch all your exceptions. There is another slide where I will tell you what logic you should follow for error handling in a UDF; for now, remember that you need to catch all your exceptions. "So the contract seems to be that you should not throw any exceptions — catch all of them, right?" Not just that — there is a reason I didn't throw any exception, and we will talk about error handling later. But remember that you need to catch all the exceptions. Even the unchecked exceptions which Java does not force you to catch, we are still catching here. So that is an easy eval function. And then there is getArgToFuncMapping. What we are going to get inside the tuple is just one string. So we define a new Schema, we add a FieldSchema to it, and we say that its type is chararray — a string. And then there is boilerplate code which you have to write: it basically creates a FuncSpec list based on this schema. This is Java, so you can expect a lot of this boilerplate. Once you do this, what you need to do is create a jar with this file.
If you are using Eclipse, you can just right-click and create a jar. Or, if you are not using Eclipse, you can use the jar command, or write a small script to create the jar. If you look at my folder structure, what I have done is that I have the UDF project inside this directory, and I have created the jar in the lib directory. "In the first slide, the Pig jar is not actually registered — do we have to register that too?" No, because the script doesn't know which jars your UDF is using. "Then why are we adding the commons-lang jar?" Because inside the UDF I am using it. "Does the UDF jar itself contain commons-lang?" No, the UDF jar will not contain it — it only contains my classes. The lib directory is there just so that Eclipse can compile; when I created the jar, only the UDF classes went in. So we need to call REGISTER. The REGISTER statement just takes the path of the jar — relative or absolute. After we register the jars, we can use something called DEFINE. This DEFINE is somewhat similar to an import: you define what the function should be called, so that you don't have to use the full package name when you call the function. And then we load the file — this is the normal stuff which we used to do. Strictly, you can call the function without the DEFINE, but sometimes people complain that there are some odd combinations for which it doesn't resolve, so it's better to do it. And here is what we are calling: we are using FOREACH, and this is the UDF.
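Assembled, the head of such a script looks roughly like this Pig Latin sketch — the paths, jar names, package name, and column number are illustrative assumptions, not the actual files from the session:

```pig
-- make the UDF jar and its dependencies visible to Pig
REGISTER lib/mytwitterudfs.jar;
REGISTER lib/commons-lang-2.6.jar;

-- optional alias, like an import, so calls don't need the full package name
DEFINE StripQuotes com.example.pig.StripQuotes();

tweets  = LOAD 'tweets.csv' USING PigStorage(',');
cleaned = FOREACH tweets GENERATE StripQuotes($8);
DUMP cleaned;
```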
And then we are passing in the 8th field. So if you actually look at the data file, the tweet text is that column. For each row we are calling the function on it and getting the stripped text back. "Did you get this particular data by using the Twitter API?" No — Twitter provides a way by which you can export all your tweets. These are my tweets. If you go to Twitter settings, you can request an archive of all your tweets, and they give it to you. Alternatively you could use an application that queries the API, but the problem is this: even if you query your own timeline, you only get the last so many tweets — there is some limit, you only go back a certain number of pages, even for your own tweets. With the API you can't query all of them; the archive is like a dump, it gives you everything. "So do I need to pay money to use the Twitter API to get all the tweets — say I have an application providing some service to someone?" They recently changed the API, and they don't give the firehose to everyone. You need to become a partner with Twitter — they only give it to a handful of partners. With the firehose you get the real-time feed of all the tweets. But it's not something you can just get through an API signup; you need to be a partner, you need to sign a deal, and it costs some huge amount. But we can get our own data, and a reasonable amount of data through the normal interfaces.
Say you have a restaurant and you are trying to mine mentions of it — doing some sentiment analysis; for that the API would be the way, and they recently changed those APIs quite a lot. Similarly with the Twitter API — it used to be used a lot, and one by one they restricted things; recently there was some API terms change. Okay, so it has given us the tweets — and yes, there are no quotes. So, any questions? One question: "This jar that we generate as part of our project — where does it need to be? Can it be anywhere?" It can be anywhere; you just need to give the path. I showed you that I have put it in this relative directory, so I am giving a relative path. You can give a relative path, you can give an absolute path — anything is possible. Pig doesn't care where it is, just that it needs to know where it is. If you are running in cluster mode, what people generally do is put the jar in HDFS rather than on the local file system. That helps, because then it can be put in the distributed cache and shipped across all the nodes. So people generally put it in HDFS. Okay. Any questions on the UDF, or on what we do in a UDF? "If you have a Pig script that invokes a UDF, does the data automatically get passed along?" Let me explain what actually happens: from your script, you are going to pass some data of some type.
What Pig does is enclose that data in a tuple and then send it to the UDF. In the UDF, you know how to take it apart — you basically split it up and work on it. But internally there is an engine running this, and that implementation needs to know what the data types are, so that it can optimize better. For that, you tell the engine the mapping between the tuple and the actual types — you basically define the schema of the input which you are getting. That's what you are doing here. You create an empty Schema, and then you add to it a FieldSchema: the Schema is the parent, and the FieldSchema you add to it is of type chararray. And then you basically have to return that from this method, but the method expects a list of FuncSpec. FuncSpec is a class, and you need to return a list of it. So what you do: Arrays.asList basically takes an array and converts it into a list. And then you create a FuncSpec — you have only one FuncSpec here, because you have only one input signature — and you pass it in. You basically create a new FuncSpec using the schema which you created. So: you created a Schema; that Schema had a FieldSchema; you have a Schema object; with that Schema object you create a FuncSpec. Conceptually, the FuncSpec keeps the schema which you created. You need to also give a name to it. The name is not used much anymore, but you need to give one — if you go to EXPLAIN on the Pig script, in the execution plan you can find the schema name. So for that you need a unique name.
Since you need to name it uniquely, the Java way of unique naming is this.getClass().getName() — basically you use the current class name. So if you take FuncSpec, it takes two parameters: one is the name and the other is the schema. That's what you give in here. This is specific to Pig — specifically for UDFs. "And who calls this?" The Pig engine which is executing the UDF. "Will the UDF itself be converted to a map-reduce job?" No, the UDF will not be converted to map-reduce. The UDF will be used from within the map-reduce job: the jar will be pushed to all the machines, all the nodes, and on each node the map-reduce task will call it. It stays a jar. "Let's say I have another UDF where I take a name and split it into first name and last name — how will this work?" In that case, you would basically return a tuple. We'll see that — it's the next example. So far we took a string and returned a string; in the next example we will take a string and return a bag of tuples. For this one, think about replies: when you reply to somebody on Twitter, you have the person's handle in the text, right? So now let's find out who are the people I have replied to so far. For that, what we need to do is first find out whether a particular tweet has any @ in it; if there is an @, then we need to find out how many there are. We don't know how many there will be — there could be zero, there could be one, there could be two. That's why we are returning a bag.
We don't know in advance — otherwise you could return a single string, but you don't know how many there are going to be. That's why you use the bag and tuple factory instances. So let me show you. This is the eval function — the same EvalFunc class which you need to extend, and instead of a string, you specify DataBag as the type parameter, so it returns a bag. And this is the same exec method as before, but here I am going to return a DataBag. The input is going to be a tuple, with the same checks: is it null, is it empty, is the first field null — all those validations are there on the tuple. So let's first look at the actual logic, and then we will see how the DataBag works. Here what I do is use the built-in StringTokenizer: it takes a string and a delimiter, and it gives you the sequence of tokens, which I can iterate over — standard Java. So I iterate through all the tokens. I take the current token and call a helper which basically checks if there is an @ in that word. I take each word and then I see if the word contains an @. It's not quite the right logic — strictly I should check that the word starts with @, not just contains it — but it works for this data. And then, if it is there, I have to collect it. For that, we basically have two factories: one is called BagFactory and the other one is TupleFactory. To return the user names, I can't just return bare strings; I need to put each one in a tuple, make a bag of those tuples, and return the bag. So we create a bag, and then a tuple for every user name which I found. "Will every tuple have only one column?" Yes, every tuple will have only one.
"Can a tuple have more than one column?" Yes, it can. So, these are factories — BagFactory and TupleFactory have default implementations which just create instances. And this is how I create a new DataBag for the output: I say BagFactory.getInstance() and then create a default bag. And then what I do here: if that particular word is a Twitter username, I add it to the output. So this is the bag; then I have to create a new tuple, and to the new tuple I pass one string — so this is a tuple with one string. You create that tuple, and then add it to the DataBag. So, as long as any number of words match, they will be added; if nothing matches, the bag will be empty. Then I return the bag. But there are two more things — the two schema methods. One is getArgToFuncMapping, which maps the input, same as before: we create a Schema, we create a FieldSchema of type chararray, because we are taking in only one string. That part is the same code which we had earlier. And the other is outputSchema, which is used to describe the value which is given out. So what are we giving out? We are giving out a bag with tuples inside, right? So that's what we're creating: an outer schema for the bag, and inside it a field schema for the tuple, and inside that a field schema of type chararray. So there are two or three schemas which we've created, nested one inside the other. It's basically a DataBag containing a tuple with one string — you have to spell all of that out even though you're really just returning a bag.
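The token-scanning logic inside this UDF's exec can be sketched in plain Java; here a `List<String>` stands in for the bag of one-field tuples that the real UDF builds with BagFactory and TupleFactory (a simplified stand-in, not Pig's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MentionSketch {
    // Collect every token that contains '@' -- the same loose check used
    // in the lecture (strictly it should test startsWith("@")).
    public static List<String> mentions(String tweet) {
        List<String> out = new ArrayList<>();     // stands in for the DataBag
        if (tweet == null) return out;
        StringTokenizer st = new StringTokenizer(tweet, " ");
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            if (word.contains("@")) {
                out.add(word);                    // the real UDF wraps this in a one-field Tuple
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mentions("@alice thanks, cc @bob")); // [@alice, @bob]
        System.out.println(mentions("no reply here"));          // []
    }
}
```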
"So why are we complicating the schema so much?" One reason is the way they designed the internal logic — they were thinking of having a very rich schema model in the early days, so there is some legacy there. Otherwise you could just say, "this is going to be a bag," and be done with it. The second reason is: let's say you're debugging something and you want to see the structure of a relation. The system then needs to tell you exactly what the data looks like, down to the field level, so the schema description has to be complete — and that implementation becomes complicated. That's why we are writing this very complex thing just to say that we're returning a bag. So they have abstracted it out: there's a Schema, there's a FieldSchema; you create a field, define a type for it, then add it to a schema, then create another class called FuncSpec, and then return a list of that. So it became very verbose. Right now, in the FuncSpec, we only have a bag — that's the whole thing. But that bag — what is the definition of the bag? It holds a tuple. So for that, we need to create a field schema; inside that, another schema. The first schema is the bag; in the bag we have a tuple; in the tuple we have a chararray. That's why it's a little involved. And one other thing which I personally dislike about Java is that you end up creating a lot of this boilerplate. Everything becomes an object; they create factories and objects, and the same pattern gets used extensively. In some places it makes sense, but there are other places where it really complicates things. That's my personal view. So this is the Pig script which we have — the same thing: we are registering those two jars. And yes, we need the commons-lang jar because we are using it in the UDF.
So we are defining both the UDFs. And then what we are doing: we are taking the tweet, stripping the quotes off it, passing it to this mentions function, which will determine the handles. And then we are dumping it. It should give us a list of all the people mentioned — and it does. Looking at the output, these seem to be from the early days of Twitter — for me, at least. We could also print the tweet alongside the mentions, so we have a better understanding of what we got. Right now, tweets with no mentions produce an empty bag; we could just add a filter, and then maybe also add a GROUP BY to find out who the people I tweet at most are. Any questions on the script or the UDF? "Can you share this code?" Yes, I am sharing it — I will put it up, including the data, so you can copy it. I will put it up in the proper directory structure also, so you don't even have to change the paths. "You have these little examples for most of the things we are doing — do you have multiple such scripts?" Yes; all the scripts I showed you, I will be publishing. And if you need very specific examples of things — like the question earlier about running on top of the local file system — those I may not have, but whatever I have shown today I will share. So, any questions? I know the session after lunch was a little heavy. So, about this UDF —
I selected this example because I wanted to do something which is as minimal as possible, but at the same time shows the functionality. This is the simplest example I could get which is still a complete example — there is no point in just writing one or two lines. "One question: the exec method seems to have almost the same format every time — you probably copy it over every time you write a UDF. Is exec an overloaded method where you can return whatever you want and it will match? If you look at the definition of exec, what is its return type? In this case we returned a DataBag." You are overriding an abstract method from an abstract class. EvalFunc is an abstract class and exec is an abstract method, and the class uses generics — it is like a template. So the return type of the exec method is the type parameter of EvalFunc. You can actually go to the definition of the class and see this. Since it is an abstract class and an abstract method, you have to implement it, and whatever you put in the type parameter is what exec returns. "Right — I was thinking DataBag was special, and I didn't realize that whatever you put in the template parameter is what exec returns." Yes, that's how it's written — sorry about the confusion. So, as I said, the input is always a Tuple for exec; that part of the definition doesn't change. What changes is the return type. And you could pass in multiple parameters from your Pig script — all of them will come in inside that one tuple. "Can you go down a little bit to the other methods that you have overridden — outputSchema, for example?"
The outputSchema method, on the other hand, does change; it changes based on what your UDF returns. Here I am returning a DataBag which contains tuples, and each tuple contains a string, and you need to describe exactly that here. You also want to give a name to each piece, so that when you do a DESCRIBE on the result in Pig, the fields show up with readable names. So first I build the schema for the inner tuple; that is the innermost part. Then the outer schema represents the bag, and that is where we set the data type of the schema. There is a lot of reflection utility going on here: Pig reflects on the class and wires this up. Probably the naming here is a bit off; this one should really be called the tuple schema, and then you have the schema for the bag around it. [Question: does the UDF execute in a separate process or in the same JVM? Is there a reason for it to be separate?] It runs in the same JVM as the task; there is no reason for it to go out of process. The exception is streaming: with streaming you send data out over stdin and read it back over stdout, so that does go out of process. Otherwise eval UDFs stay in the same process, and that matters because these functions are called in an iterative manner, once per tuple, so crossing a process boundary on every call would be expensive.
So it doesn't make sense for it to go out of process. [Question: when you are looking at a lot of rows, the tuple that you receive here, is it fair to say that you are getting one row? What I am saying is: you have this entire dataset, and you are passing something into the UDF. Does the entire data that the relation represents get collected and sent to the UDF in one call, or is it given row by row?] I would think it goes in one row at a time, because each row has to be marshalled into a tuple and then sent in. I think it is row by row, but I may be wrong. [Is there a way to debug that by putting in some print statements, and would you see them in a log file that gets created?] You can't just dump from inside the UDF, but there is another way: you can write to standard error or standard out. There are also some logging functions being called from this class, so you could use those. [Where would that output show up? After you run the Pig script?] I don't think it will come back to the client. Remember the context: when you run your Pig script, the client is just submitting the job. We have a Hadoop cluster running, and the script gets compiled and sent over to the cluster, like we said. We will also talk about error handling and logging later; errors come back to the Pig front end. But whatever you print inside the UDF, I don't think it will come back to the Pig client.
It will probably show up on the Hadoop cluster, in the task logs, in the job tracker logs. So, next we will talk about filter functions. The eval function, as we saw, takes any input and gives any output: the input can be any data type and the output can be any data type. What a filter function does is take any data type as input, but it always returns a boolean. And these can be used in the FILTER statement. So the way you would use it is something like: FILTER data BY the UDF, passing in some fields. Those fields get packed into a tuple, the UDF returns a boolean, and that boolean drives the filtering. One important thing to remember is the relation between eval functions and filter functions in the implementation. For an eval function we were extending something called EvalFunc; that is the generic, abstract class. For filter functions there is a concrete specialization of it, where the return type is fixed, and that is what we use: something called FilterFunc. If you actually go into the definition of FilterFunc, internally it extends EvalFunc with the type parameter fixed to Boolean. So EvalFunc is the more generic, more abstract one, and FilterFunc is a concrete specialization of EvalFunc whose return type is Boolean. Apart from that, pretty much everything is the same for filter functions.
So you extend FilterFunc, which is an EvalFunc of Boolean, and your exec must return a boolean; otherwise it is the same as an eval function. Whatever logic you use, you should also check for empties and nulls in the same way. The argument-to-function mapping should also be done the same way. But here we don't need the outputSchema, because the output schema is already taken care of by FilterFunc: it always returns a boolean. You can actually look into the source code of FilterFunc and find out how they have defined the output schema. So for this, here is what we are going to do. I told you we have this big tweets CSV file, right? There are a lot of fields involved. There is one field which basically says which client you used to tweet. I have used multiple clients; I have even used my Vim editor to send tweets. So what we will do is find all the tweets which I tweeted from Vim. That is what this UDF is going to do. We just need to extend FilterFunc, and we have the exec function. In exec we get the input tuple. We take the first field of it, and if it is not null, we check whether it contains the client name. So here you can see the source field in the data: here I tweeted from my phone, here from something else; this is the source string for each tweet. Instead of matching the whole string, I am just checking whether it contains the word that tells us the tweet came from Vim. If it contains it, I return true; otherwise I return false.
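Stripped of the Pig FilterFunc plumbing, the exec body just described reduces to a null-safe contains check that can never return null, only true or false. Here is a minimal pure-Java sketch of that logic; the class and method names are invented for illustration, and a real UDF would receive the field inside a Pig Tuple.

```java
public class SourceFilter {
    // Returns true only when the source field is present and mentions
    // the client we are looking for.
    public static boolean isFromClient(String sourceField, String clientWord) {
        try {
            return sourceField != null && sourceField.contains(clientWord);
        } catch (Exception e) {
            // A filter cannot return null, so on any failure we return
            // false and the row is simply filtered out.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isFromClient("web via Vim plugin", "Vim")); // true
        System.out.println(isFromClient(null, "Vim"));                 // false
    }
}
```

Note the design choice: a malformed or null source drops the row rather than aborting, which matches the try/catch-and-return-false convention described next.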
So you might wonder: I told you that we should also check for nulls and empties and all that, and here I am not doing anything. The reason is this. Earlier, in the eval function, if the input was null we were returning null. But here we cannot return null; we must return either true or false. So instead I surrounded it with a try/catch. If there is any error, it will throw an exception, we catch it, and we return false, so that particular row simply gets filtered out. And the argument-to-function mapping, getArgToFuncMapping, is the same here; it is trivial, because we are just taking one string as input. [Question: that is for the arguments, but when we pass the tuple back, should we also describe what we are passing back?] Yes, that is what we did with outputSchema in the eval function. So this part is for the input of the function, and this is for the output: the input requires this list of FuncSpecs, and the output requires a Schema. But here, for a filter function, we don't need to give an output schema, because that is handled by FilterFunc itself. So now, the Pig script for this. We register the jar using the REGISTER statement. Then we load the data. Then, what I am doing is: I project out the source field, strip the quotes from it, and pass it to the UDF. The UDF gives me true or false, and based on that, I am filtering. So this vim_tweets relation will contain only the rows which were tweeted from Vim, but it will contain all the columns; the entire row, with all the columns, will be there. And from there, I am doing one more thing.
From that, I project out only the tweet text, not the entire row; you get only that one column. I could probably have done it in one single statement, but I split it up just to demonstrate the effect. Otherwise you could have simply done the filter and the projection together; I wanted a complete example rather than a compressed one. So this will give us all the tweets which I sent from Vim. Is this clear now? So this is one tweet per line, and you get the list of tweets. Now, are you able to find a problem in this output? I noticed this problem only today morning when I was testing. We have a smiley here and there, but that is not the problem; the smileys are fine. Look at the last line: it is cut off partway through. Can you see the tweet? What is missing? Actually, what happened is this: these are tweets which have commas in them. This tweet actually had more text in it, but the file is comma-separated, and in my Pig LOAD statement I was using a comma delimiter. So the tweet got split into multiple fields. [So how do I escape that? A UDF?] You could use a UDF to handle that, yes.
But I think there should be a cleaner way to do this, and even I am still trying to find a solution. I only realized it when re-running these samples today morning to verify everything, so I didn't have time to look for a solution, but I need to find one. [Comment: this is a problem we have been solving too; we have our own UDF for it, and it is very sensitive to what your source environment is. We read from Oracle and MySQL into the Hadoop cluster, and we do something like that.] So the first thing which came to my mind was to write a custom load function and use that. But this is a pretty common problem; there should be some standard way to solve it. [Why couldn't you use the TextLoader that they have?] For this we can't use TextLoader, because TextLoader just gives you each line as a single field; it doesn't tokenize the fields. This is a CSV file with something like ten or fifteen columns, everything comma-delimited, and commas can also appear inside quoted values. So either we write some custom logic, or there is some existing way which I have not figured out yet. A good CSV parser really has options for that; maybe there is a CSV loader available for this. [Question: is this problem coming from the way the file was loaded?] It is because I didn't realize it when I wrote the script; I was just re-running all the examples today morning to verify everything. [If the data is clean, you could actually store the tweets.] Yes. So this STORE statement is storing the entire rows, whichever evaluated to true.
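The comma-inside-a-field problem is usually solved by honoring quotes while splitting, rather than doing a blind split on commas. Here is a small state-machine splitter as a sketch of the idea; it is not Pig's loader (Piggybank has CSV storage functions that handle more cases, such as doubled quotes), and the class name is invented.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvSplit {
    // Splits one CSV line on commas, but ignores commas that appear
    // inside double-quoted fields. Quote characters are stripped from
    // the output. (Escaped "" inside quotes is not handled here.)
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;        // toggle quoted state
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());  // field boundary
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());          // last field
        return fields;
    }

    public static void main(String[] args) {
        // The quoted middle field keeps its comma: three fields, not four.
        System.out.println(split("123,\"hello, world\",vim").size()); // 3
    }
}
```

A naive comma split of the same line would produce four fields, which is exactly how the tweet text got truncated.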
I am storing the entire rows there, and this other statement is taking only the text. Now, above I told you that we should handle all exceptions in the UDF, and there is a reason for that. The problem is this: whenever you are loading some data, or processing some text file, there is a possibility that some rows might not follow the pattern, or might contain malformed data, or have some other problem in the row. There are scenarios like that. And there are other scenarios where there is some problem from which you can't recover at all; maybe the jar for the UDF is missing, something like that. So there are basically three types of errors which can occur. The first is where the problem is with that particular row, but you know for sure that the other rows will work properly; it is just that this particular row will not. In those cases, what you have to do is return null from exec. That is what the first check was for: we check whether that particular field is null, or whether the tuple doesn't have the required number of fields. Those are cases where it affects only that particular row; the other rows might be good. So in those cases we should return null. And this is the general, recommended behavior; the standard built-in functions also do the same. This is the error-handling convention. [Question: the tuple that is given to the UDF, is it for one row or for all rows?] One row. It is all the columns in that one row from the previous relation that you passed in. Yes.
And therefore it could be all the columns, or it could be just the one column which you passed, depending on what you projected. So that is the first case. If there is an error which affects only that row, return null. But if the error affects other rows as well, and you can recover from it, then you should throw an IOException. An example of this: let's say your UDF is using some config file, and for some reason it can't read it. Then any data which comes to that particular node will fail, but other nodes might be fine. In those scenarios you should throw an IOException. And there are other scenarios where you can't do anything at all, and you may not know whether things will work out; for those you should also throw an IOException. What happens in this case is: if the exception is thrown beyond a threshold, then all those failures will fail the entire job. [Question: how do you set the threshold?] That I think you have to do in your job configuration; it doesn't come from Pig itself. It is the same for any MapReduce job: if your map tasks keep throwing errors, there is a threshold beyond which the framework kills the entire job. The job configuration is created by Pig, but you can also set global configuration for your job. Hadoop has the master, the JobTracker, and the per-node TaskTrackers, and the job gets killed through that task-management machinery. So these two things are the main idea: if you feel the problem is only with that particular row, it is always better to return null; but if it could be a problem for other rows too, then you throw an IOException.
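The two-sided contract just described, null for a bad row versus IOException for a failure that will hit every row, can be sketched in plain Java like this. The names `SafeExec` and the `configReadable` flag are invented for illustration; in a real Pig UDF the input would be a Tuple and the config check would be an actual file read.

```java
import java.io.IOException;
import java.util.List;

public class SafeExec {
    // Sketch of the error-handling contract (names invented):
    //  - malformed row          -> return null; only this row is dropped
    //  - node-level problem     -> throw IOException; the framework retries,
    //                              and beyond a threshold fails the whole job
    public static String exec(List<?> row, boolean configReadable)
            throws IOException {
        if (!configReadable) {
            // A problem that will hit every row on this node, e.g. an
            // unreadable config file: signal it with an IOException.
            throw new IOException("cannot read UDF config");
        }
        if (row == null || row.isEmpty() || row.get(0) == null) {
            // Malformed input affecting only this row: return null.
            return null;
        }
        return row.get(0).toString().toUpperCase();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(exec(List.of("hello"), true)); // HELLO
    }
}
```

Usage-wise, the framework treats the two signals very differently: a null output silently skips one record, while repeated IOExceptions escalate to task and then job failure.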