Our next speaker is S Anand. He'll be talking about exploring network clusters using Python. So, Anand, over to you. Anand, over to you?

Oh, sorry. I missed the cue. Thanks, Kalyan. Okay, let's dive in. What I wanted to do was actually talk a bit about movies, to be honest. We'll come to the Python part of it in a short while. But my story actually starts with Govinda. See, it was in 2000 that I first heard of Govinda. It might have been '98, somewhere around there, when I saw Bade Miyan Chote Miyan with Amitabh Bachchan and Govinda. And I was like, who is this guy? His style of comedy was completely crazy. After I saw that movie, I was kind of hooked. I'd say, oh, that's Govinda, right? But while I could name every single movie that Rajinikanth starred in, I couldn't have named a single movie with Govinda. That's when I learned about the Govinda and David Dhawan combination and so on. And one piece of news really caught my attention in the recent past. I'll talk to you about that. See, it turns out that in Aap Ki Adalat, Govinda made a statement that he was once offered a role in Avatar, the movie, presumably as the hero. So we said, okay, what if? And this is a purely hypothetical situation. What if Govinda, in Aap Ki Adalat, had also announced that he had been offered a chance to star with Angelina Jolie in a movie called Jolie No. 1? But in the interview, he declines to say who actually connected him to Angelina Jolie for this to happen. Now, let's speculate about who could possibly have connected Govinda to Angelina Jolie. It turns out that he's friends with a number of his heroines, and his top heroines include Neelam, Kimi Katkar, Karisma and so on. One possibility is that he refused to share who it was because one of them connected him to Angelina Jolie. But the trouble is that not a single one of them has acted with a Hollywood actor. So that rules them out. What about his male co-stars?
If you look at the male co-stars, the top actors that he's acted with are Shakti Kapoor, Kader Khan and Gulshan Grover. In fact, you could argue that Shakti Kapoor is practically his boyfriend. Now, have any of these acted in a Hollywood movie? That's an open question. Let's explore. It turns out that the two people who have acted in Hollywood movies are Gulshan Grover, who's acted in a bunch of them, and Anupam Kher. And there is a chance that these might be connected to Angelina Jolie. But for that, we'll have to start the other way and look at Angelina's co-stars. If you look at the list of co-stars that Angelina has, there are a little over a hundred. Out of all of those co-stars, Dustin Hoffman and Jack Black are the only actors she's acted with a little more often. But among the Bollywood actors, it's only Irrfan Khan, only one person. So what this means is that the only way Govinda could reach Angelina Jolie is through Irrfan Khan, which can either be through Gulshan Grover or Sanjay Dutt. But there's a third possibility, which is Tabu. It turns out that Govinda has acted in four movies with Tabu, and she has acted in four movies with Irrfan Khan. So what that means is that Govinda needs to connect to Tabu, who needs to connect to Irrfan Khan, who needs to connect to Angelina Jolie for this to happen. And this is what could possibly lead to something like Jolie No. 1.

Now, here's the thing, right? How does one figure this out? How do we figure out what it takes for Govinda to reach Angelina through a co-star network? That's what we're going to explore, through a network exploration. This is useful in a number of scenarios, and every single one of those scenarios is considerably more boring than movies. So I'm not even going to talk about where you could apply network analysis, at least not yet. You can ask me the questions. But we are going to explore this journey.
How do we do that? Well, let's get into the code. The good part is that all of the data is available on IMDb. By the way, you can find this at imdb.com/interfaces. This is a page where you can download the IMDb datasets, the Internet Movie Database datasets, from datasets.imdbws.com. There are a bunch of files. These are all gzip files, which are tab-delimited, and they contain information about all the actors' names, the title alternate names, basic information about the titles, and the details about the actors in each title. Those are the four things that we really need for this piece of analysis. So here's what we're going to do. First, offline, we're going to download all four of these files and unzip them. I've already done that. These files are pretty large, by the way. If you look at the sizes, let's see, the title principals file is about 1.5 gigs, this one is about a gig, this one is about half a gig, and so on.

So what we're going to do is first use pandas and read the title basics file. Let's run it live. What this has is a set of fields identified by tconst. tconst is like the ID of the title, the film. Then there's a title type, which indicates whether it's a movie, or a short, or something else. A primary title, that's the name of the movie. And the start year, which is basically the year in which it was released. That's all the information about the movies. Now, if we want to connect two people, we also need to know who's acted in which movie. That comes from the next one, which is the principals information. This has three pieces of information that are relevant to us. The first is the ID of the movie, that is, tconst. The second is the ID of the person, that is, nconst. And the third is the category.
Now here, the category tells you whether this person is in this movie as an actor, or a director, or an actress, or a composer, or a producer, etc. We realistically just want to restrict ourselves to actors and actresses, though in reality, you could argue that Govinda could just as well talk to A. R. Rahman, who probably has better Hollywood contacts than, let's say, Tabu. So that is certainly possible. But we are going to restrict ourselves just to actors and actresses. And this is really the network. Incidentally, these networks are what are called bipartite networks. You don't need special data structures to have network graphs. You just need two columns, saying A connects to B. That's it. And you have a network. We'll also further reduce the dataset by saying that we are only interested in movies, not in shorts, not in TV series, etc. And we will only pick those actors who have acted in movies together. If we take that, the list is slightly smaller. And this is what it looks like: in this particular movie, this particular person was an actress or an actor.

The last piece of information that we need is the names of all of the people. If you say NM101955, who is that? So you need the ID, which is the nconst, and their name. We'll also get their year of birth. It's something that might prove handy when we want to filter across generations. So with this information, how do we figure out whether two people have acted together, or how to connect two people? This piece is the central information that we need: this movie has this actor. The library we're going to use is NetworkX. What NetworkX does is incorporate a whole bunch of graph algorithms. The first step in creating a network is to create it from an edge list. An edge list is basically a table where each row has one edge. An edge is basically a line that connects two things.
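The steps described so far (read the tab-delimited files, keep only movies, keep only actors and actresses, and treat the resulting two columns as an edge list) can be sketched as below. This is a minimal sketch, not the notebook used in the talk: the rows are made-up stand-ins for the real IMDb files, and the helper name read_tsv is hypothetical. The column names tconst, nconst, titleType, primaryTitle, startYear and category are the ones the IMDb files use.

```python
import io

import networkx as nx
import pandas as pd

# Made-up miniature versions of title.basics.tsv and title.principals.tsv.
# IMDb writes missing values as \N, hence the na_values argument below.
TITLES_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt1\tmovie\tFilm A\t1998\n"
    "tt2\tshort\tShort B\t\\N\n"
    "tt3\tmovie\tFilm C\t2007\n"
)
PRINCIPALS_TSV = (
    "tconst\tnconst\tcategory\n"
    "tt1\tnm1\tactor\n"
    "tt1\tnm2\tactress\n"
    "tt1\tnm9\tdirector\n"
    "tt2\tnm1\tactor\n"
    "tt3\tnm2\tactress\n"
    "tt3\tnm3\tactor\n"
)


def read_tsv(source, columns):
    """Read only the needed columns from a tab-delimited file or buffer."""
    return pd.read_csv(source, sep="\t", usecols=columns, na_values="\\N")


titles = read_tsv(io.StringIO(TITLES_TSV),
                  ["tconst", "titleType", "primaryTitle", "startYear"])
principals = read_tsv(io.StringIO(PRINCIPALS_TSV),
                      ["tconst", "nconst", "category"])

# Keep only movies, and only people credited as actor or actress.
movie_ids = titles.loc[titles["titleType"] == "movie", "tconst"]
cast = principals[
    principals["category"].isin(["actor", "actress"])
    & principals["tconst"].isin(movie_ids)
][["tconst", "nconst"]]

# Each row is one edge: a title on one side, a person on the other.
G = nx.from_pandas_edgelist(cast, source="tconst", target="nconst")
```

With the real files, the io.StringIO buffers would be replaced by paths to the unzipped downloads; usecols matters there because the files are large and carry columns this analysis does not need.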
In this case, one thing is the title of the movie. The other thing is the name of the actor. And we'll then say that I specifically want to connect the titles and the actors. So this gets created. Now what I'm going to do is just confirm that it actually works, so that I'm not missing any stuff. All right, we'll come to this bit in a minute. Now, once you have the edge list, how do you figure out the shortest path between any two actors? It turns out that there is a function called networkx.shortest_path, which takes a graph. And what is a graph? A graph is the... I ran out of memory, did I? Okay, fine. Then the question is... there goes my talk. Let's see if I can recover. Here's where having a saved version of the file helps. So I'm not going to run it. But actually, you know what? This is the fun part where you say, can we actually beat the odds? So I am actually going to run everything up to this part and see if we can still get it working. But I'm going to continue talking about some of the stuff that I've at least saved.

Let's say I took two actors, like Aachi Manorama, who some of you might know is a South Indian actress, and Angelina Jolie. And I won't introduce Angelina Jolie, because if I did, that would kind of defeat the whole purpose of my talk. Now, how could Manorama connect to Angelina? It turns out that she's acted in Rikshavodu with Paresh Rawal, who has acted in What If with Irrfan Khan. And obviously, we know that we need Irrfan Khan to connect to Angelina Jolie, and they have acted together in A Mighty Heart. That's the network. How do we get to this? Well, we used this function that I've just written called path. How does path work? What it does is first look up the name. Given the name, in this case Aachi Manorama, it says, give me the source ID, that is basically the nconst. Similarly, given a name like Angelina Jolie, give me the target ID, the nconst again.
And then run networkx.shortest_path, which takes G, the source ID and the target ID. The result is an array. That array has all of the nodes, the movies as well as the actors who are connecting them. And it gives an entire list of these. There may be multiple paths connecting A to B. Let's see if I can show you an example of multiple paths. Yeah. So Sylvester Stallone, for example, can connect to Salman Khan in a couple of ways. The first is through Incredible Love, in which he acted together with Akshay Kumar. And Akshay Kumar has acted in Mujhse Shaadi Karogi with Salman Khan. The second still connects Sylvester Stallone to Akshay Kumar, and Akshay Kumar has also acted in Jaan-E-Mann with Salman Khan. But net net, actor-wise, the only path that Sylvester Stallone has to connect to Salman Khan is Akshay Kumar. But you can see how he has two potential paths. So to connect from one to another, there may be multiple paths. And what networkx.all_shortest_paths does is return all of those paths. There is the shortest_path function, which just returns one of those paths. Which one? I don't know. One of those. And then what we do is convert those IDs. By default, it stores them as IDs, just to save space. We convert those IDs back into the title, if the ID starts with tt, or we convert it to the actor name, if it starts with nm. And that's what gives us this list.

So let's see if we have actually managed to get to this and are able to create the edge list. If this works, then Kalyan, I'm going to invite you to participate with me on this. What I'd love to do is have you pick two actors, just choose. And we'll figure out what the shortest path between those two actors is.

So I pick Dwayne Johnson and Priyanka Chopra.

Okay. So let's create a new cell and put in Dwayne Johnson.

Didn't they actually act together? Yes, they were in Baywatch.

Okay.
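The path lookup described above (name to ID, shortest path over the graph, IDs back to names and titles) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the function from the talk: the graph, the names and the helper names lookup_id, label, path and paths are all hypothetical. Only nx.shortest_path and nx.all_shortest_paths are NetworkX functions.

```python
import networkx as nx

# Hypothetical IDs, names and titles standing in for the IMDb tables.
names = {"nm1": "Actor One", "nm2": "Actor Two", "nm3": "Actor Three"}
titles = {"tt1": "Film A", "tt2": "Film B", "tt3": "Film C"}

# Bipartite graph: every edge joins a title (tt...) to a person (nm...).
G = nx.Graph()
G.add_edges_from([
    ("tt1", "nm1"), ("tt1", "nm2"),   # One and Two in Film A
    ("tt2", "nm1"), ("tt2", "nm2"),   # One and Two again in Film B
    ("tt3", "nm2"), ("tt3", "nm3"),   # Two and Three in Film C
])


def lookup_id(name):
    """Return the first ID whose name matches exactly."""
    for node_id, node_name in names.items():
        if node_name == name:
            return node_id
    raise KeyError(name)


def label(node):
    """Turn an ID back into a title (tt...) or a person's name (nm...)."""
    return titles[node] if node.startswith("tt") else names[node]


def path(source_name, target_name):
    """One shortest path; which one is returned is not specified."""
    nodes = nx.shortest_path(G, lookup_id(source_name), lookup_id(target_name))
    return [label(node) for node in nodes]


def paths(source_name, target_name):
    """Every shortest path between the two people."""
    found = nx.all_shortest_paths(G, lookup_id(source_name),
                                  lookup_id(target_name))
    return [[label(node) for node in nodes] for nodes in found]
```

Here paths("Actor One", "Actor Three") gives two routes, one through Film A and one through Film B, both passing through Actor Two. That is the same shape as the Stallone example: two paths, but only one connecting actor.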
And hopefully it will just say Baywatch and there won't be too much of a path. Let's see. Priyanka Chopra. That's doing its computation. And if it crashes, then I'm going to give up. But, okay. Yeah, they acted in Baywatch. So that's easy. They've already acted together. Let's take two actors who might not have acted together, and try to figure out how they might be connected. Any others?

Amitabh Bachchan and Nagarjuna.

You know what? I think they've acted together as well.

No, no, you're thinking of another one.

Who knows, they're probably more connected than you think. Nagarjuna. What's Nagarjuna's IMDb name? IMDb Nagarjuna Akkineni. Yeah, Nagarjuna Akkineni. Let's copy that and use him and see if they've acted together. That seems to be... yeah, they've acted together in Khuda Gawah, a slightly old movie. But yeah, what about his wife? Amala, what's her name? Amala. Nagarjuna IMDb, Amala Akkineni. Yeah, Amala Akkineni. They might have acted together too. But no, interestingly, it's easier for Amitabh to connect via Arvind. Not easier. But he can certainly connect through Vinod Khanna. But that may not be his only path. Let's look at all of Amitabh's paths to Amala, which sounds weird to say. Amala has acted with Aruna Irani and also Satyendra Kapoor, whom I've never heard of, and Chiranjeevi and a whole bunch of actors. So I guess this really is a pretty long list. He can also reach her via Amjad Khan. He can reach her via Nagarjuna, of course, understandably. Even Rajinikanth has acted with Amala, and there's a path via Mithun Chakraborty. These are all the shortest paths that he has.

But let's explore this a bit more. See, it's not just the shortest paths that can be interesting. What can also be interesting is who the co-stars are. Earlier, I had mentioned that if we take Angelina, she's really acted with only one Indian co-star.
And the way we can figure that out is to look at the network of neighbors for an actress like Angelina: see all the movies Angelina has acted in, who the co-stars are in each of those, and give it a count. So if you look at the 10 most frequent co-stars of Angelina Jolie, apart from Angelina herself, there's Jack Black, Dustin Hoffman, and a bunch of others she's acted with. Or if you take an actress like Sridevi, then her co-stars are, well, apart from herself, Krishna, Rajinikanth, Kamal Haasan, Gummadi, and then it falls off. So, any actor whose co-stars you want to explore?

Mithun Chakraborty and Rajinikanth.

Mithun, okay, let's see. Mithun on IMDb has an ID of something like this. Let's see what Mithun's network looks like. So Mithun's acted mostly with Shakti Kapoor. I did not know that. Prem Chopra next, but all of the others just fade in comparison. And let's take Rajinikanth, who has an IMDb ID of something or the other. Let's copy that and run this. Okay, so this is one of those things that keeps happening occasionally. There are some actors who don't have names. It turns out, for example, that this particular IMDb /name/ ID, yeah, is not necessarily there in the database, but is clearly Padma Kriya and is the same as this ID. So there are these corrections that we occasionally have to make. Where were those corrections that we made? I thought we made a few corrections here and there. Yeah, names. Corrections like these are what we need to make. In this particular case, I'll have to say names.loc of this ID is Padma Kriya. This is the real name, but there is one more ID which actually represents the same person, Padma Kriya, which has been redirected. So let's create an entry for her and add this. When we do that, hopefully we'll be able to see the pairs for Rajinikanth. But while that's coming up, what I'm going to do is switch over to a slightly different thing.
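The co-star count described above (walk from an actor to their movies, then from each movie to its cast, and tally) can be sketched on a title-to-person graph like this. The graph and the function name costars are made up for illustration. As in the talk, the actor is not filtered out, so they top their own list.

```python
from collections import Counter

import networkx as nx

# Hypothetical bipartite graph: titles (tt...) linked to people (nm...).
G = nx.Graph()
G.add_edges_from([
    ("tt1", "nm1"), ("tt1", "nm2"),
    ("tt2", "nm1"), ("tt2", "nm2"), ("tt2", "nm3"),
    ("tt3", "nm1"),
])


def costars(graph, actor, top=10):
    """Count how often each person shares a title with `actor`."""
    counts = Counter()
    for movie in graph.neighbors(actor):        # titles the actor is in
        for person in graph.neighbors(movie):   # everyone in that title
            counts[person] += 1
    return counts.most_common(top)
```

For nm1 in this toy graph the tally is nm1 three times (every one of their own films), nm2 twice and nm3 once. A missing name, like the one hit during the demo, would only matter at the stage where IDs are converted to names, not in the count itself.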
See, we can not just get the immediate co-stars, but we can also count the number of times they've acted together. So if we take, for example, all of the actors between 1980 and 2020 who have acted together in films, who are the most common pairs? It turns out that Kavya Madhavan and Dileep, a Malayalam pair, have acted together in 17 films. Koel Mallick and Jeet have acted together in 12 movies. Prithviraj and Jagathi have acted together in 12 movies. Trisha and Prakash Raj have acted together in 10 movies, and so on. And it might be possible to create a network out of this to see what the structure of actors looks like.

So that's where a site like Kumu.io comes in. Kumu is a pretty cool site where you can upload your dataset and start exploring the structure of that network. I've uploaded all of the Bollywood data onto the site and specifically restricted it to actors who have acted together in four or more films. So let's unfocus this and spend a few seconds looking at the network as a whole. This is what the whole network looks like, and you'll see that it's kind of a heavy structure. Some of you may not be able to see too clearly, but I do hope you get a broad sense of the structure of the network. Firstly, you can see that there are two halves. There's one part out here and one slightly more diffuse part out here, and you probably have your guess on what exactly this is. But let's zoom in and verify. This is a network of actors, remember? So if I just focus on the actors, you'll be able to see what this is about. It says Chittoor Nagaiah, that says S. V. Ranga Rao, then there's Savitri. So these are all South Indian actors, and more the Telugu movie industry. And you can see that there's another cluster fairly close by out here. Let's zoom into that, if it doesn't crash. Okay.
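The pair counts mentioned earlier, and the "four or more films" cut-off used for the upload, can be sketched with a self-join on the cast table. This is a minimal sketch with made-up rows; the column names films, nconst_a and nconst_b and the variable min_films are assumptions for illustration, and the threshold is lowered to 2 so that the toy data produces a result.

```python
import pandas as pd

# Hypothetical cast table: one row per (title, person).
cast = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt2", "tt3", "tt3", "tt3"],
    "nconst": ["nm1", "nm2", "nm1", "nm2", "nm1", "nm2", "nm3"],
})

# Join the table to itself on the title to get every pair within a film.
pairs = cast.merge(cast, on="tconst", suffixes=("_a", "_b"))

# Keep each unordered pair once and drop self-pairs.
pairs = pairs[pairs["nconst_a"] < pairs["nconst_b"]]

# Count the films each pair shares, most frequent first.
counts = (
    pairs.groupby(["nconst_a", "nconst_b"])
    .size()
    .reset_index(name="films")
    .sort_values("films", ascending=False)
)

# The talk keeps pairs with four or more films; 2 is used for the toy data.
min_films = 2
frequent = counts[counts["films"] >= min_films]
```

The frequent table is itself an edge list (person, person, weight), which is the kind of two-column-plus-count data a tool like Kumu can take as connections. On the full IMDb data a self-join like this can get large, so filtering by year or industry first is probably wise.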
This has actors like Ratheesh, Sukumari, Murali. I don't know how many of you are able to see those, but hopefully you'll be able to see some of the names: Shobana, Vikar, Nalini. And you can guess, well, because Mammootty and Mohanlal and Nedumudi Venu are here, that this is the Malayalam industry. And clearly the Telugu industry and the Malayalam industry don't mix as much. Now we can start looking at the rest of the structure. The Tamil movie industry would be pretty close. The Kannada industry is either this one or this one. But all of South India is reasonably closely distributed together. On the other hand, this is the slightly more diffuse network of North Indian actors. But even here you can kind of see two substructures. The major one, as you might have guessed, has actors like Sanjeev Kumar, Veena Roy, Sujit Kumar, Dharmendra. So that's mainstream Bollywood. But there's also another slightly different cluster out here. Let's pan towards that. This has actors like Anil Chatterjee, Sabitri Chatterjee, Kali Banerjee, Madhabi Mukherjee. So that's obviously the Bengali film industry. And then there are peripheries out here.

So it looks like there is a North-South divide among the actors as well. But are there crossovers? Well, let's look a little closely at who the actors at the junction are. Let's explore here. Some that are slightly closer to the Hindi movies side, but have still clearly acted with the South, are actors like Vyjayanthimala; Disco Shanti, who is actually a very interesting actress; Suresh; Zarina Wahab; Girish Karnad, who again, like Vyjayanthimala, is South Indian but became more popular in North India; actresses like Chitra, and so on. And then we have actresses who are again borderline. Let's see if we can... This is NTR, Chief Minister of Andhra Pradesh, who has acted in some Hindi movies as well. And we have actors like Arthi, who's acted in a few Hindi movies as well.
But by and large, you can see that barring a few people in the middle, it's pretty much a segregated network. And you can see the clusters: Bollywood; Bengali movies; not sure what this is; South Indian, which has sub-clusters, with Telugu and Tamil movies reasonably close together, Malayalam movies slightly separate, Kannada movies slightly separate; and then a diffuse network of other movies. All of this becomes possible with pretty much one library, which in this particular case handled all the network structure: NetworkX. And what I'd invite you to do is try out the NetworkX library. Apart from the shortest path, which we certainly used, you can also create network graphs. I didn't use NetworkX to create the graph here because it was too complex and I didn't have time. But you can create these graphs, and you can certainly also compute measures and even cluster the networks, to see whether the kind of structure that you saw here can be automatically derived from data. More than anything else, the important thing that I realized was that networks are not a special kind of structure. Anywhere you have two columns, like name of a movie and name of an actor, or product and channel, or country and language, you have a network. Anywhere you have two columns that are categories, not numbers, you have a network. And network exploration, like text analysis and a number of fields that are emerging today, is so underexplored that you just have to scratch the surface and you'll find something cool. So with that, I hope you learned something that was interesting and hopefully useful. I'd love to throw it open for questions.

Yeah, that was a really interesting and fascinating talk. So here we have a question for you. Are network clusters the same as knowledge graphs, and is there any real scenario where representing data in this format helps?

Okay, knowledge graphs are networks, and knowledge graphs can be clustered.
So let me explain each of these terms one by one. A network is where you have points and lines connected together. That's pretty much it. Network clustering is a specific approach where, just like you cluster anything else, you say this bunch of points belongs to one group and this bunch of points belongs to another group. You can do that with a knowledge graph, and the question then becomes, what is a knowledge graph? A knowledge graph is a network where you specifically have information representing some domain. For example, you may have obtained a knowledge graph from Wikipedia, which says that this particular person is related to this particular place by it being their place of birth. This particular person is related to this particular year by it being the year in which he or she was born, and so on. So you have things of various types, like places, years and people, and they are related through various kinds of linkages. From such a knowledge graph, which is obviously a network, you can create clusters. What might be an example of creating clusters? Well, one simple approach would be to say, I will filter the knowledge graph only by year of birth. Just like we filtered for actors here, we filter only for year of birth. And then once we start clustering people, we will find that the clusters simply have people who were born in the same year, because each person can only be born in one year. That becomes a very simple cluster. Or you could say, I will only cluster people with places, but it could be place of birth or place of death or place of work or any such thing. Now we will find slightly more diffuse clusters. You may find that there's a whole bunch of people that are concentrated, let's say, just in New York. But some of them came to New York having originally been born in London. So they are also connected to some Londoners, and there's a London cluster, and you'll find those crossovers.
So network clustering can be applied to knowledge graphs, and I hope that what I just gave you was a real-world scenario that helps.

Yep, that really makes sense. And one general question, this is from my side. What was the motivation or reason behind choosing actors to explain this network clustering scenario?

Okay, so see, here is how I usually run my PyCon talks. I take something that I've already done, see which is the most interesting, and put it out here. We had a hackathon where one of my colleagues, Niyaz, put together this exploration. So I had it and I said, this is cool. But the reason why he chose this particular representation for a hackathon is that networks are actually things that we don't really understand too well. It's not something that we cover in courses. And very often, at least from my early education, I found that graph theory can be pretty boring. Network analysis can be pretty boring. Even social network analysis, which sounds really cool, right? It's covered in a pretty boring way. So Bollywood was really my way of getting out of that boredom. And I just hope that I've passed on some of my way of getting out of the boredom to you.

So yeah, I think that's all we have for today. Thank you again for speaking at PyCon India.

Always a pleasure. Bye everyone.