Since this is PyData, I wanted to show you something with BigQuery that is really fun. Everyone here is a Python enthusiast? Yes? Do you know what the most popular Python libraries are? And do you know BigQuery? I already showed you a little at the beginning. In BigQuery we host a lot of public data, and everyone gets one free terabyte every month to analyze whatever data you can find in BigQuery. One dataset that I have not publicized enough, but find really cool, is that all the downloads done with pip are published here, streaming every minute. So you can come to these tables and, for example, look up which projects, which libraries, are the most popular. Let me write a query: take the file's project, count everything being downloaded (it's harder to type with one hand), group by one, order by two in descending order, limit a hundred. This is a really cool way to see what is happening with Python, what people are downloading. Can I zoom this in? These were the most downloaded libraries yesterday: pip, Python, and so on. It's really fun if you have your own obscure library and you want to see how people are adopting it, or which countries are downloading it. Don't just look at the raw numbers, though: most of the top libraries look super popular partly because people are launching 1,000 or 10,000 VMs every day, so the raw count alone doesn't tell you much. But you can slice along every dimension, by operating system, by kind of CPU, and find out what things are being downloaded together and how different systems behave. I don't have a lot more to say here, but if you have any questions, or if you want to try it out, you can find the data here. Any questions? Yes?

TensorFlow questions. Yes, just a quick one on TensorFlow.
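The query the speaker dictates above can be reconstructed roughly as follows. This is a sketch against the public PyPI downloads dataset; the table and column names (`bigquery-public-data.pypi.file_downloads`, `file.project`, `timestamp`) are assumptions based on the currently published dataset, not something stated in the talk.

```sql
-- Most-downloaded PyPI projects for yesterday (table and column names assumed)
SELECT
  file.project,
  COUNT(*) AS downloads
FROM
  `bigquery-public-data.pypi.file_downloads`
WHERE
  DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
```

Filtering on the date keeps the scanned bytes small, which matters when you are working inside the free monthly terabyte mentioned above.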
So I was wondering how you could actually integrate things like unbounded collections into TensorFlow to train your model and do stuff like that. Integrate what? Unbounded collections. For example, in Beam you have your PCollections and so on. How do you use those in the context of TensorFlow? Because I see you've been loading stuff into memory. What if the data you're trying to train on can't fit into memory?

Yes, okay. So the question is about training data that is bigger than what you can fit into memory, and that's absolutely a very real-world scenario. This is where the input function comes into play. TensorFlow has semantics for streaming data in. There are formats that can't be streamed: a NumPy array, for example, is just not a streamable format. But the one I'm thinking of is called the TFRecord. TensorFlow has its own protobuf-based binary format that you can write out. You can produce it either as a preprocessing step or as a live conversion, and then stream the records in. That lets TensorFlow load data into its memory buffer and shuffle within that buffer as it reads out samples. The other option: you mentioned Beam, and I'll mention Spark as well. Both of them, and Hadoop file systems and things like that, have ways of reading data out as a stream, and they all have Python libraries for doing that. So you could call those Python libraries from your input function, and now you're streaming data in one batch at a time, or several batches at a time. Does that make sense? Yeah, sure.

So do you know offhand whether it has any integrations with BigQuery for reading and so on? Yeah, on the point of BigQuery, there's a tf.cloud now.
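The buffered shuffle described above (stream records in, keep only a fixed-size buffer in memory, shuffle within it) can be sketched in plain Python, independent of TensorFlow. The function name and buffer policy here are illustrative, not TensorFlow's actual implementation:

```python
import random

def shuffled_stream(records, buffer_size, seed=0):
    """Approximately shuffle an unbounded stream using a bounded buffer.

    Only `buffer_size` records are ever held in memory at once, which is
    the trick that lets training data exceed RAM.
    """
    rng = random.Random(seed)
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) > buffer_size:
            # Swap a randomly chosen element to the end and emit it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)   # drain the remaining buffer at end of stream
    yield from buf
```

With a buffer as large as the dataset this is a full shuffle; with a smaller buffer it trades randomness for memory, which is exactly the trade-off TFRecord streaming makes.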
And I think it's something like tf.cloud.bigquery; you can read from BigQuery natively from TensorFlow now. Is Felipe around? Oh, he's in conversation. But yes, in short, yes. Okay, thanks.

Other questions? Hi, thanks for the presentation. I have a question: is there anything in TensorFlow for when I'm preparing a dataset and I have numerical variables, categorical variables, and another variable which contains paragraphs of text? Paragraphs? Yeah, say, logs. Like a what? Logs. Yeah. Is there anything available in the TensorFlow packages, some technique, so that I can turn that column into some kind of unique representation I can use for my modeling?

How long is the text per value, per row? It depends. Say I'm taking the last 15 to 30 minutes of logs; sometimes the unique columns are 20 or 30, but sometimes it goes up to 50. 50 what? 50 rows, you could say. Right, but what I mean is: for each row, you said there's a column that has text. Right, exactly, just free text, like a little message. So a single text cell is a message? Okay, one that might contain 10 sentences.

Sure. Once you get into free-form text, we're getting into the natural language processing space, and there's a spectrum of answers, from simple ones to really complicated implementations. Yeah, what I'm currently doing is using pandas regular-expression extraction to pull those things out and get some keywords. Okay, and I was wondering whether TensorFlow has packages that can do that job more easily. It doesn't have anything built in like what you're describing, an out-of-the-box natural language processing toolkit. But you can certainly apply any number of existing ones; NLTK, the Natural Language Toolkit, is a classic one.
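The regex-based keyword extraction the questioner describes can be sketched like this. The log format, the pattern, and the field names are invented for illustration; you would adapt them to your own logs:

```python
import re

# Hypothetical log-line shape; adjust the pattern to your own format.
LOG_PATTERN = re.compile(r"(?P<level>ERROR|WARN|INFO)\s+(?P<component>\w+):")

def extract_features(message):
    """Map a free-text log message to categorical features via regex,
    so the text column becomes ordinary categorical model inputs."""
    m = LOG_PATTERN.search(message)
    return m.groupdict() if m else {"level": None, "component": None}
```

The resulting dictionary slots straight into a numerical/categorical feature pipeline, which is the preprocessing step the questioner is doing with pandas today.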
Alternatively, you could use the Google Natural Language API that Kaz showed earlier. You can use that to pull out salient entities: it will pull the entities out of the sentence along with scores for how relevant they are, and it will score phrases as positive or negative sentiment, and things like that. That's useful for specific use cases. I don't know if that particular piece will be useful for you, but it's one option for simple preprocessing, because in that case you would be preprocessing using an existing text model.

But presumably, if we wanted to think bigger: the messages in your particular dataset are going to be phrased in a particular way and use vocabulary that is specific to your use case. So ideally you would extract just that one column, all the text you have, and train a dedicated text model that understands the vocabulary of your dataset. Then it will be able to pull out salient features with much higher confidence, accuracy, and performance. Right, okay, good. So you're layering machine learning on top of itself: training one thing, then feeding its outputs to another. So, zooming out on the architecture, you'd have a repository containing the field-specific vocabulary sets to process in here? Exactly, yeah. Thanks.

Other questions? There are also questions for Kaz, and Felipe wandered off, but... I have a question. If I have a large dataset and I want to train on it iteratively, and in Keras we use a data iterator to train: can I make my own dataset into a TensorFlow dataset? Are you referring to the Datasets API in TensorFlow, the new one? Yeah. So what do you mean? What kind of...
You make your own dataset into... sorry, I didn't completely follow. Casting it into a TensorFlow dataset. Wait, sorry, I missed the beginning part. Is there a way to cast my own dataset, for instance a CSV file, into a TensorFlow dataset?

Oh, yeah, absolutely. The Datasets API gives you a whole bunch of ways to read in data. CSV is one type. You can also pass in a Python generator, and that opens the door to just about anything you want: anything you can wrap in a Python generator and yield values from. I forget all the other ones, but there's a whole bunch of different options. And you've clearly played with it a little more than perhaps other folks, so I'll mention it, since I didn't show it today: the Datasets API is a relatively new feature in the TensorFlow library, with support for distributed data loading, iteration, shuffling, batching, and all those semantics. It's just a much cleaner API surface. So thank you for bringing it up, and I definitely encourage folks to check it out as the de facto new standard for working with datasets. I just don't like the name, because when I say "datasets" you don't know whether I mean data sets in the general sense, sets of data, or the Datasets API. But those are just gripes.

Yeah, I would like to hear your views: what are the distinct differences, or how does TensorFlow compare to PyTorch? So, I haven't had too much time with PyTorch, unfortunately. I know they both largely originate from the same kind of roots in Theano and have a lot of their design based there. And in terms of functionality, they largely do the same things; there's so much commonality.
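Returning to the Datasets API question: the generator route described above can be sketched as a plain-Python generator that reads CSV rows lazily, of the kind you could hand to `tf.data.Dataset.from_generator`. The TensorFlow call in the trailing comment is indicative only, and the file and column layout are invented:

```python
import csv
import io

def csv_examples(fileobj):
    """Lazily yield (features, label) pairs from a CSV stream,
    so the whole file never has to fit in memory."""
    reader = csv.reader(fileobj)
    next(reader)                      # skip the header row
    for row in reader:
        *features, label = (float(v) for v in row)
        yield features, label

# With TensorFlow this would plug in roughly as:
#   ds = tf.data.Dataset.from_generator(
#       lambda: csv_examples(open("train.csv")),
#       output_types=(tf.float32, tf.float32))
```

Because the generator yields one row at a time, the same shape works for a CSV on disk or for any other source you can iterate over.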
Look at Keras, look at MXNet, all these other libraries: all the common things you would do are supported across all of them; it's par for the course at this point. I would argue that one of the big draws of TensorFlow, for me, is the community: being able to go on Stack Overflow and find that the question I'm thinking about is already answered, as opposed to not even being able to figure out how I want to phrase it. The community is just so much bigger with TensorFlow, and the support is there. That being said, I do recognize that one of the big things people like about PyTorch is immediate execution: not having to construct a graph and then make a session and wait on the response, et cetera, as you would with TensorFlow. Right now, in TensorFlow 1.4, in contrib, there is TensorFlow Eager mode. You can enable that flag, and it will basically let you create something in the graph and execute it right away; you don't need to create a session or do the session.run stuff anymore. So that's where that's going, and it's available today in relatively stable form.

Hi, this question is perhaps for Kaz. Do you know if there's any difference between the engine used by YouTube and the engine for the GCP Speech API? The question was also about YouTube: the difference between YouTube closed captions and... Ah, I don't know anything about that, sorry. Do you know anything about the YouTube... what did you say? YouTube what? Closed captions. Subtitles. Speech recognition for subtitles. Is it the same engine? No, I don't know anything about that, sorry. And maybe we are not allowed to talk much about the details of YouTube. We can only talk about what's public, the papers from the YouTube team, such as the one on recommendations.
Yeah, right. The YouTube team did publish a huge video dataset, I don't know if you're aware of that: eight million YouTube videos, labeled and annotated. Really useful for any kind of video machine learning training use case you might have. So there are a lot of cool public datasets out there; that's a whole other talk. But in terms of the speech recognition, I don't think there are that many different parallel projects inside Google, so it's quite possible that we are sharing the same backend: the Android team, the YouTube team, and Google's Speech API. You're specifically referring to the automatic subtitle generation? Yeah. So I suspect that's just different versions lagging behind each other or something.

The other thing I want to add, and I think this is largely true: most, if not all, of the state-of-the-art research coming out of Google Research and the like is being published, some of it with more publicity than others, so the tech is there, at least in principle. Building out the scalable system to support something as large as YouTube, doing it at that kind of scale, is the really hard engineering challenge. Oh, between the Speech API and YouTube subtitles? I have no idea. Or you can't say? I really don't know. And there's no way to really see how good the YouTube subtitling is from an external perspective, so I wouldn't be able to figure that out, I guess.

So, if I have a use case that is automatic generation of karaoke lyrics, creating time tags, which products would you recommend? Interesting: automatic karaoke lyrics generation from songs. You may have to train your own model for that, because it's not the usual dictation kind of thing. But that should be interesting. Yeah, thank you. Maybe you can join the YouTube team and find out.
Looks like the food finally arrived. No more starving. Thank you so much.