Hi, and welcome to my talk, a deep dive into building streaming applications with Apache Pulsar. I'm Tim Spann, a developer advocate at StreamNative, and I'm pushing all things open source. Hopefully you'll like it. I wish I was there in person; hopefully next year.

I've been working with different open source and streaming technologies for over 15 years: things like NiFi, Spark, Spring, big data, Hadoop, data lakes, IoT, all those fun things. I run a weekly newsletter called the FLiP Stack Weekly. Check it out; it's a great way to find out what's going on with Flink, Pulsar, NiFi, Spark, Delta Lake, and all kinds of other open source technologies: demos, code, webinars, videos, articles, cool tools I've seen out there, great use cases, some cool pictures from events, and pictures of my cat. So come take a look.

I'm here to talk to you about Apache Pulsar, and this sums it up in one sentence, with a lot packed into it. Cloud native: it was designed from the beginning to run in the cloud, which means we like to separate compute and storage. We do both messaging and event streaming, and I'll show you how we accomplish that with one platform, on any cloud, every cloud, every day. So you have one platform to do all the messaging you'll ever have to do, with a guarantee that your messages get delivered. It's a very resilient system that keeps running even when servers go down, even when things are not available; we've got you covered. And it scales out as large as you need to go. This has been proven at a couple of large cloud companies in India and China that go out to petabytes, thousands of nodes, millions of messages and events a second. Scale as big as you need to go.

Why use Pulsar? It's very easy to build microservices and real-time apps. Asynchronous communication is the bare minimum; it's a highly resilient system with the storage built in, doing whatever you need under the covers.

The architecture is not too complex, but it's separated into three different types of servers. We've got Pulsar brokers to handle the message routing and connections. They're stateless, with a little cache to make things quicker, and they automatically load balance; this is what you're going to be communicating with as a developer. Behind the scenes, all the messages get stored in Apache BookKeeper "bookie" nodes, which handle how messages are stored and retrieved. Both the Pulsar brokers and the BookKeeper bookies use a metadata storage layer for their metadata and service discovery. That can be traditional Apache ZooKeeper or, especially for the Kubernetes people, etcd, and on the small scale you can do it with RocksDB. The API is open, so we expect more options will appear. This layer just handles storing any metadata you need, and there's a lot of it in a big system, as you might imagine.

The main thing we handle in Pulsar is messages; that's what we call any of the data, and a message is broken up into a couple of key components. The most important one for you is the value, your data payload. This is raw bytes, but since we have a schema system it can conform to a schema, so we can ensure that, hey, it's JSON, and these fields are not nullable, all those sorts of things. We really want you to have a key. It's optional, but it makes your world a lot easier, and I'll show you how easy it is to set. We use it for partitioning and topic compaction, and it helps you identify messages when you're debugging, logging, or auditing. So please, put a key in there. Properties are any other things you want to add to your message: they're not in the main body of the message, but they go along with it as key-value pairs. Put in a couple that make sense for you; they're helpful for logging or auditing, or another way to tuck some extra data in there if you need to. Set a name for your producer; we'll set a really bad one for you if you don't, so please put a name there. It's helpful for auditing and used in message de-duplication, so set it, just like a key. It's also important to have a sequence ID to make sure messages stay in order inside the topic; again, for streaming that's important, and it's helpful when we have to do de-duplication, so you don't have to touch it, it'll be handled for you.
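To make those pieces concrete, here's a minimal sketch of setting the value, key, properties, and producer name with the official pulsar-client Python library; the broker URL, topic, and property names are placeholders for this example:

```python
import pulsar

# Connect to a local standalone broker (placeholder URL).
client = pulsar.Client('pulsar://localhost:6650')

# producer_name is optional but helps with auditing and de-duplication.
producer = client.create_producer(
    'persistent://public/default/readings',
    producer_name='demo-producer-1')

# value = raw bytes, partition_key = routing/compaction key,
# properties = extra key-value metadata that rides along with the message.
producer.send(
    'my payload'.encode('utf-8'),
    partition_key='device-42',
    properties={'source': 'demo', 'version': '1'})

client.close()
```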
We mentioned messaging and streaming. They are really similar, but not the same. Message queuing is what we usually use for work queues. Things don't have to be in order, because you just want workers to take a message and run with it; whoever gets the next one, great. This is what you want when you're scaling out data processing: it doesn't matter when it arrives, I just want it processed as quickly as possible. Put things in a dumb pipe, they come out the other end, and people run with them. That's a message queue, and we do that with Pulsar. Streaming covers a lot of the use cases you've seen with data lakes, Hadoop, and other systems. I want things in order, because they came out in order: maybe from a log, a time series database, IoT, CDC from tables. I want it consumed in a controlled manner, in order, with the semantics you know, at-least-once, exactly-once; all of those are handled for you, you just make a couple of decisions.

Now if messaging and streaming were all we did, that would be a lot, but you need to be able to connect to the system and do lots of other things. So one of the important pieces out there is connectors. These make your job easier so you don't have to write code for everything. Maybe you like doing that, I certainly do sometimes, but sometimes it's nice to just set up a little configuration file and have it work for you. Do the fun stuff, not the boring stuff of picking up one message and dropping it into a table; the connectors do that for you. There are source connectors for reading from things like Debezium, MySQL, Cassandra, Kafka, data lakes, S3, Hadoop, all those kinds of fun things. A simple configuration file in YAML or JSON gets the data from them and puts it into a Pulsar topic for you in a proper format (remember those schemas), and then you use it however you want. Easy as can be. And we've got the same thing for data coming out: once something gets into a Pulsar topic, maybe you always want it to go to MongoDB, to ScyllaDB, to a Delta Lake, to S3, to Kafka, wherever. Set up a little config file and it runs for you automatically, just keeps going. Nice feature.

Related, but not the same: we support functions. These are lightweight bits of code in Java, Python, or Go that run either in the Pulsar broker or in your own Kubernetes cluster; we have Function Mesh to empower that. This is not meant to replace Spark or Flink, or our buddies at Timeplus or Decodable. This is just for doing little bits of stuff: convert one type to another, change a couple of fields, do an enrichment or a lookup, take a big blob of JSON, break it up into smaller bits of JSON, and send those to different topics. Make routing decisions on the data, that sort of stuff. Maybe run sentiment analysis, or run an existing machine learning library you have. Great fun, very useful, kind of like database triggers; a sketch of one is below.
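As an illustration of that splitting-and-routing pattern, here's a minimal Pulsar Function sketch using the Python SDK's Function class and context; the topic names are made up for the example:

```python
from pulsar import Function
import json

class SplitBlob(Function):
    """Break one big JSON blob into smaller pieces and route them."""

    def process(self, input, context):
        blob = json.loads(input)
        # Route each section of the blob to its own topic (placeholder names).
        for section, payload in blob.items():
            context.publish(
                'persistent://public/default/split-' + section,
                json.dumps(payload))
        context.get_logger().info('split %d sections', len(blob))
        return None  # no default output topic needed here
```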
Pulsar protocol handlers: these are awesome. This is what sets Pulsar apart from anything else. We mentioned you've got all these sources and sinks, and that's great, but what if all of my people want to use Kafka, and I want to run Pulsar? Well, let's do both. I'll run a Pulsar cluster and turn on a protocol handler to allow Kafka, so now all those existing Kafka libraries, including ksqlDB and Kafka Streams (Ricardo tested this live, pretty impressive, I'll send you that link), can just use this as a Kafka broker. That's great: now I have a Kafka cluster and a Pulsar cluster while only running one piece of infrastructure. Someone else wants MQTT? Turn that on. Someone wants RabbitMQ's AMQP? Turn that on. Someone wants RocketMQ? Turn it on. What's nice is I don't care how that data comes into Pulsar or how it goes out; I can mix and match. Someone pushes a message to me with Kafka, I pull it out with Pulsar. MQTT in, AMQP out, it doesn't matter, or all of them at once. I can have someone subscribe to the data via any of these protocols, plus WebSockets. Very nice.

Now, I mentioned our functions are not Spark. So we support a very robust Spark connector, a robust Flink connector, and a robust Presto/Trino connector. All of these let you write code against Pulsar, and also SQL, which gives you a real-time SQL engine on any of them; I can access an event or a message as it arrives. Pretty powerful stuff.

And in case you tell me, "Well Tim, I want to put petabytes of data in there, but I don't really want to run petabytes of SSD, that's a lot of money": okay, once data reaches a certain age, or size, or whatever other rule you think of, let's tier it out into object stores like S3, where it doesn't cost that much to store hundreds of terabytes or petabytes. It's transparent to you. I still get the data the same way, I still produce it, consume it, read it from Spark. It doesn't matter; it just saves you money.

Now, sometimes that data is raw bytes, and sometimes it's not. Very often it's not; maybe 95% of my use cases are data that looks like a table. Maybe it came out of a table, maybe it's going back into a table, maybe it's structured or semi-structured: JSON, Avro, Parquet, CSV. With schemas I can make sure that data stays consistent, so we can have a contract between all my producers of data and all my consumers of data, whether that's Spark SQL, Flink SQL, or your own app, in C#, Go, Python, Kotlin, Rust, Node.js, it doesn't matter. Let's agree on what this data is: what the field names are, what the types are, what kind of data it is. Let's build a schema, and Pulsar will version it automatically for you. You don't have to know how to write schemas; if you do, great, just upload one to the system and you're good. If you don't want to do that, create a class in Python or Java, put in the names you want, the types you want, describe whether a field is nullable, and send it to the system. Boom, now you've got a schema. Most people do Avro or JSON; you can also do Protobuf and some others. Again, everything in Pulsar is open source and extensible; if you want to add more, do it.

So we mentioned those different protocols, and Kafka is one of them. The thing I want to point out is that this is not a paid add-on. It's not some half-written proxy, it's not some kind of hack, and it's not doing double the work. This is a native handler, the same as the ones used for everything Pulsar does, except this one speaks Kafka as well. You can run both, use all the Kafka libraries out there, and it's just another way to get your data into and out of Pulsar. It doesn't change how anything works underneath the covers: I still get all the features of Pulsar, get the data in and out the same way, still have tenants and namespaces and full security and all the libraries and all the sources and sinks. It just makes your life easier if you've got existing apps.
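To show what that looks like in practice, here's a hedged sketch of a plain Kafka client talking to a Pulsar cluster that has the Kafka protocol handler (KoP) enabled; the bootstrap address and topic are placeholders, and it assumes KoP is listening on the usual Kafka port, 9092:

```python
from kafka import KafkaProducer  # standard kafka-python library, no Pulsar code

# Point the ordinary Kafka client at the Pulsar broker running KoP.
producer = KafkaProducer(bootstrap_servers='pulsar-broker:9092')

# From the client's point of view this is just a Kafka topic;
# Pulsar stores it like any other topic underneath.
producer.send('sensor-readings', key=b'device-42', value=b'{"temp": 21.5}')
producer.flush()
producer.close()
```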
So say you've got an existing device, like one of these, that can only push out MQTT. You can still use Pulsar: I turn the handler on, it looks like an MQTT broker to the outside libraries, boom, the data goes in, and we're ready to go. No difference for you. Same if it's RabbitMQ or AMQP, and we've got one for RocketMQ. It's extensible, so you can write your own.

Now, if you want to get data out and you don't want to write a little client or use the command line, and you're like, "I just want to see what's in my topic right now": Presto/Trino is a very fast, great SQL tool. There are web UIs and GUIs you can install to connect to it, JDBC and ODBC drivers, great Python connectors. It's a great way to query data stored in Pulsar, whether it's sitting on those BookKeeper nodes or out in that tiered storage; it doesn't matter. Get all that data out, run full SQL really fast, a great way to see your data as it is right now: you run the query, it completes, there it is. I say that because it's different from what I do with things like Flink, and potentially Spark, depending on what type of SQL I want to do. There's a sketch of a Trino query from Python below.

Now, in the example app I'll show you, I've got data coming from a device going natively into Pulsar. It could have done MQTT, it could have done WebSockets (I've got examples for those two), it could have done Kafka; you come up with a protocol, we'll do that too. When it comes into Pulsar, I've got a function that does a little management, and I also have a sink that drops it right into Delta Lake. I can use that data in Delta Lake, query it from Spark, or point Spark right at Pulsar; a couple of options there. We've got those sinks in there; I'm using the Delta Lake one, and there's one for Hudi and one for Iceberg. Lots of lakehouse options out there, and we support them all. It's very straightforward to get your data into these impressive lakehouse structures, without a lot of extra work.
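As an example, here's a rough sketch of querying a Pulsar topic through Trino's Python client, assuming Pulsar SQL is running and exposes the usual pulsar catalog, where each topic shows up as a table under its tenant/namespace schema; the host, port, and names are placeholders:

```python
import trino  # the standard Trino Python client

conn = trino.dbapi.connect(
    host='localhost', port=8081, user='tim',
    catalog='pulsar',            # Pulsar SQL catalog
    schema='public/default')     # tenant/namespace acts as the schema

cur = conn.cursor()
# Each topic appears as a table; quote it since the name may contain dashes.
cur.execute('SELECT * FROM "sensor-readings" LIMIT 5')
for row in cur.fetchall():
    print(row)
```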
Pulsar does a lot for you, like built-in geo-replication, and I want to emphasize that this is all open source, not commercial. All these features are there, all these connectors are there, everything is Apache licensed. You use it the way you'd expect to, and it just runs. And it doesn't matter what sources and sinks you might be connecting to, or what clouds you're running this stuff on. I can have a cluster in AWS in one availability zone, another in Google or Microsoft or on premises. It's a great way to get your data from on-prem to the cloud, maybe in different amounts, maybe converted and cleaned with functions and brought to special topics that just get geo-replicated. These things are very easy to configure. It's a great way to buffer data before it goes somewhere, batch it up into chunks if you need to, route it wherever it needs to go, filter out some of the junk or the dupes, aggregate it and just send an aggregate to somebody, enrich the data along the way, replicate it between clusters in different parts of the world, get rid of those dupes, decouple your different systems, and distribute the data to as many places as want it. People subscribe to the data and they get it; you can have as many subscribers as you want, and a subscriber can be something like Spark, it could be Elasticsearch, it could be your own Python app, a Rust app, a C# app, a Node.js app, a Kotlin app, a Scala app, a Java app, a Spring app, a Quarkus app. Tons of options.

Now, we mentioned those functions before, and maybe I downplayed their importance: it's a full serverless event streaming framework, kind of like AWS Lambda, but all open source, and you run it either in your brokers or in Function Mesh on open source Kubernetes environments. Use Pulsar as the message bus to connect all these pipelines, and it's very easy to run. What's nice is you've got your choice of Java, Python, or Go, and you can run ML libraries or whatever libraries you want. Specify one or more input topics (it could even be a wildcard). Inside your function you can have stuff logged out to a topic, or you can have it go to one or more outputs, or no output at all if you're just doing some kind of processing; maybe it goes into the context's state, maybe it goes into a file system or some other data store. Really easy to write these functions.

If you want to run along with me, save these slides; they're on the website. Download them and follow along, because I've got all the details on how you can run Pulsar, try out different use cases, and learn the basics. Download a release if you want to do it on premises, untar it, type bin/pulsar standalone, and you're ready to go. You'll need a JDK, but other than that you're fine; it runs on Mac, Windows, all the Linuxes, whatever. If you don't want to worry about that infrastructure stuff, just run it in Docker, or in Kubernetes or another environment that runs containers; there are some other tools out there that do it just as well. Once you've got something running, or say you're using StreamNative Cloud or someone else's hosted Pulsar environment, the easiest way to interact with it is the command line interface. You can certainly use REST, you can certainly use different web UIs, but the CLI is a great way to learn how to use it, or to set things up in a DevOps tool: we create a tenant, create a namespace, and create topics underneath there.
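For the DevOps-tool route, here's a hedged sketch of doing that same tenant/namespace/topic setup against the Pulsar admin REST API with plain Python requests; the paths follow the admin v2 API, but the host, names, and cluster list are placeholders for a local standalone:

```python
import requests

admin = 'http://localhost:8080/admin/v2'

# Create a tenant (allowed clusters must match your cluster names).
requests.put(admin + '/tenants/meetup',
             json={'allowedClusters': ['standalone']})

# Create a namespace under that tenant.
requests.put(admin + '/namespaces/meetup/newjersey')

# Create a (non-partitioned) persistent topic under the namespace.
requests.put(admin + '/persistent/meetup/newjersey/first')
```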
Now, the important thing I mentioned there is tenants. Pulsar is multi-tenant; that's why it can be the unified messaging platform for everything. You could get rid of your Kafka, your MQTT, every RabbitMQ, every other messaging system you have, and put it all under Pulsar: create tenants for all the different apps, turn on all those different protocols, have namespaces for different apps, groups, companies, whatever, with everything secured under each one based on whatever security you want to set up. At the end, create your few hundred thousand topics under each namespace and you're ready to go.

Now, the naming of topics is not transparent until you look at it very carefully. That first word, persistent, is not a generic keyword or something unimportant: persistent means I'm going to store these messages, perhaps forever. If you don't want that, you can use non-persistent, which is for the case where, say, I've got a very loud device that keeps saying the same thing constantly, every fraction of a second. Maybe I don't need all the data, I just need to sample it or pick messages at random, and it doesn't matter if I lose some: choose non-persistent. Most people use persistent, stored forever if you need it. Then I set up, say, conf as my tenant, europe as my namespace, and first as the name of my topic, and you see all the topics in that same area. Very straightforward.

Now, if I want to interact with it, send some data, consume some data, the easiest thing to do is install the Python library. Make sure you're running Python 3.10 or better, I would hope, and just do pip3 install pulsar-client. You can stop there if you just want the current client with the basic functions; if you want everything on the latest version, I've got that listed. This installs on Mac, Windows, NVIDIA Jetson, Raspberry Pi, a bunch of different architectures: aarch64, ARM, Apple M-series, Intel. That's probably all you need to do. If that doesn't work, maybe you're on some funky architecture; you can build and install the C++ edition and then install the Python client on top of it. That's a little extra work, but if you need to do it, you can.

It's very easy to send data with Python: import the pulsar library, connect to a cluster, create a producer for that tenant/namespace/topic, and send the data. Here it's just a UTF-8 string, the simplest use case. That works, but it's a really simple case, and most people have security. Okay: I have SSL, so I use pulsar+ssl and put in the right port, the same topic, namespace, whatever it is, and then apply my authentication parameters. In this case I'm doing OAuth, and it happens to be against StreamNative Cloud, but it could be anybody's cloud, and there are other authentication options; this is a good one. Pretty easy to do with full security, and it doesn't take much extra work; I turn it on and off in apps with a single parameter. You want to add a schema? Okay, let's add Avro. A schema is easy in Python: I just write the class, with the name of each field, the type, and whether it's required or not (this one's not), wrap the class with AvroSchema, attach it to a specific topic, and we're ready to go. I send a record of that class along with a partition key, and we're ready to go. And if I want to do the same for JSON, it's very easy: again I have the class, I connect, wrap the class with the JSON schema, and send it. Notice that I put the producer name in there as a property, and I'm sending that key again, useful for partitioning, debugging, de-duping, lots of reasons: send the key.
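Since the slide code isn't in the transcript, here's a rough reconstruction of that Avro-schema producer; the service URL, class, and field names are assumptions, but the shape follows the pulsar-client Python API:

```python
import pulsar
from pulsar.schema import Record, String, Float, AvroSchema

class Weather(Record):
    # Field names and types become the Avro schema automatically.
    station = String(required=True)
    temperature = Float()          # not required, so it is nullable

client = pulsar.Client('pulsar://localhost:6650')

producer = client.create_producer(
    'persistent://conf/europe/first',   # tenant/namespace/topic from the talk
    schema=AvroSchema(Weather))

# The partition key rides along for routing, compaction, and debugging.
producer.send(Weather(station='EWR', temperature=21.5),
              partition_key='EWR')

client.close()
```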
Now, if I want to get the data back, I connect with that client again; if you're using OAuth, it's that same setup as before. The important thing you see here is the subscription, and I give it a name. That's my subscription now; people can share them, but if I'm the only one using it, the server will manage where I am in the topic and I will never lose a message. Here I receive the message. Until I acknowledge that message, it does not change in the system; once I acknowledge it, it's acknowledged on the server, and then, if I've decided I want my data to expire and not be stored forever, it will be gone. Something to think about: you don't have to acknowledge it. You can negatively acknowledge it, or just ignore it, but then it will stay around, potentially forever, and that's how you get petabytes of data, which may be something you want, maybe not. Just remember that until I acknowledge a message, it's sitting in that subscription for me, and it won't be deleted unless the system has no choice, so we'll keep it there.
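Here's a small sketch of that consume-and-acknowledge loop with the Python client; the service URL, topic, and subscription name are placeholders:

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# The subscription name is what ties this consumer to its place in the
# topic; the broker tracks the position for us.
consumer = client.subscribe(
    'persistent://conf/europe/first',
    subscription_name='my-subscription')

while True:
    msg = consumer.receive()
    try:
        print('got:', msg.data(), 'key:', msg.partition_key())
        consumer.acknowledge(msg)           # now the broker can expire it
    except Exception:
        consumer.negative_acknowledge(msg)  # redeliver later instead
```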
Maybe you want to use another library. I can use the native libraries for those protocols I mentioned. MQTT is easy; nothing in the code says Pulsar other than the name and the server I put in there, just so you know where we are. WebSockets take a little bit of special formatting: I've got to Base64-encode my message, but other than that it's very easy to send. It just goes to my Pulsar cluster and produces a message that's persistent, going to this tenant, namespace, and topic. Very straightforward. And I could use the Kafka library to do the same thing.

Now that I've got my data in, let's create a topic and create a function against it, and I've got a Python function to do that. It logs to a specific logs topic and puts the results somewhere else. This one just runs sentiment analysis on anything that comes into the topic and sends the results to another topic, which makes it very easy for me to put that in a web page and pull it with WebSockets. A nice way to run a simple app with very simple coding. It's very easy to write these functions in Python, as you see here, and if I use the full Pulsar SDK I get access to a logger, to metadata like the message ID, all those built-in things that are nice to have.

If I want to do a Go app, Go supports connecting to Pulsar as well, and it's an extremely fast library, so if you're a Go person, enjoy. Same with Rust; those apps just fly when you're pushing data through Pulsar. Java has first-class support, so if you want to write your apps in Java it's very easy to do, and you can also use the new Spring library or the Quarkus library; lots of other options out there. Producing data in Java is, you know, a little verbose, but we've got the OAuth just like in Python if you need it. And if you want to use a different library from Java, you can use all of those, including the Spring one for RabbitMQ, the Spring one for MQTT, the Spring one for Kafka. Making a very simple producer in Java: create a producer, send a message with a key and a value, set a couple of properties, and you're ready to go. Very simple.

If you want to create a function in Java, you can do that using the standard Java function interfaces, nothing Pulsar-specific: pick an input, pick an output, give it a name, ready to go. If you want to take advantage of the full SDK and get access to a logger, the ability to send to different topics, the built-in key-value store, and access to metadata, you're going to want to use that SDK and get that extra data.

When you want to subscribe, you pick your subscription name, like I mentioned. I didn't mention subscription types; we only have 40 minutes here, and we could go deep into subscription types. This is how we decide whether you're messaging or streaming. With a Shared subscription I'm messaging: I get the first message available, and so could 50 of my friends, and we process this data all together as a team; whoever grabs one does it, and the next one grabs the next. That may not be what you want if you want things in order, or Kafka-style streaming, so you'd pick a different subscription type; that's the only difference in how data comes in. To be streaming, I change that to Exclusive, or another ordered subscription type; if I want to do messaging, I set it to Shared. Very easy, and again, it's that subscription name that ties you to the topic. Multiple people can use the same name, but if you're in Shared, whoever gets a message first, that's their message. Something to think about when you decide how you want to do this.
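In the Python client, that choice is just the consumer_type argument; a hedged sketch, with placeholder names:

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Messaging / work-queue style: many consumers share one subscription,
# and each message goes to whoever grabs it first.
worker = client.subscribe(
    'persistent://conf/europe/first',
    subscription_name='work-queue',
    consumer_type=pulsar.ConsumerType.Shared)

# Streaming style: one consumer owns the subscription and sees
# every message in order.
streamer = client.subscribe(
    'persistent://conf/europe/first',
    subscription_name='ordered-stream',
    consumer_type=pulsar.ConsumerType.Exclusive)
```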
It's also very easy to produce events from Java: again, use the schema, point it at a Java class. I don't have to create a schema with some kind of schema editor; I can just create a class with the field names and types I want, and I'm ready to go. If you want to create it with a formal Avro schema, you can, but this makes it easier. I can set timeouts, I can send things sync or async; lots of options for working with things, and lots of access to metrics, whether you get them through the REST API, export them to Grafana or other systems, look through JMX, or use the command line tools. There are lots of different metrics on every part of the system, so you can see fully what's going on; everything's transparent. A big thing about being open is that we're open source, and open for discovery of the data and the metadata.

When I'm done, it's very easy to clean up if I want to: delete my topics, delete my namespace, delete my tenant, shut down the Docker container, and we go home. You don't have to keep it running forever.

Lots of use cases: having one message platform for everybody is common. Ad tech loves us. Real-time fraud detection works out really well; there are some big credit card companies doing it. Connected cars get a lot of different data coming from a lot of parts of the car, and that works out great with Pulsar, as does doing real-time analytics on IoT, again a great use case. Microservices are really easy to connect, no matter the language you're using, so it's a nice way to do microservices and still connect them with other systems and feed your data any way you want.

There are lots of different apps you can build. Here I've got one for real-time air quality. Air quality data comes in from a couple of different producers; want a Spring Boot app, want Apache NiFi, get it from different sources into different topics. I've got a function that takes all that data, cleans it, normalizes it, then splits it out into the different types of air quality readings, whether that's PM2.5, PM10, or ozone levels. I push those into their own topics, and then a continuous Flink SQL app runs, looks at them, and sends the results into another topic or aggregates them. Spark takes batches of that and drops them into S3 files for other people to do analytics on. I've got links to everything you might want to try out.

But let's look at a demo; we'll see if everything has timed out, hopefully not. I've got a full rundown of the whole demo here. You can try out all the sensor parts, try setting up the Delta Lake sink, deploy it, run it, see the stats, look at the output as Parquet files, look at it in a Delta Lake shell, whether you're doing Scala or Python, run a Spark app against it, then create a Flink SQL app against it, a Presto SQL query against it, a Spark Structured Streaming app against it, and display it in a live dashboard using WebSockets and JSON.

Let's see. So this is the topic I'm sending to; there are a number of subscribers on it, and as you see, it'll tell me what they're doing. Let's start up the data so we can start seeing some things coming through. This is my IoT device, on a Pi, and I just want to show you the code; we've got a little script to run it. The Python app connects to some sensors and connects to Pulsar, and here's that record class: you can see it's got the names of the fields, the types, and whether they're required. That's all I need to do to build the schema. Pretty easy. So I grab all my sensors and whatnot, get some data, get some timestamps, build my ID, and from there I just populate a record and send it to Pulsar; a rough reconstruction of that loop is below. So I'm going to run it; it warms up those sensors (there are like four of them), and once they're up it starts sending to that Pulsar cluster. You can see it's formatting the data as JSON, and JSON data is coming into the system. We are live and going.
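The slide code isn't in the transcript, so this is a hedged reconstruction of that sensor loop, with made-up field names, a stubbed-out sensor read, and a JSON schema; the Record class and send call follow the Python client API:

```python
import random
import time
import uuid
import pulsar
from pulsar.schema import Record, String, Float, JsonSchema

def read_temperature():
    # Stand-in for real sensor I/O on the Pi.
    return 20.0 + random.random() * 5

class SensorReading(Record):
    sensor_id = String(required=True)
    ts = String(required=True)
    temperature = Float()

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/pi-sensors',
    schema=JsonSchema(SensorReading))

while True:
    reading = SensorReading(
        sensor_id=str(uuid.uuid4()),              # build a unique id
        ts=time.strftime('%Y-%m-%d %H:%M:%S'),    # timestamp the reading
        temperature=read_temperature())
    producer.send(reading, partition_key=reading.sensor_id)
    time.sleep(1)
```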
Let me show you, this is StreamNative Cloud, so you can see that schema. It translated the class for me automatically: it named the schema based on the class, put those fields in there, marked whether each one is nullable, all that type of fun stuff. And if I refresh over here, this is the open source tool, Pulsar Manager; that's my connection. This one is available to everyone; the StreamNative one is only if you're using that particular cloud. With Pulsar Manager I can look at all the tenants, all the namespaces under a tenant, and then find my topic. Mine was something with "sensor"... okay, it's this guy. Then I can see the throughput for the whole thing, see who's subscribed and who's got a backlog, and I can clear a backlog out or unsubscribe people if it's junk. Like, I don't know who this guy is, he's just wasting my time, so I could unsubscribe that one if I wanted, just to get rid of it if we need to.

We can also see there's a partition there. You have the option to partition every topic; underneath the covers, even if you have only one partition, it still exists. You specify no partitioning, there's still got to be one. Partitioning is something that lets you expand how many consumers you may have; this is typical if you come from the Kafka world, you know what's going on there.

Okay, so I have some data coming in from that system and I want to use it for an app. I have a Delta Lake connection here in Spark: I connected this up using the Delta Lake functions, connected to my object storage, which is getting dropped into by the sink, and I can see the schema for all the data coming in. That's going to look familiar; it's the same fields. And here I can just query that Delta Lake data, all those different files in that directory, which are a special kind of Parquet. I'll pick some fields, order the results, and show just five rows, with a maximum width of a hundred for the columns so these big fields fit. And I can see that recent data coming in, stored in my lakehouse.

Now, if I want to do some real-time analytics, I've got Flink connected here to Pulsar via the Pulsar catalog. Let me show you what this table looks like: this table is that topic. That's why we have that schema, so I can do this automagically. I'm just going to run a simple SQL query. If I wanted the results stored somewhere, I could do an insert into another topic or some other Flink catalog location; the easiest would be to just send it to another topic. So I run that SQL, it gets deployed as an application, and when I go to the Flink dashboard I can see it's already there. Since I have live data flowing, it starts displaying: as a new event comes in, about once a second, a new row shows up in my query, a continuous query. If this were an insert, the new records would land over there. Here's where I could do fraud analytics: I could wrap this up into a Scala, Java, or Python Flink app and have it do real-time analytics. And if I want to do things like aggregates, I can do an aggregate as well: that old job got cleaned up and the new one's deployed already, and you can see it runs with the source coming out of that topic, tenant, and namespace, and now I've got a GROUP BY on it; we're starting to get data there. There's also a way I can get more data: I could tell my table that I want the first record that ever came into the system, because remember, I can keep it forever. I didn't do that here; it's just processing them as they come in, just what's there now, but I could go back in time if I need to.

Now, I also have a WebSocket display of all this data running; I'll just refresh my page. This is a simple HTML page with some simple jQuery that calls WebSockets on Pulsar to get the latest data from that topic, and as you can see, it's refreshing. I've got an eventual timeout on this, because I'm not the best HTML person, but it makes it easy for you: you just connect to that Pulsar topic over WebSockets and you get the data, and if you want to send data the other way, you can do that too. I like this little library because I can sort things however I want; this is just sorted by the data that's coming in. Pretty straightforward, and I can do that with other data from Pulsar as well.
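For the dashboard side, here's a rough sketch of reading a topic over Pulsar's WebSocket API from Python instead of jQuery; it follows the ws/v2 consumer URL pattern, with a placeholder host, topic, and subscription, and assumes the payload arrives Base64-encoded as the WebSocket API documents:

```python
import base64
import json
from websocket import create_connection  # websocket-client library

# Pulsar WebSocket consumer endpoint:
#   /ws/v2/consumer/persistent/<tenant>/<namespace>/<topic>/<subscription>
ws = create_connection(
    'ws://localhost:8080/ws/v2/consumer/'
    'persistent/public/default/pi-sensors/dashboard-sub')

while True:
    frame = json.loads(ws.recv())
    payload = base64.b64decode(frame['payload'])   # message body is Base64
    print(payload.decode('utf-8'))
    # Acknowledge so the broker can move the subscription forward.
    ws.send(json.dumps({'messageId': frame['messageId']}))
```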
Let me send some data here. I've got Apache NiFi connected, sending some transit data, and I think over here I might have some weather data. Is this weather? Nope, that's the same place. I'll go get some weather data; it's nice to get fresh weather data in. So I'll grab all the weather reports in the United States and send them to Pulsar. We're sending a couple of files in, and they'll keep coming, so when I go to my screen here, I've got the transit data coming in from the New York area, a lot of it, and I've got all of my data coming in from the weather feeds, live, again via WebSockets.

I can also read that same data down here in Flink, like if I wanted to, say, look at the weather. Let's see, is that the right one? There are a couple of weather topics, "weather" or "pi-weather"; we'll look at the weather table. Yeah, I should probably try not to do this on the fly. Okay, that's a binary one, that's probably not the one I want; I probably want pi-weather. I keep changing my mind when I name these tables. But you can just select what you want from any of them, as long as they have a schema; if I forgot to set a schema, then it's not going to be there. Then again, you don't need a schema for things like JavaScript, because you just parse the JSON, if for some reason you don't want one. And as you can see, there's a lot of that data coming in, more than from those sensors; there's a lot of weather data in the United States.

We're running out of time, but I wanted to show you that I could also do this with real-time airline data, and with weather data, transit data, sensors, logs, whatever source you might want to use, very easily; these are all linked here. If you have any questions, I'll be around the whole conference, or email me or hit up my Twitter. I'm always looking to show people different ways to work with the open source tools I call the FLiP stack: Flink, Pulsar, NiFi, Spark, and all their friends. Thanks for coming. Hopefully you liked my talk, and I'll see you next year in person.