Hi, there we go. There's our camera. And can you share your screen?

Sure.

So good to see AngleSharp. Florian, we've talked over the years here, since AngleSharp joined the .NET Foundation. Big fan.

Thanks. I'm quite happy to share now what AngleSharp is about. For me it's like Christmas anyway, with the two best events of the year coming together: .NET Conf is happening globally and the Oktoberfest is just around the corner here.

How can you forget! Florian, the stage is yours.

Thanks. So without further ado, I will just jump into the topic, and hopefully I make it in time. The topic is AngleSharp, which aims to be a .NET headless browser. Let's see where we are. Before I jump fully into the water, a few words about myself. I'm a Microsoft MVP for Developer Technologies. I'm also a contributor to open source projects, mostly on GitHub of course, and I'm very enthusiastic about writing technical articles and speaking at events if time allows. I'm always happy to be invited and appreciate that. Professionally, I'm a solution architect at a small startup called smapiot, where I specialize in distributed web applications, and the web is also the topic of this session.

One remark: since this is an online session, I cannot see your faces, so excuse me for being either too fast or too slow. I hope the recording will at least make up for that a little. Otherwise, you can always reach me on GitHub or on Twitter if you have a specific question.

All right. Now, what is on our plate today? We will first start with a quick introduction: what is AngleSharp, and why could it be helpful for you? Since AngleSharp is an HTML parser, we will also need a small excursion into HTML5. We will do that by example.
I selected three examples that should illustrate why HTML parsing is not as simple as just writing a regular expression. Then we will go into the topic of extensions. This is one of the things that makes AngleSharp special, and we always try to improve our reach and what we can do with it. And finally, a small outlook on what the future of AngleSharp will bring.

So what is AngleSharp? As I already told you, it's a library for parsing HTML, but in fact it's a little bit more than that. What do we need in order to parse HTML5 correctly? There's the core specification, but there are actually all kinds of specifications, and if you really want to get a grasp on what's on the web, you should not consider just one of them. All the side specifications also play a very important role, for instance this one little thing called JavaScript. Many websites are unfortunately only accessible these days if you also have a JavaScript engine running, or part of the information may only be displayed, or may only make sense, in conjunction with special CSS rules. So there are all these different technologies that come together and form what we call the web today. If we really want to get access to that information, we also need a browser engine that's capable of handling them. Of course your standard browser can do it, but can you do it in .NET without any, let's say, RPC calls or whatever? That's the mission of AngleSharp.

What we have is a full stream processing unit, like a standard web browser has, which means we do not stop when bytes arrive. We always evaluate them. We don't wait until all the content has been received.
We do that on the fly. We have a collection of web utilities for that, most notably in the encoding space, but we also have, for instance, our own URL parser. That's because the Uri type in .NET unfortunately is not capable of parsing all the valid URLs that are out there. The URL parser that we included follows the specification to the letter and can therefore handle all the cases.

What else can we do besides HTML and CSS? I already remarked that JavaScript plays an important role. Yes, we have a solution for that. Unfortunately, at the moment it can only deal with simple JavaScript. The vision is of course to improve the reach here. And this is where customizations and extensions come in, because at the end of the day we don't want to create one giant monolithic library. We actually want to foster a whole ecosystem of plugins where everyone can say: this is one of the things AngleSharp cannot do, or doesn't do well at the moment, but I can customize this and come up with my own plugin.

Looking at the history of the project, it all started, I think, around 2012. It was actually after an MVP Summit. I was on the plane and thought, let's write an HTML5 parser. I mean, what else could be done on a plane? Honestly, the project had a different angle, so to speak, back in the day, but I realized that an HTML5 parser is one of the things missing in the .NET ecosystem. There had been other HTML parsers, but none of them followed the HTML5 specification to the letter at that point in time. In 2013 I put it out on GitHub, and the initial reaction was quite good. I was really surprised.
So there seemed to be some kind of demand, and I kept going. In 2015 we had an important milestone with integrated extensibility. We demonstrated that a scripting engine can be brought in, which was quite cool, because suddenly you had not just a static document object model, a static representation of the page you're dealing with, but one that could actually be brought to life, which is what JavaScript, for instance, brings to the table. From that point on there were a lot of bug fixes, a lot of improvements, and a major refactoring of the end-user API. The milestone here is version 0.10. Right now we are at 0.13, and we want to hit the 1.0 milestone. The breaking changes that have happened since 0.10 are all quite minor, but still breaking in one area or another. From 0.9 to 0.10, that was a huge change. There we really made a drastic change of direction, but I think it was for the better, and the ecosystem now also lives on top of version 0.10 and, of course, its successors.

Now a short glimpse at what parsing HTML looks like. It's like any kind of parsing, so we could draw pretty much the same picture even for programming languages like C#. There, of course, you may have what is called a back end, with optimizations and code emission happening. That's not the case here, but nevertheless the parsing stages alone, the front end, are pretty much the same. We start with a stream of just bytes, and they are interpreted in a special way by a preprocessor, which also does some sanitization. In the end, the goal of this preprocessor is to give us characters that we can work with. Then a tokenizer comes into play.
Now, the tokenizer takes a bunch of characters and says: that's a valid token, for instance an opening tag, a closing tag, an attribute, text, or a comment, all these different kinds of building blocks that we have. Up to this point we are still in the linear phase: we started with a linear stream of bytes, we got a linear stream of characters, and now we have a linear stream of tokens. So where does the tree, the document object model, come from? That's done by the tree constructor. It is fed the tokens coming from the tokenizer, and this is where the semantic information comes in: I've seen that opening tag, so I can now close it, that's valid; or: no content is allowed here, so I will just place it on a sibling element. All these things happen in the tree constructor, and then we have a dynamic object model, which is called the DOM.

At the end of the day, all of this is included in AngleSharp. You don't need to do anything yourself. You just present, for instance, a stream to AngleSharp, and AngleSharp does all the rest. At the end you get, for instance, an IDocument instance, and with this instance you can play around: you can serialize it back, which is what we will see, you can append new elements, or you can get further information out of it.

Let's look at some examples to learn a little about what makes HTML5 so complicated. First, a very simple piece of HTML code. What you should recognize here is that I left out some of the, let's say, standard elements: we don't see an html tag here, we don't see head, and we also don't see body. Well, that's not an error.
That's actually a valid HTML5 document, and there are big sites, for instance the Google error page, that use exactly these rules to save a few bytes here and there. So what a valid HTML5 parser should do is insert these things for us. It should automatically insert an html opening tag, it should insert a head opening tag, it should also close the head, and when the magenta code is reached, it should create a body element for us. All these things should just happen automatically. That's by the specification. AngleSharp does that, as we will see in a second.

Now a second example. Where it gets a little tricky is that there are some special scoping rules in HTML. For instance, we could be in an unordered list and enter the scope of a list item. If we are in a list item, we can just write another list item, and the former one is implicitly closed. That's all by the specification. A simple parser may not like this, because here the tree constructor needs to have all the additional logic. As an example, if we look at Razor for writing views in ASP.NET Core MVC, we will recognize that we need to close the list item. That may have been a good design choice for performance reasons, but on the other hand it limits the output that can be generated, because unfortunately you can never output HTML like this using Razor. The same rule that applies to a list item also applies, for instance, to a paragraph. There are multiple such cases. Again, these are not errors. These are not even warnings. This is just valid HTML, and the automatic closing just happens for you.

A third example I want to give is in the table space.
Tables are one of the most complicated parts of the HTML5 specification, because so much can go wrong and essentially every edge case is handled. There's another area, which has to do with formatting elements, but since formatting elements are more or less a legacy thing, especially with the edge cases described there, I will just focus on the table with this simple example. What I brought in here is that some elements have been inserted, the magenta ones: we have a line break (br) here and we have an iframe, both directly inside the table. Note also that there's no tbody element, which is also something that needs to be inserted automatically by the HTML5 parser. In addition to these, let's say, misplaced elements, we also have an invalid closing tag. It's not even an invalid tag name; web components, Angular, or any kind of SPA framework these days use custom tags, so that's no problem. And then we also have, in green, the table row, which is, let's say, an orphan here. It needs to be placed in the table itself and it isn't, so what should we do with that?

So let's have a look at all three of these cases in a demonstration of AngleSharp. All the demos you'll see today are available on GitHub. The URL is on the screen.

All right, for the first demo, let me briefly explain what the AngleSharp part does. What we do here is create a new so-called browsing context. You can think of a browsing context as something like a tab in your standard browser. It's one instance where a page can live. What makes a browsing context special is that you can configure it. You can tell it what it can and cannot do, like when you tell the current tab in your browser that it is not allowed to run JavaScript. You can do that here too. So we get a new browsing context.
We don't specify any configuration, which means it's the default one, and then we open a new page. As the stream may be evaluated asynchronously, we need to do that in an async method, but luckily C# has us covered here. For simplicity we don't use some remote source. We use the small snippet at the top here, and we supply the content via what is called in AngleSharp a virtual response. You don't have an address where your page lives, or anything like that, so you just construct what the response to a request would look like. We say that our response has the following content, this source, and that it also comes from a certain address. The address is completely optional. I just included it here because we will see it appear in the document object model as the base URI, and base URIs are very important because they give relative URLs the base that's required for resolving them.

All right, when we do that we end up with this IDocument instance. To illustrate that AngleSharp did everything right, we serialize it back to an HTML string using the ToHtml method. If I run the code, you see the output. It's pretty much the same document that I put in, except we suddenly get the html element, we get the head, we also close the head, we open the body, and at the end of our whole HTML document we also close everything that was still open. That's all done by AngleSharp.

For the second example we use the same code, just with a different HTML snippet. We should see that the list items are closed before we open a new one or, obviously, before we close the unordered list, and the same applies to the paragraphs, which also need to be closed properly. So let's just run it, and we see the same behavior as in example one, except that now, in addition, we see all the scoping evaluated correctly. So far so good.
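The steps of the first two demos can be sketched roughly as follows. This is a minimal sketch, not the code shown in the talk: the markup, the class name, and the address `http://localhost:5000/sample` are made up for illustration, and it assumes the AngleSharp NuGet package (0.13 or later) is referenced.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

static class OmittedTagsDemo
{
    // Hypothetical markup: no html, head or body tags,
    // and no closing tags for the li and p elements.
    private const string Source =
        "<!doctype html><title>Sample</title>" +
        "<ul><li>First<li>Second</ul>" +
        "<p>One<p>Two";

    public static async Task Main()
    {
        // No configuration is passed, so the default configuration is used.
        var context = BrowsingContext.New();

        // Virtual response: the content is supplied directly. The address is
        // optional (and invented here); it becomes the document's base URI.
        var document = await context.OpenAsync(res => res
            .Content(Source)
            .Address("http://localhost:5000/sample"));

        Console.WriteLine(document.BaseUri);

        // Serialize the DOM back to markup to inspect which tags the parser added.
        Console.WriteLine(document.ToHtml());
    }
}
```

Comparing the printed markup with `Source` is a quick way to see the implied html, head, and body elements and the implicitly closed list items and paragraphs that the talk describes.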
Let's also have a look at the third one, and here let's also debug what the document actually looks like. Before we write it to the console, let's see what's in there. It could be a little bit too small, so I will make it a little larger. I hope Skype plays along here. What we have are all these capabilities that will look very familiar if you know the document object model API from JavaScript. For instance, we have an All property that contains all the elements in here, and we can iterate over them. There could be a cookie, for instance, and we see of course our base URI, which was applied successfully. All these things are there, and it's a full document object model, just done in C# without any remote procedure calls to Chrome or any other evergreen browser.

Now regarding the output: that's what we expected. The table row that was outside of the table was completely omitted. Otherwise, we see the standard construction happening. The line break and the iframe have been pulled in front of the table, and we see the insertion of the tbody. All of that is done for us by AngleSharp. It would, of course, have been done by Chrome or Firefox or any other evergreen browser out there too. This is what the specification dictates.

Okay, now let's talk about extensions for a moment. What you saw is, in a nutshell, what the core of AngleSharp can do for you.
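Looking back at the third demo, the inspection of the parsed table might be sketched like this. The markup is an illustrative stand-in for the slide (a br and an iframe directly inside the table, no tbody, a closing tag without an opening tag, and a tr after the table), and the class name and the `custom-tag` name are hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

static class TableDemo
{
    // Hypothetical markup with misplaced elements inside and after a table.
    private const string Source =
        "<table><br><iframe></iframe><tr><td>Cell</td></tr></custom-tag></table><tr></tr>";

    public static async Task Main()
    {
        var context = BrowsingContext.New();
        var document = await context.OpenAsync(res => res.Content(Source));

        // document.All mirrors the DOM property of the same name
        // and enumerates every element of the parsed document.
        foreach (var element in document.All)
        {
            Console.WriteLine(element.LocalName);
        }

        // Serialized body, to see where the parser placed each element.
        Console.WriteLine(document.Body.InnerHtml);
    }
}
```

Loading the same markup in a browser and inspecting it with the developer tools should show the same tree, which is the point the talk makes about following the specification.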
The core really makes sure that whatever the HTML on the page looks like, it is interpreted 100% as a real browser would do it. That's of course important, because you don't want to end up with something that's not what you would expect from, for instance, debugging it in Chrome. Now, AngleSharp core doesn't deal with JavaScript, and AngleSharp core also doesn't deal with CSS, but luckily there are these plugins. The ecosystem looks like this: we have the base layer, AngleSharp core, providing the common utilities, and on top of it we place useful libraries, for instance AngleSharp.Css, which deals with the CSS object model. Here, too, we try to be fully W3C conformant, which means that whenever they came up with a spec for how an API should look, we follow that spec. So it's not only about behavior, it's also about what the API looks like. That should give you a learning advantage, because if you already know it from JavaScript, you can apply it directly in AngleSharp, and if you know it from AngleSharp and someone later asks you to do it in JavaScript, you can apply the knowledge there as well. In my opinion, this two-in-one thing is always great to have.

We also have AngleSharp.Io, which I will demo in a second.
This brings additional IO capabilities like requesters or cookie providers. And then we have libraries that are either in an experimental stage, AngleSharp.Js is one of these, or just planned. AngleSharp.Media could be one of these, which would then also support certain kinds of streaming capabilities. It could be quite cool if you say: I've got this site with a video stream on it, I can log in, and then suddenly I can bring this video stream to, I don't know, WPF. That would be quite awesome. We're not there yet, but that's part of the vision.

All right, let's have a look at persistent cookies using AngleSharp.Io. A little background to this demo: I have a web server running locally, a really simple one. There's a page that needs a login to display a secret. Pretty much all the login mechanisms on the web these days work in one of two ways. Either there are some APIs, and then we are on the safe side anyway if we have a JWT or anything like that, or we have cookie-based authentication. For most of the sites that are relevant for AngleSharp, that's the case. So when we have this cookie problem, we potentially need a cookie solution, right? Out of the box, AngleSharp already brings a cookie provider, but that's based on the CookieContainer of .NET, and that has several disadvantages. Most notably, it doesn't work with all the cookies out there on the web. It may crash, or it may complain that there's a date format it does not recognize. So in one of the extension libraries, called AngleSharp.Io, we created a cookie provider following the official specification, and we even went further. There are two ways of using it. One way is in memory, where you say: when the application closes, the cookies are lost, and that's fine. The other way is to persist them however you want to, by default on the local file system. That's what I will show.
So we use a custom configuration now. That's done like this: you have the Configuration class, and we say we start with the default one and then add additional capabilities. We add the persistent cookie capability, and we say it needs to store the cookies somewhere: the sync file path points to a demo cookie file in My Documents. We also add additional requesters, like an HTTP requester that's based on HttpClient. That's just more modern than the one that comes with AngleSharp out of the box. And then we tell AngleSharp that it is allowed to actually make requests to the network, so we say WithDefaultLoader.

Now when we run this thing, what will happen is, let me just switch to the authentication demo. We will have different stages. Oh, sorry, I still need to remove this little file. So let's run it again. Sorry for that. We have different stages. We start out not being logged in, so the page looks like this: you need to log in to obtain the secret. Luckily, we have a login link here, so we navigate there. That page contains a form. We fill out the form with the user and the password and press login. It's all done, and then we are automatically redirected, and here we see the secret. So obviously, Bruce Wayne is Batman. What? But yeah, that's just how the world works.

And this was the code that we used in C# with AngleSharp. We say: okay, we can use a query selector, and if the login link is there, we navigate to it. Then we submit the form, and then we are logged in. Now if I run it again, this cookie file has been created, so we are already logged in, and we see the secret directly.
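The configuration and login flow just described might look roughly like the sketch below. It assumes the AngleSharp and AngleSharp.Io packages, and that AngleSharp.Io exposes `WithPersistentCookies` and `WithRequesters` as configuration extensions; check this against the package version in use. The server address, the file name, the CSS selectors, the form field names, and the credentials are all placeholders, not values from the talk.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Io;

static class CookieDemo
{
    public static async Task Main()
    {
        // Placeholder location for the persisted cookie file.
        var syncFilePath = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
            "demo.cookie");

        var config = Configuration.Default
            .WithPersistentCookies(syncFilePath) // AngleSharp.Io: cookies survive restarts
            .WithRequesters()                    // AngleSharp.Io: HttpClient-based requester
            .WithDefaultLoader();                // allow network requests for documents

        var context = BrowsingContext.New(config);

        // Placeholder address of the locally running demo server.
        var document = await context.OpenAsync("http://localhost:5000/");

        // If a login link is present we are not logged in yet (hypothetical selector).
        var loginLink = document.QuerySelector<IHtmlAnchorElement>("a[href*='login']");

        if (loginLink != null)
        {
            document = await loginLink.NavigateAsync();
            var form = document.QuerySelector<IHtmlFormElement>("form");

            if (form != null)
            {
                // Hypothetical field names and placeholder credentials.
                document = await form.SubmitAsync(new { user = "demo", password = "demo" });
            }
        }

        Console.WriteLine(document.Body?.TextContent);
    }
}
```

On a second run the cookie file already exists, so the login link should be absent and the branch is skipped, which matches the behavior shown in the demo.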
The output reads "I'm Batman." Great. The reason it works is that we have this file, which follows the old Netscape cookie file specification and includes all the different cookies that have been used for the localhost domain. I will just move that over.

All right, going into the final stages, I also want to show you JavaScript before wrapping up. JavaScript, as I said, is also just a library, in an experimental stage. We only need to use WithJs, and then we can apply some simple JavaScript. Let's just do the demo before we run out of time. It's pretty much the same thing. What we will change is the document title, because right now that's "Sample", and it will be changed to "Simple manipulation". What we will also do is write out this special span element. So let's just run it. What we can see in the serialization is that the title changed, it's now "Simple manipulation", and we got this span. Great. So this simple JavaScript was applied and evaluated correctly. Again, AngleSharp.Js is experimental, and you will not be able to run, let's say, large-scale single-page applications with it at the moment.

Okay, next steps. Obviously shipping 1.0 is very important, improving AngleSharp.Js is super important, and then refining parts of the ecosystem like AngleSharp.Css, or bringing up new additional libraries. Wasm, for instance, is a thing; especially with Blazor around the block, that could be really interesting. AngleSharp.Media I already mentioned, but things like AngleSharp.Renderer could also be quite interesting, especially if you want everything to be just managed code: I don't need a web browser for displaying HTML. That could open a lot of interesting use cases, in my opinion.

We are always looking for contributors.
I would much appreciate it if you had a look. Any kind of contribution, be it finding a bug, fixing something in the documentation, or discussing how the API could be improved, would be superb. I appreciate all your time. You'll find more information about the project at anglesharp.github.io, and you can always send me a tweet or reach us via, for instance, our GitHub chat. Thanks a lot.

All right, Florian, that was great. Thank you so much. I tried parsing HTML with regular expressions back in the day. It's madness.

It works for the simple cases, though, but once you don't know what you are receiving, you're just out of luck.

I guess HTML has so many quirks, as you showed, and we're thankful that you're doing the hard work so we don't have to. Appreciate it.

All right. Well, thanks so much, Florian. Have a good day, and enjoy Oktoberfest.

Thanks. Come over, guys, plenty of space here.

Alrighty. Take care. Bye.