team manager at Balabit, the upstream developers of syslog-ng, doing packaging, support, and advocacy for syslog-ng. First, let me give you a short overview of what I'm talking about today: a quick introduction to logging and syslog-ng; next, some of the most interesting features for dealing with security logs, like message parsing, enriching messages, and so on. And once I have introduced you to the basics of syslog-ng configuration, I'll be talking about scaling syslog-ng and some of the security-related analysis possibilities, like creating a heat map, anonymization, and other things.

So first of all, what is logging? It's the recording of events on a computer. In the case of Linux, the logs usually live under the /var/log directory, and there you will find messages similar to the one on screen. This one is from an SSH login.

So what is syslog-ng? It's an enhanced logging daemon with a strong focus on high-performance central log collection. But why central log collection? First of all, ease of use: you have just one machine to check for your log messages instead of many when you are looking for information. Next, availability: even if a machine is down for some reason, you can still check its log messages at the central location and figure out what happened and why it's not accessible. And last but not least, since we are in the security track, it's also about security. The first thing people try to do when a machine is compromised is to remove the traces of the compromise. But if log messages are pushed off the machine in real time, there is no way to get away without leaving traces.

Next, I would like to talk about the main roles of syslog-ng. First of all, it collects log messages. It can also process them, filter them, and at the end store them somewhere, either locally or forwarded over the network.

Let's start with log collection. syslog-ng can collect both system and application logs together, and they provide quite good contextual data for each other when you are debugging or looking for a problem. syslog-ng is quite platform-independent across Linux and Unix platforms, so it can read from many platform-specific sources like /dev/log, the journal, sun-streams, and so on. As a central log collector, it can collect messages over the network using the legacy or the new syslog protocol, over UDP, TCP, or encrypted connections. And there are many other possibilities for log collection: files, sockets, pipes, or even application output, if that application is started by syslog-ng.

The next role is processing. It's not mandatory, but in my opinion this is the most important part of syslog-ng. You can classify, normalize, and structure log messages using the built-in parsers; I will talk about these in detail later on. You can rewrite log messages, and I'm not talking about falsifying log messages here, but rather, for example, anonymization if you have compliance requirements. You can also reformat log messages using templates: if your destination needs a specific format, you can change how the date looks, use JSON format, and so on. And you can enrich your log messages using GeoIP, or create additional fields based on the message content.

The next role is data filtering, which has two main uses. First of all, you can discard log messages: for example, you don't need debug-level messages forwarded to your SIEM. The other one is message routing.
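As a minimal illustrative sketch (source and destination names like s_local and d_siem are hypothetical), these two uses could look like this in the configuration:

    # discard debug-level messages
    filter f_no_debug { not level(debug); };

    # forward only authentication-related events to a SIEM
    filter f_auth { facility(auth, authpriv); };
    log { source(s_local); filter(f_auth); destination(d_siem); };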
With message routing, you can make sure that only a small set of events is forwarded; for example, only authentication events are forwarded to a SIEM system. There are many possibilities for configuring filtering in syslog-ng. It can be based on message content or on different message parameters. You can use comparisons, wildcards, regular expressions, or filter functions, and best of all, any of these can be combined using Boolean operators.

The last important role is to store log messages somewhere. Traditionally, log messages were stored in flat files or sent to a syslog server. Later on, support for different databases was added. And during the past two years, we have added support for different big data systems: distributed file systems like Hadoop, NoSQL databases like MongoDB, Elasticsearch, and messaging systems.

Here comes a tricky question for you. What do you think: which syslog-ng version is the most used one? The project started almost 20 years ago. Red Hat EPEL has version 3.5, and the latest stable release came out just a couple of months ago. So what do you think, which version is the most used? Sorry? The second one, right? Yes, it's a tricky question. Well, 1.6. It's running on over 100 million e-book readers, and I don't think any Linux distribution has a larger number of installations than the Kindle. So it's 1.6. For the rest, I think 3.5 is the most used.

Back to more serious topics. Let's talk about log messages: what do they look like? If you look under your /var/log directory, you will see that most log messages have the following format: they start with a date, next comes a hostname, and finally some text. The text part is usually an English sentence with some variable parts in it. Here you can see an SSH login, my favorite example: you see "Accepted", then the authentication method, the user name, and the source IP address, in a nice long English sentence. It's quite easy to read for a human, and it was originally meant to be read by a human. But if you want to create reports from your log messages, it's quite difficult to process these messages with scripts.

The solution to this problem is structured logging. In this case, events are represented as name-value pairs instead of free-form text messages. For example, you can describe an SSH login with an application name, user name, and IP address. The good news is that syslog-ng has had name-value pairs inside from the beginning: date, facility, priority, and so on, everything is represented as name-value pairs inside syslog-ng. This is what makes it possible to create templates for reformatting messages and to do filtering. And in recent syslog-ng versions, starting with 3.0 almost eight years ago, parsers were added to syslog-ng which make it possible to turn unstructured data, and some structured data, into name-value pairs.

Why is this important? One of the parsers is the JSON parser, which can turn JSON-formatted messages into name-value pairs. Normally, when syslog-ng receives a log message, it treats the message as text and doesn't do anything with it. If you run the JSON parser on it, then you can use the different fields of the JSON message for filtering, store just a subset of the fields, or forward the fields to a database and drop the whole message; and if you want to change or rewrite a message, you'd better have it parsed first.

The next one is the CSV parser. It can parse not just CSV files, but any kind of columnar data. The most typical example is Apache access log messages.
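As an illustrative sketch of the kind of configuration the next slide shows (the field names follow the Apache Common Log Format; the paths are hypothetical), the access log is split into name-value pairs and the parsed user name is used in a template:

    parser p_apache {
        csv-parser(
            columns("APACHE.CLIENT_IP", "APACHE.IDENT_NAME", "APACHE.USER_NAME",
                    "APACHE.TIMESTAMP", "APACHE.REQUEST_URL", "APACHE.REQUEST_STATUS",
                    "APACHE.CONTENT_LENGTH")
            delimiters(" ")
            quote-pairs('""[]')
        );
    };

    # one log file per authenticated user, named after the parsed user name
    destination d_per_user {
        file("/var/log/apache/${APACHE.USER_NAME}.log");
    };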
If you take a look at the configuration on the screen, you will see the Apache Common Log Format field names in the upper part of the screen. And at the bottom of the screen, in the second line from the bottom, you will see a template in the file name, where the user name parsed from the access log is used to create a separate file for each of the authenticated users.

The key-value parser was added to syslog-ng quite recently, with version 3.7. It can find key-value pairs inside log messages. The most typical examples are firewall log messages. On screen you see a few messages from iptables, but many other firewalls have a very similar format. If you use the key-value parser, you don't have to write long parser descriptions; you just parse the message with the key-value parser and use the resulting fields from your log messages afterwards.

The next parser is the PatternDB parser, which can extract information from unstructured messages into name-value pairs. And not only that: it can also add status fields based on the message text, and just like logcheck, it can do message classification as well. The downside is that you have to know your log messages beforehand to use it: it needs an external database describing your log messages, and it can only analyze logs which are included in this database. On the other hand, it's very fast and efficient, and requires a lot fewer resources than, for example, analyzing log messages using regular expressions.

Here you can see an example, back to my favorite SSH login example, in this case a failure. In the first line you can see information parsed out of the log message, like sshd, the application name, the user name, and the source IP address. Next, a few status fields based on the message text were added: that it's a login action and the actual status is failure. Based on this information, the message can be classified as a violation. You usually run this at the collector, but you can practically put it anywhere. Here you can see a pattern, actually just a very small part of one, but it's the most important part, the part which describes a log message. There are some fixed parts in it, like "Failed", and then this one here is a parser, where you give the name of the name-value pair to create.

If none of these parsers fits your needs, you can also write your own parser. We have had the possibility to write parsers in C for a long time. A module for writing parsers in Rust, with a regular-expression parser similar to PatternDB written in Rust, was added about two weeks ago, so it's not yet released, but it's already merged on GitHub. And you can write parsers in Python. Python is not as fast; on the other hand, it doesn't need any compilation.

You can enrich your log messages by adding additional name-value pairs based on the message content. I already mentioned PatternDB, which can do this. We also have a GeoIP parser, which can find the geolocation of IP addresses. Originally it could only add the country name; recently we changed it so that it can also add longitude and latitude information for IP addresses. There are many good uses for this with security logs. You can detect anomalies: if the same user logs in from one country, from your office, and then from another country right after, then there is some trouble; teleportation still doesn't work. You can also use it to display your log messages on a map. With the last release, we also added the possibility to add metadata from CSV files, for example a host role or a contact person for a host.
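A quick sketch of how that enrichment might look, using the add-contextual-data() parser from recent syslog-ng releases (the file name and fields are made up):

    parser p_metadata {
        add-contextual-data(
            selector("${HOST}"),
            database("/etc/syslog-ng/context.csv")
        );
    };

    # context.csv: selector,name,value -- one line per attribute, e.g.
    #   web1,role,webserver
    #   web1,contact,alice@example.com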
Metadata like this can speed up incident response, or make your alerts and dashboards a lot more accurate than would be possible without it.

My personal favorite is the in-list filter. Some people also call it the poor man's SIEM, which is a bit of an overstatement, but it's a filter based on black- or white-listing: you can compare a field against a list of values from a text file. And you can use it, for example, with the many databases on the internet, updated regularly, of known spammer IP addresses or of malware command-and-control IP addresses: using the in-list filter, you can check your logs in real time to see whether you are affected by any of these known IP addresses. But there are many other uses; you can filter based on application name or just about anything where you need a list.

The next topic I would like to talk about is configuring syslog-ng, and my first advice is: don't panic. I often meet people who say: yes, syslog-ng is nice, but I looked at the configuration and it scared me away. Actually, the syslog-ng configuration is quite simple and logical. You just need to take some time to look at it and understand its pipeline model, which has many different building blocks, like sources, destinations, filters, parsers, and so on. At the end, you connect all of these pieces together using log statements.

So here is how syslog-ng.conf starts. You need to declare a version number before you write the rest of the configuration. You can use includes to pull in external configuration files; by default, scl.conf is included. SCL stands for the syslog-ng configuration library: there are many configuration snippets already available which you can reference from your configuration to simplify it quite a lot, for example a nice, long regular expression for masking credit card numbers. You can use comments in your configuration wherever you want, to make it a lot easier to read and understand. Here you can also define some global options; many of these can be overridden in later parts of the configuration.

Here we define a couple of sources. The first is a network source collecting UDP syslog messages on all interfaces on the standard port 514. The other one is for local log messages: internal() stands for syslog-ng's own internal messages, and system() covers the platform-specific local log sources. If you want, you can explicitly specify the journal or whatever you want here, but if you use the system() source, you don't have to care whether you will configure syslog-ng on Linux, FreeBSD, or AIX: it hides away the platform-specific details, and you can use the same configuration on any of your machines.

Next, we define a couple of destinations. The first one is a simple file destination for local log messages. The other one is a bit more complex; it's for Elasticsearch, where you have to define an index name, a type, and a cluster name, plus a template, where you specify which name-value pairs you forward to Elasticsearch. Elasticsearch expects JSON, so here we use the format-json template function to send messages.

Next, we define a couple of filters and parsers. The first one is simple, just discarding any debug-level messages. The other one is a bit more complex, using four different filter functions combined with Boolean expressions; this is the typical one for local log messages. And at the bottom, we define the PatternDB parser and load the PatternDB database. Finally come the log paths.
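Putting the pieces just described together, a minimal sketch of such a configuration could look like this (index, cluster, file, and pattern database names are all illustrative, and the Elasticsearch connection options are trimmed):

    @version: 3.9
    @include "scl.conf"

    source s_net   { network(transport("udp") port(514)); };
    source s_local { system(); internal(); };

    destination d_file { file("/var/log/messages.log"); };
    destination d_elastic {
        elasticsearch2(
            index("syslog-ng")
            type("messages")
            cluster("es-cluster")
            template("$(format-json --scope rfc5424 --scope nv-pairs)")
        );
    };

    # a typical filter for local messages, and one discarding debug level
    filter f_messages { level(info..emerg) and not facility(auth, authpriv, cron, mail); };
    filter f_no_debug { not level(debug); };
    parser p_patterndb { db-parser(file("/etc/syslog-ng/patterndb.xml")); };

    # the two log paths described below
    log { source(s_local); filter(f_messages); destination(d_file); };
    log {
        source(s_net); source(s_local);
        filter(f_no_debug);
        parser(p_patterndb);
        destination(d_elastic);
    };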
The first one is simple: we select the local messages using the messages filter and then send them to the flat file, the messages log file. The other one is a bit more complex. Here you can see that two different log sources are used, the network source and the local messages. Next, the filter discards any debug-level messages, the rest of the messages are parsed using PatternDB, and after parsing, the log messages are sent to Elasticsearch.

Let me show you that it was worth doing all of this long configuration. This is a screenshot from Kibana, an older one. In the upper right corner, you can see IP addresses parsed by syslog-ng from SSH logins, sorted by the number of appearances.

As your organization grows, the number of log messages you collect increases. If you don't take care of scaling syslog-ng, you can soon lose part of your log messages, and your log analysis won't be as good as it should be. So it's quite an important step to scale syslog-ng with the amount of logs you receive. Initially, people have many clients sending log messages directly to a central server, which works quite well for a while. But if you have too many clients, parsing will overwhelm your server. Also, if you have UDP log sources, like many network devices, you can run into trouble if the central server is far away from the UDP source.

That's why we recommend a client-relay-server architecture. You can distribute some of the processing to the relays instead of the central server, and in some cases you can also do part of it on the clients. Also, to make sure that you don't lose UDP-based messages, you should put a relay as close as possible to the device sending the UDP messages. This way, you add reliability to your logging architecture: if you use relays, you can collect your log messages even if the central server is down for some reason. And going back to the central log collection slide: with a relay, log messages are still sent off the client machines in real time, even if the central server is down.

You don't have to forward everything everywhere; with log routing, you send the right logs to the right places, and only the right logs. It's based on filtering, as I mentioned in the beginning, and message parsing can greatly enhance the accuracy of log routing: for example, you can make sure that only parsed login events reach your SIEM. Using log routing, you can optimize your SIEM and log analysis tools, which are usually licensed based on the amount of log messages: you save money by making sure that only relevant messages reach these log analysis systems.

Another interesting topic is anonymizing log messages. There are many regulations and requirements about what can and what cannot be logged. Yesterday we had a very good discussion about PCI DSS and credit card numbers, and there are many privacy regulations under which IP addresses and user names are often not allowed to be logged. One possibility for locating sensitive information is using regular expressions. It's quite slow and resource-intensive, but anything can be found this way, credit card numbers or IP addresses for example, and it also works on unknown log messages. On the other hand, if you use PatternDB or the CSV parser, these work only on known log messages, but they have the advantage of being very fast and not needing as many resources as regular expressions.

When it comes to the anonymization itself, there are two possibilities: you can overwrite the sensitive information with a constant, or you can overwrite the original with a hash. This matters if you want to analyze your log messages afterwards, for example to follow sessions. If a sensitive part is overwritten with a constant, you cannot follow the sessions. If there is a hash, you don't know what was there originally, but you can still see how long a session took: whether it was just two clicks on the website, or someone browsing many web pages.
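As an illustrative sketch of the two approaches (the regular expression is deliberately simplified, the usracct.username field is hypothetical, and $(sha1) assumes a recent syslog-ng with the crypto template functions):

    # overwrite card-number-like digit runs with a constant
    rewrite r_mask_cc {
        subst("[0-9]{12,19}", "<CC-REMOVED>", value("MESSAGE"), flags(global));
    };

    # overwrite a user name with its hash: the original is hidden,
    # but identical values still match, so sessions can be followed
    rewrite r_hash_user {
        set("$(sha1 ${usracct.username})", value("usracct.username"));
    };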
Here is a heat map. This is from my home router in a quiet suburb of Budapest, and it took just 15 minutes to collect this many external connections. These are connections I didn't initiate; someone around the world did, which is, let's say, quite scary. You can do this yourself if you have iptables or any kind of firewall logging through syslog-ng, together with Elasticsearch and Kibana. Here we used the GeoIP parser. Actually, first we used the key-value parser: we take the incoming iptables logs, and the first part uses the key-value parser, which finds the source IP address of the connections, among many other fields. Then we used the GeoIP parser to locate where the connection was coming from. We also have a rewrite rule, because Kibana expects geolocation in a specific format. At the end, the log is sent to Elasticsearch. If you want to do it yourself, my slides will be uploaded after the presentation, and I also have a blog post about how to implement it.

A few words about an interesting application still under development: ELSA, Enterprise Log Search and Archive. It's based on syslog-ng, backed by Elasticsearch, with a web interface written in Python. It's still far from production-ready, but it's a very promising development. It's created for network security analysts: it has many parsers for different firewalls and intrusion detection systems, and it has a couple of tools helping security analysts, like easily looking up whois information or blacklists and whitelists from the web interface.

On to new features in syslog-ng: we had a release last summer and another right before Christmas. Here I picked just a few interesting ones. The first is disk-based buffering, which arrived last summer. If you want simple correlation: previously it was only possible using PatternDB, but now that we have many parsers within syslog-ng, there is a new correlation engine called grouping-by, which works independently of PatternDB and can use name-value pairs from any of the different parsers within syslog-ng. Support for Elasticsearch 2, and quite recently version 5, was added. We have an HTTP destination as well, and many performance improvements.

Finally, I would like to summarize the benefits syslog-ng brings to a larger, security-focused environment. First of all, high-performance, reliable log collection. You can simplify your logging, as syslog-ng can collect both system and application logs. Logs are easier to use, as the messages are parsed and ready to use. Filtering can also lower the load on the rest of your logging infrastructure. Our central hub of information is syslog-ng.org. Our source code is on GitHub, which also hosts our issue-tracking system.

Any questions? Will the slides be posted? Yes, I plan to post my presentation if the Wi-Fi allows, and it will also be posted on the SCALE website. How does syslog-ng compare to other log analysis tools? Which one do you mean? OK, how about Logstash? Logstash, yeah, I get this question quite a lot.
Over the last few days I have answered it a lot: yes, syslog-ng can replace Logstash. We sometimes say that it gives you an ESK stack instead of an ELK stack. The benefit is lower resource usage. syslog-ng is implemented in C; the Elasticsearch driver itself is in Java, but only that part is written in Java, the rest is written in C. And syslog-ng can handle a lot more log messages on the same hardware than Logstash, so that's a major benefit. Both can process log messages, I mean, both can do filtering and parsing and so on, but syslog-ng does it with fewer resources. OK, thank you very much. And one more thing: if you would like to have a T-shirt like mine, come to booth number 713; we still have a few T-shirts left.

OK, so let me try to make it quick. Well, hello, everybody. I'll start a couple of minutes early, just so that we start a conversation and I can introduce things while people are still coming in. We have a lot to cover today, on a Sunday after lunch, so I will do my best to make sure you have a great nap this afternoon. My name is Dmitri Pal. I work for Red Hat; I have been with Red Hat for nine years, and all these nine years I have been doing identity-management-related projects. Before that, I worked for RSA and spent 10 years there in different roles, mostly on security projects: first developing, then managing, and so on.

Today we will be talking about integrating Linux systems with Active Directory using open source tools. I'm glad that I have been given the opportunity to present it; I have been trying to get to SCALE for several years, and finally I managed. That's great. Thank you. So let's start, because we are very tight on time. Yes. OK. Is it better? Is that better? Perfect. OK. Great.

So let's start. We will begin with the problem statement and an understanding of what we are trying to solve, then look into the aspects, the dimensions, of the problem and how we can measure a solution. Then we'll look at the different options that are available right now for integrating Linux infrastructure, Linux systems, into an Active Directory environment, and then we'll talk about recommendations. The flow of the presentation is such that we will pause in the middle and I will take questions. So just wait; I will explicitly stop and ask you. OK. Thank you.

So let's start with the problem statement and the aspects of integration. For most companies, in a lot of environments, Active Directory is the central hub of user identity management. The users and their properties, their credentials, and the policies around those are stored inside Active Directory in many enterprises. According to analysts, more than 90% of enterprises have Active Directory as the authoritative source for identities and authentication. So what does that mean? It means that all systems need to interact with Active Directory in some way, so that the users stored in Active Directory can access systems, resources, and applications. There has to be a connection of some kind. In some cases, Active Directory is the only allowed central authentication server, due to compliance reasons. In the past, there have been solutions that would allow you to copy information from different identity sources.
For example, from Active Directory to some other place, so the users and their credentials would be stored in some other database or identity store. But the problem with that is how you accomplish compliance and meet the regulation requirements and audit requirements in that case. So people more and more move to using Active Directory as a single place, put all the audit controls and compliance controls in one place, and expect the whole rest of the infrastructure to take advantage of that. So you can't really say: oh, just copy my users and credentials into some other place, right?

And then, last but not least, is DNS. DNS is very important when you deal with Active Directory. Active Directory has a built-in DNS, and historically it happened that in many environments, Active Directory DNS became the authoritative source of how you manage your name services and zones: what naming convention the clients in the Active Directory environment follow and what DNS zones they are a part of. So these are the things that you need to deal with when you are integrating your Linux and Unix infrastructure with Active Directory.

Let's look at the aspects of integration. There are four of them; I have two slides, two on each. First of all, we need to focus on identities. We need to understand where identities are coming from, where they are stored, what their properties are, what their group memberships are, right? How this identity information is delivered from the central place to the client system or application, where it is cached, how these caches are updated and maintained. So the whole flow of the identity, from where it's stored to where it is used.

Authentication: what credentials are being used? Is it passwords? Is it smart cards? Is it two-factor authentication? Is it a combination? Is it adaptive, right? What about single sign-on? How can you bridge between different protocols? You log into the system and then you go and mount a file share, right? Is it mounted with single sign-on, so you don't have to be prompted for another authentication? And then you go to some web application and you are prompted for a username and password, or for some other credentials. Can you avoid that? Can you make it so that the credentials you provided once at the beginning can be reused, so that you are not prompted again and again? Things like that.

Access control. So what's the difference between authentication and authorization? Authentication is the step where you establish that you are who you claim you are, and that's it; that's what authentication is about. Once you've established who you are, it doesn't mean that you can do what you're trying to do. Access control is very important in terms of: while we know who you are, are you entitled to do what you are trying to do? And how are all those access control policies stored, how are they delivered, how are they enforced, what are the mechanisms, what are the protocols, including the management tools and the client components that allow you to actually enforce those access control rules?

And then policies. Policies is a loaded term, but there are all sorts of different rules that need to be taken into account around authentication, access control, and single sign-on: what is the length and strength of your credentials, right? The renewal: how frequently does your password have to be changed? What about single sign-on?
Is it allowed here and not allowed there? What about two-factor authentication? Is it allowed here but not allowed there, and so on? All sorts of different policies, and those policies in many cases come from your audit and compliance requirements. At this point I stop: any questions so far? Is it a nice nap so far? Okay, good.

So what are the options? How can we integrate what we have in Linux and Unix with Active Directory? There are two main approaches. One is you take your Linux systems and you connect them to Active Directory using client components, talking directly to Active Directory and pulling the identity, the authentication, the access control, and the policies directly from Active Directory. That's called direct integration. Or you can do indirect integration: you can have sort of an intermediary central identity server that manages the Linux systems and then works with Active Directory, sort of as a proxy, though it is not really a proxy. We'll compare why we have these two approaches, what the benefits of each are, and when to use which one, okay?

So we'll start with direct integration. With direct integration, there are four known integration options. First of all, there is a slew of vendors on the market that provide solutions to integrate Linux, Unix systems, Macs, and now phones into Active Directory. Then there are some legacy components that are available pretty much everywhere, in all distributions and on Unix, that in a generic way allow you to connect things to Active Directory and other directories. There is the traditional approach, which is based on Winbind, and there is the contemporary one, which is based on SSSD.

Before I dive into the details, just some assessment: who actually does integration right now in the audience at all? Okay, so maybe 30%, 25%. Out of those of you who do integration, what do you use? Third party, like Centrify or Quest, or what do you use? Quest, anybody else? Quest, okay, Centrify, okay, Likewise. Okay, so you pretty much covered the vendors that are in this bucket. Okay, who uses the legacy stuff? Okay, okay, good. Who uses traditional Winbind? Okay, and who uses SSSD? One, two, okay, wow. So we'll talk about that.

So let's start with the third parties. There are different vendors, right? Active Directory effectively provides three different pieces: you have Kerberos, the Key Distribution Center; you have directory services, the LDAP interface that Active Directory provides; and then there is DNS. Those three pieces can be consumed by the client system. With a third-party client like Centrify, or Quest, or Likewise, you can do authentication and identity, and in some cases access control. But for access control and any other policies, you need to have some kind of plugin on the directory server. You can do basic stuff with LDAP filters and lookups, but as soon as you want something more advanced, like host-based access control, you need the plugins so that you can manage things centrally.

So for those who said Centrify: do you like Centrify? Does it meet your needs? Okay, so do you use the basic stuff, Centrify Express, or the full Centrify? I don't remember how it's named, but do you use extras like Centrify DirectAudit? No, just Centrify works for you, and that's fine. Okay, good. What about Quest? Do you like it? Okay, is it stable enough for you? Okay, okay.
So they provide, yes, VAS provides a way of effectively centrally managing sudo. So, okay, anything else? Okay, Likewise; somebody said Likewise. Do you like Likewise? Okay. So all of these vendors, to some extent, are extra cost in your environment, and with all of them you have to install something on the Active Directory side.

One of the questions is how you manage the POSIX stuff, right? Where does it come from? Does it come from Active Directory itself? Do you put in Services for Unix? Let me step back. Active Directory provides a way of managing POSIX attributes: you can put a schema into Active Directory to manage your POSIX attributes inside Active Directory. Does everybody understand what I'm talking about? POSIX attributes, does that make sense? Okay. So they need to live somewhere. Microsoft came up with a schema: first it was Services for Unix, then it was Identity Management for Unix, and at some point in 2014 they deprecated all the tools to manage POSIX attributes inside Active Directory. So you really have to have some management tools if you want to manage POSIX attributes inside Active Directory, and you probably get them from Quest or from Likewise or from Centrify, right?

Centrify also provides a very interesting feature called Zones. If you come from multiple disconnected environments, niche environments, for example, and you have UIDs and GIDs in different environments meaning different things, how do you disambiguate? How can one account have different UIDs on different systems? Centrify allows you to do something like that; it's called Zones, right? Do you use that feature? Ah, okay.

Okay, so this is what the third parties do. They provide a lot of capabilities, so there are definitely pros to using those vendors: everything is managed in one place, and single sign-on can be accomplished over Kerberos, because all of them provide a Kerberos client. But there are some challenges. First of all, it's a third-party vendor. It's another vendor, right? You have to buy something, and it is extra cost; they are not cheap, and they are priced per system. So on top of the CAL, the client access license that you have to pay for having a system be a part of Active Directory, unless you have huge volume licenses, you have to pay those third parties for the enablement, right?

You also effectively depend on Active Directory, and on the Active Directory guys, in managing your Linux systems. Anything you want to change, anything you need to do, you have to have some privileged access to Active Directory. If it is one team, that's fine; if they are different teams, Active Directory is one set of guys or girls and Linux is a completely different group, then this is an organizational challenge. Also, it requires software on the Active Directory side. You really need to install something on the Active Directory side, and if you need to manage your Linux, Unix infrastructure from the Active Directory side, sometimes the Active Directory guys don't like it. And for management of the policies, you really need to rely on the offerings from those third-party vendors and their add-ons and solutions, both on the server side and on the client side.

Okay, moving forward, unless there are questions. Okay, so the next one is legacy direct integration. In the past, there was a set of PAM and NSS modules. PAM is Pluggable Authentication Modules.
That's for authentication. And NSS is the Name Service Switch, for the identity side. These can be put into a Linux, Unix system and connect the system to different identity sources. So there has been pam_ldap, and there still is pam_ldap, which allows you to get identity and authentication against any LDAP source. So you can use pam_ldap. Unfortunately, it's not that easy to use pam_ldap with Active Directory. Active Directory is LDAP with a lot of customization that Microsoft did. For example, group membership. At the moment when it was implemented, there weren't enough standards for how it should be done, and Microsoft did it the way they thought made sense for them. And so there are several things that work in Active Directory not the way a generic LDAP client expects.

So, for example, the group membership: you can't just look at the user and see what groups he is a member of; you will only see the direct groups. You have to actually traverse the tree to be able to resolve group membership. Another thing is that when you pull data from Active Directory and there is a lot of data, you need to do some kind of pagination. There are special controls that Microsoft implemented to allow you to paginate the results; the default, as far as I remember, is 2000 entries, or 2K, I don't remember exactly, but there is a default that makes you do it in chunks. And this is only the way Microsoft does it; pam_ldap doesn't know how to do all that.

Another challenge is domains. You can have multiple domains, and usually in Active Directory you have a collection of domains, and a domain is usually associated with a specific data center. Creating a configuration in LDAP that would cross multiple domains with pam_ldap is really close to impossible; very hard to do. Another issue with pam_ldap is that you have to have some kind of credential that the system uses to authenticate to the central server to get identity information. And where do you put it? You put it into a file. So now you have an administrative credential lying somewhere on the system. Okay.

Another piece is pam_krb5. So you can do Kerberos, and you can do Kerberos with Active Directory. Microsoft implemented Kerberos, though, extended in a specific way to carry authorization information inside the Kerberos tickets. You can get a basic ticket and pretty much do single sign-on, but pam_krb5 doesn't understand anything that comes inside this ticket as authorization data. Another challenge is forests. When you have more than one Active Directory forest, using pam_ldap or nss_ldap or pam_krb5 becomes a real challenge. So effectively, to some extent, if you have a small environment, if you have a very simple Active Directory, you can do authentication and you can do identity lookups, but as soon as the environment becomes more complex, it becomes really, really hard to manage. And of course the policies are not managed at all, and host-based access control is not managed at all.

That's embarrassing. It crashed. Do we have any Firefox guys in the middle? That's embarrassing. So, the access control, yeah. So I talked about access control.
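To make the credential problem concrete, a legacy /etc/ldap.conf for pam_ldap and nss_ldap against Active Directory might look like this (all names hypothetical); note the bind password sitting in plain text on every client:

    uri ldap://dc1.ad.example.com
    base dc=ad,dc=example,dc=com
    # an administrative credential lying on the system:
    binddn cn=svc-linux,cn=Users,dc=ad,dc=example,dc=com
    bindpw secret123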
So what are the pros and cons? Well, first of all, it's free, it's simple, it's intuitive, it's basic, it's easy to set up. You can even use LDAP to do OTP authentication in some setups, but I haven't tried it. And it's available everywhere, so it's consistent across a Linux, Unix infrastructure. But it is very basic. It requires extensions on the server side, and there are really no tools to manage POSIX attributes on the server side if you are using the basic solution; you effectively have to script things yourself. Policies are not centrally managed, and there is no SSO and no two-factor authentication. We'll talk about that. Any questions about the legacy setups? Okay.

Traditional. Traditional is based on Winbind. Winbind is a part of the Samba project, which was started about 20 years ago, maybe a little bit less. The idea of the Samba project is effectively to duplicate Microsoft Active Directory, file sharing, and the Windows client components in open source. Winbind is the client-side component that allows you to connect a Linux, Unix system to Active Directory. The idea is to make everything look as if it were a part of the Active Directory domain: to mimic Active Directory, to mimic Windows, so that your Linux system looks like a Windows system to the central server. That's the mindset. So instead of touching anything on the server side, Active Directory stays as it is, with its own protocols and its own capabilities, and Winbind talks the native Windows stuff.

And the native Windows stuff is much more complex than just LDAP and Kerberos. There are all sorts of other protocols that Microsoft implemented to do service discovery and resolution of identities. They implemented the global catalog, which is a special entity that allows you to collect identities from different parts of the domain, aggregate them, and have a fast lookup, rather than going to every domain controller; and that is a different protocol. Winbind understands all these things. It provides you with authentication, it provides you with identity information, it provides you with basic access control. But it is expected that in an Active Directory environment, you use so-called Group Policy Objects to manage your access control. Group Policy Objects are the standard way in an Active Directory environment to define who can do what on the client systems, and they are very Microsoft-centric, because Microsoft knows what the client systems look like: they know they are Windows client systems. Mapping that to Linux and Unix concepts is not a very good mapping, and Winbind doesn't support Group Policy Objects, so there is no way to do host-based access control using those. There is also no way to do something like sudo using Winbind.

What Winbind can do is Kerberos and NTLM. Before Active Directory adopted Kerberos in 2003, there was another single sign-on and authentication methodology called NTLM, and pretty much every single client in the Windows ecosystem can talk NTLM, and will fall back to NTLM if something doesn't work with Kerberos. Winbind knows how to deal with NTLM. NTLM is very old; it is based on hashes, I think RC4 hashes, which are currently crackable in a matter of seconds, but it is what it is.

Okay, so what are the pros? It's a well-known solution that has been there for years. It doesn't require a third party. It doesn't require you to install extensions on the Active Directory side, and this is good because you can actually manage POSIX attributes in a dynamic way: you can take an identity that is in Active Directory and dynamically create UIDs and GIDs, and synthesize other attributes, on your Linux systems through the Winbind integration.
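As an illustrative smb.conf fragment for such a Winbind-joined client (realm and ID ranges hypothetical), the rid idmap backend synthesizes UIDs and GIDs algorithmically from Windows RIDs:

    [global]
        security = ads
        realm = AD.EXAMPLE.COM
        workgroup = AD
        # default backend for anything not matched below
        idmap config * : backend = tdb
        idmap config * : range = 10000-99999
        # synthesize UIDs/GIDs for the AD domain from RIDs
        idmap config AD : backend = rid
        idmap config AD : range = 100000-999999

Joining the domain is then a matter of running 'net ads join -U Administrator'.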
It supports forests; it is the only one of these solutions that does. It supports the domains: inside one forest, it can discover multiple domains and knows how to talk to the different domain controllers. It understands how to talk to the global catalog, but it also understands that this is one forest and there is another forest, that they are in a trust relationship, and how to discover and work in that environment. So the system might be joined to one forest and the user might be coming from a completely different forest, right? Winbind will figure things out automatically. And it can fall back to NTLM if something is not working.

The cons of this solution: it can connect only to Active Directory; it doesn't care about anything else, any other identity sources or configurations. Only Active Directory. Policies are still not centrally managed; there is still no solution for the policies, it doesn't support GPOs, and there is no two-factor authentication support in Winbind. Okay, any questions about that? Moving forward.

SSSD. So let me introduce it, since only one or two people indicated that they are using it. No way. SSSD stands for System Security Services Daemon. It is an open source project that was started about nine years ago and has been available in pretty much every Linux distribution for about five or six years now. It's in FreeBSD, it's in Ubuntu, Debian, Fedora, Arch Linux, RHEL, CentOS; what did I miss? SUSE, yes, of course, openSUSE, yes. So it's everywhere; the only difference is the version. So take a look, it's in there. For Fedora, Red Hat, and CentOS, it's a core package: it's in there, it's installed, it's just not configured by default.

What it allows you to do is connect your system to central identity sources, and it understands the specifics of Active Directory; it understands the specifics of other LDAP and Kerberos servers, as well as the FreeIPA server that we will talk about a little later. It allows you to do authentication, identity lookups, and access control. When it works with Active Directory, for example, it supports Group Policy Objects: if you define Group Policy Objects centrally, it can pull them down and apply them as your access control, in the same way it is done with the Windows clients. The goal of the SSSD project was to take all this legacy stuff and bring it to the next level, and to provide something native in the Linux distributions that is comparable in its capabilities to Centrify, Likewise, and Quest. And that's exactly what it does. It is a competitor to those solutions, and it is capable of pretty much the same things, but it is a part of the operating system.

SSSD allows you to combine multiple sources of identity, so you can say: I want to get some identities from Active Directory and some identities from this LDAP server and some identities from this database; I have multiple domains, and I can merge that together on a single box. You don't need to have a meta-directory, for example. SSSD is built around caching; caching is its central component. I think I have a diagram; yeah, okay, we'll get to that.
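A minimal sketch of an sssd.conf for a directly-joined Active Directory client (the domain name is hypothetical; the options are real, but check the defaults of your version):

    [sssd]
    domains = ad.example.com
    services = nss, pam

    [domain/ad.example.com]
    id_provider = ad
    access_provider = ad
    ad_gpo_access_control = enforcing   # apply GPO-based access control
    cache_credentials = true            # allow offline authentication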
It also handles the remote data center cases, or laptops, well. This laptop has been running SSSD for about seven years, through all the versions. My identity information, group information, all that stuff is cached, so when I close the lid and lose the network connection and then open it again, I can still authenticate. And it's not required to cache credentials, because in some cases you don't want to cache credentials, and in some cases you do. And SSSD has advanced features for the popular servers like Active Directory and FreeIPA.

So let's look at the architecture. This is a very simplified architecture of SSSD. On the left side, the clients are the applications and processes running on the operating system that need authentication or identity: through glibc, through the standard getent calls, whatever you want, groups, netgroups, user information, or through PAM calls to do the authentication. Inside the SSSD group of processes, there is an NSS responder, a process that is responsible for preparing and streaming the information for your group membership or netgroup membership as you iterate and fetch it. It takes it from the cache, but it also works with the provider. The provider is the part of SSSD that is responsible for talking to the central identity source and getting information from there. The PAM responder, on the other side, is responsible for the authentication workflows and password changes, things like that.

The providers, and as I said, you can have multiple providers, so you can connect different identity sources in parallel, consist of four parts. I am showing two here, but they are the main ones: the identity provider, which pulls the identity information and puts it into the cache, and the authentication provider, which is responsible for the authentication. There are two other providers. One is for access control: you can define how you do access control, whether by whitelisting, blacklisting, GPOs, host-based access control, an LDAP filter, or something else. And the fourth one, a quiz for you: what can it be? The fourth piece, I mentioned it. No, policy is sort of the access control. Well, it has other pieces to pull, like sudo, SSH, and automount information, but those are subparts of the identity provider. There is one core function that I mentioned earlier that we have a provider for: password change. Password change is a separate protocol in the different identity sources, so there is a specific implementation of the password change.

SSO is not a part of SSSD, because, and we can talk about the Kerberos workflow if you want, SSSD is where you do your initial authentication, where you actually interact with the credentials. If you want SSO with, for example, SSH, then SSH will be the service you single sign on to. You can turn it on in your SSH, and then the identity information for SSH will be pulled from SSSD through NSS. So effectively, the PAM stack won't be involved in the case when you have SSH single sign-on with SSSD: there won't be an authentication part involved or a password change involved; there will be identity involved, and there will be authorization involved, so that you can define whether this user can actually access this system. Does that make sense? Questions about SSSD?

So why SSSD? I sort of talked to that slide a little bit already. It brings the architecture to the next level, does everything that has been done before, has specific features for the popular identity sources, and has feature parity with Winbind, and in some cases surpasses it. Okay.

A couple of words about realmd. Starting with Fedora, I think it is 18 or 19, and then Red Hat Enterprise Linux 7, there is a component called realmd. realmd allows you to quickly configure Winbind or SSSD as a part of an Active Directory domain. It's effectively "realm join" plus parameters, boom, you're done. A very, very simple way of joining your system into Active Directory, or, for that matter, into IPA. It uses a D-Bus interface, it's very flexible, it's a very good thing. So if you want to do some automation, realmd on the modern systems is the way to go. Just read about it, okay.
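As a quick sketch (domain and account hypothetical), the realmd workflow looks roughly like this:

    # discover what the domain offers
    realm discover ad.example.com
    # join it; by default this configures SSSD and creates the host keytab
    realm join ad.example.com -U Administrator
    # verify the join and see which client software was configured
    realm list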
So with the SSSD integration, you get the authentication, you get the identity, you get name resolution and discovery of the clients in the whole Active Directory ecosystem. You can do GPOs for host-based access control from Active Directory, or filter-based access control if you want. But still, some pieces are not managed: sudo is not managed, SELinux and SSH are not managed. So it's a good solution to replace the basic stuff, and if you are just using Centrify or Quest for authentication and access control, SSSD out of the box provides you with everything you need, without any cost. It doesn't require anything on the Active Directory side. It can take advantage of your UIDs and GIDs if you continue to manage them there, but like Winbind, it can do dynamic mapping. It supports multiple identity sources, as I mentioned. There is a solution for trusts: it doesn't support them natively, but it supports them with FreeIPA. It can work in the same way as Winbind with CIFS and the Samba file server. It supports host-based access control through GPOs. And I hope that after this presentation, it is well known.

Okay, so there are some cons, definitely; you can't go without those. There is no NTLM support, though there is a plan to add NTLM support. There is still no single sign-on with OTP. And not all policies can be managed, right? We talked about host-based access control, but other things can't be managed.

So the summary of the options, if you are using direct integration, is: use SSSD. It provides good enough integration out of the box, and it's free and well supported. If for some reason it's not enough for you, then Winbind is the fallback, because you, for example, need to do NTLM, or you have to use cross-forest trusts and you don't want to use FreeIPA; Winbind can help you there. If you want to pay money, you are welcome to pay Centrify, Quest, and Likewise; oh, PowerBroker, I'm sure, is what they are called now. And please don't use the legacy setup; it's not good anymore, okay?

To continue on that, there is a blog. This whole presentation is based on the blog; there are multiple blog posts in this series, which I started about three years ago. So if you want to drill down and get this whole information again, or explain it to somebody else, you will find enough information there to do that, okay? And there is a comparison there, okay?

So, there are still issues with direct integration in general: policy management is still not central, managing POSIX attributes requires deprecated extensions on the Active Directory side, you have to deal with CALs, client access licenses, for every single system, and you don't have control if you are a different team from the Active Directory team. So FreeIPA, or IdM, comes to the rescue in this case. Let me see who is familiar, who has heard about FreeIPA. Okay, great, thank you. So, IdM is the downstream of FreeIPA. FreeIPA is an open source project. IPA stands for Identity, Policy, Audit.
It was started in 2007, and IdM is what Red Hat bundles and provides in Red Hat Enterprise Linux and CentOS. The audit piece, yeah, I forgot to mention: the audit piece is sort of on the back burner now. It has been extracted from the project, because we realized that audit is much bigger than the domain controller; it covers the whole infrastructure. So the project focuses on identities and related policies. It does two things well. It replaces the existing old LDAP, Kerberos, and NIS-based solutions, and does it much, much better in terms of automation, configuration, manageability, compliance, and so on. And it is free; it's a part of the operating system, not something that you buy anywhere, you just get it. And it acts as a gateway between your Linux, Unix infrastructure and Active Directory. So it plays dual roles, depending on how you want to run it.

What is it? It combines a Kerberos domain controller based on MIT Kerberos. It uses an LDAP back end based on 389 Directory Server. It has some optional PKI components, like a CA or a key store, based on the Dogtag open source project. It optionally uses DNS based on BIND. These components are all combined together with simple-to-install, simple-to-manage scripts. There is a framework to manage the data inside FreeIPA, to manage all the user identities, policies, and so on, and there is a nice UI in there. SSSD and IdM work hand in hand, and SSSD knows how to pull additional information from IdM in a similar way to how it can pull additional information from Active Directory. So with IdM, with FreeIPA, you can manage sudo, you can manage host-based access control, you can manage netgroups, automount, SSH keys, SELinux user mapping; all of that SSSD can pull from IdM. I wanted to run a demo, but we don't have time for that; we can talk about it in the hallway afterwards if you want.

So, IdM has a lot of features. I'm not going to read every single bullet, we don't have time for that, but it allows you to centrally manage identities and related policies. It can be a standalone domain controller, but it can also be connected to Active Directory, and that's what we want to focus on here. There are two ways this can be done. You can synchronize users and passwords from Active Directory to IdM, but that's not the preferred solution; as I mentioned at the beginning, with the compliance requirements, that's not going to fly. So the cross-forest trust is the recommended way to integrate FreeIPA with Active Directory, and it gives you a lot of benefits.

So how do you do it? You can optionally chain your CAs; it's not required, but you can. You need to delegate a DNS zone, because you are effectively building a new forest, so that Active Directory thinks it is talking to yet another forest. Then you establish the cross-forest trust; it's one command. And now you have the separation between your Active Directory world and your Linux world, and you manage your Linux systems the way they need to be managed, with all the POSIX, netgroup, SELinux, sudo, SSH kinds of things that are completely foreign to Active Directory. So you get the best of both worlds. The Linux system gets identities and authentication against Active Directory, because the users are still in Active Directory, and you can keep your audit monitoring on Active Directory: all the authentications will continue to happen there.
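A sketch of those steps on the IPA server (domain names hypothetical; ipa-adtrust-install and ipa trust-add are the standard FreeIPA commands for this):

    # prepare the IPA server to act as a trust controller
    ipa-adtrust-install
    # establish the cross-forest trust; this is the one command
    ipa trust-add --type=ad ad.example.com --admin Administrator --password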
But all the policy stuff, all the access control stuff, all the things that the Linux systems really need, will be pulled from FreeIPA.

It solves the problem of user mapping. It can take advantage of the POSIX attributes that you have inside Active Directory, but it can also do the mapping dynamically, like Winbind or SSSD does. And it actually does it in the same way SSSD does, so the dynamic mapping is consistent across the whole infrastructure, if you want that. Unfortunately, dynamic mapping works only for greenfield deployments, when no one already owns files on the operating systems or on the file shares. If files are already owned, then you need to maintain the UIDs and GIDs. So there is a third option. You can put the POSIX attributes inside FreeIPA, into ID views. This is sort of an extension of your Active Directory object that contains additional information. Right now it contains POSIX attributes, but it can also contain other things that are valuable, for example SSH keys or certificates. And now you can manage the additional information that is relevant only to your Linux, Unix environment, for your Active Directory users, the way you want to manage it. And in the future, there will also be OTPs, a way to do OTPs for Active Directory users. It's not there yet; probably a year from now. It would require many pieces: it would require SSSD, it would require the Kerberos libraries, because it is a step-up authentication. There is actually a lot involved, and we can talk about that afterwards, okay.

So the trusts come in two flavors: you can have a one-way trust or a two-way trust. With a one-way trust, IdM trusts Active Directory; that means that users from Active Directory can access resources managed by FreeIPA, IdM, and not the other way around. With the two-way trust, it would be nice if users from IPA were able to access Active Directory resources. Unfortunately, that is not yet the case, because FreeIPA needs to provide a global catalog for the Active Directory world to define permissions for those identities, and the global catalog is not there yet. It's being built as we speak; hopefully in the next release it will be there, and the trust will be fully functional, so that you will be able to trust FreeIPA users on the Active Directory side. Okay.

Pros and cons of the trust-based solution. So, "a part", nicely spelled: FreeIPA, IdM is a part of the operating system, so it reduces costs. Every single client is now a client of IdM, not of Active Directory, so you don't need to pay CALs to Microsoft for them. Policies are centrally managed, and that's a big benefit: you can do your sudo management, host-based access control, SELinux, other things. It gives you control to do whatever you need to do in your day-to-day operations. It enables independent growth: you deploy your systems the way you need, when you want to deploy them. It doesn't require synchronization. Authentication still happens in AD, and that's important for the monitoring and compliance. And it's a great tool for troubleshooting Active Directory, because what became apparent is that when people deploy the trust, it reveals all sorts of misconfigurations on the Active Directory side, related to DNS, related to firewalls and other things. So if you want to clean up your Active Directory, it helps.

Okay, it has a requirement of a proper DNS setup. That's a big thing.
So FreeIPA brings in a new domain, right, a new forest. And that's a challenge in some cases, because you might already have your Linux systems directly joined to DNS zones controlled by Active Directory. So there are some limitations. There are some blog posts and articles that explain what can be done. If you look in the blog, there is an article called "I Really Can't Rename My Hosts", and it explains what you can still do, but of course there are some limitations. And there is a big con that people always throw at FreeIPA in this setup: we don't want another piece of infrastructure, it's another thing to manage. That usually comes from the Active Directory guys, and that's understandable. But the argument there is: think about it as an Active Directory forest to manage your Linux infrastructure. If you don't do that, you will have the same amount of Active Directory domain controllers to manage your Linux. Yes. The question was, can we separate FreeIPA into separate modules? I assume you mean containers, right? Or modules, what do you mean? So, is there any way to separate pieces of FreeIPA into different components that are installed separately but still have the value of FreeIPA as a combined solution? Unfortunately, right now, no. IPA is easily available as a container, as one container that contains everything. But as we move forward, we will be splitting it into more pieces, starting with the optional components: DNS, the CA and KRA, the trust agents. They will be split out as we move forward. It will take some time. So right now it is not possible, and some time means years, not months. Okay, so the summary: there are different paths to Active Directory integration, direct and indirect, and there are benefits to each. You can use SSSD for the direct integration. If you want separation of responsibilities, better manageability, and reduced costs by eliminating a third-party vendor, FreeIPA allows you to do that. So that's pretty much it. Questions? Maybe this is just a misunderstanding, but earlier you made it sound like winbind would automatically fall back to NTLM if LDAP went down, which sounds like it could be an issue with some kind of downgrade attack. So you break LDAP somehow, and then it's using NTLM, which uses an insecure hash. Is there a way to avoid that, besides just disabling NTLM with winbind? So the question was about fallback to NTLM in the Active Directory world. First of all, it is not LDAP but rather Kerberos. Second, I'm not aware of any way of disabling that; as far as I know, it's built in. So if it can't talk Kerberos, then it has to use NTLM. And in the setup where you have a machine that is not a member of the Active Directory domain but you want to use file sharing, NTLM is actually the only way you can use file sharing. Can you switch back to your blog slide, your blog address slide? Okay. So the question was about my blog, and this is the blog slide. You can just search for the Dmitri Pal blog and that would be enough. Yes, I do. So thank you. So, I'm kind of new to this area. For the trust: if we have a username change in AD, one that causes the UPN or email or username to change, will that cause a problem for the trust, thinking that...
Okay, so the question was: if you manage users in Active Directory and you change the UPN or username, is the information synced? There is no syncing if you use the trust. You make your changes in Active Directory and they are reflected immediately. Quick question about the legacy integration you said you didn't recommend. Is it not recommended because it's not capable anymore, or is there a security issue? There are some security issues, okay. So, why is legacy not recommended? For example, because with pam_ldap you need to put some kind of credential somewhere on the operating system, which is not safe. So it's hard to do right. It doesn't scale to the more modern environments in terms of configuration, like supporting multiple domains and forests, things like that. And it doesn't have caching capabilities. So these are the limitations. Yeah, I was going to ask you this question afterwards, but there seems to be some interest: OTP support. I know it was something like two years ago or more that some FreeIPA work was done with MIT to get support on the server side for FAST OTP, and it seemed to be working. So I'm curious where that is. I guess you still plan on delivering that capability, but it's been a while. So the question was about OTP support, and the answer is that it has been available for a couple of years now, or a year and a half. FreeIPA supports two-factor authentication with TOTP and HOTP tokens over the Kerberos protocol. It also supports the capability to proxy authentication over RADIUS to your existing two-factor authentication solution, based on, say, RSA SecurID or Vasco, ActivIdentity, whatever you have, or Duo, for example. We have a lot of deployments with Duo for two-factor authentication. The caveat is that it is only for users that you manage inside IdM, inside FreeIPA. You can't do the same thing for Active Directory users yet, and that's why I didn't emphasize it in this presentation. Could you do a trust with Active Directory and let Active Directory inherit the support that it doesn't provide itself? And I would think the same approach would work with Azure. Can we do the reverse thing, do two-factor authentication with FreeIPA, and then have a trust with Active Directory? Yes, that's exactly the reason why we're building the global catalog: so that we can be in a mutual trust relationship, so that you can have two-factor authentication on the IdM, IPA side, and then Active Directory will trust that. Yes, we're going there. Yes, absolutely, and then making things work together. So that's the answer. One other thing on top of that, just to finish: what we did is the ability to mark tickets with authentication indicators, so that you can say this authentication was conducted with two-factor, and this with single factor, and this over RADIUS, and, in the next release, this with a smart card, so that you can make authorization decisions saying: for this service, you will get a service ticket only if you conducted two-factor authentication. So you can now partition your single sign-on. And that already works in 7.3. Could you go into a little more detail on where realmd fits into your SSSD architecture slide? Like, how are the two related? Okay, the question is about the relationship of realmd and SSSD, or winbind for that purpose. So realmd is a configurator of SSSD or winbind.
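To make that concrete, a minimal sketch of the one-time realmd workflow described here, assuming a placeholder domain:

```sh
realm discover ad.example.com                # detect which domains are available
realm join ad.example.com -U Administrator   # configure SSSD, Kerberos, NSS/PAM and enroll
realm list                                   # show what got configured
```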
So realmd, if you run it, will configure SSSD, the Kerberos stack, NSS, get the identity, and enroll with the domain. Before that, it will just discover the domains and detect which ones are available. So realmd is a configurator. It runs once, it joins the domain, it's done. SSSD or winbind are services that run constantly, providing identity and authentication capability. Thank you very much. Thank you very much. Thank you for being so patient on a technical matter. Our next talk is an introduction to reproducible builds. Our speaker is Vagrant Cascadian. He is a contributor at reproducible-builds.org. The reproducible builds project creates infrastructure and fixes upstream code so that binaries can be independently verified as the result of compiling source code. Without verifying the connection between source code and binary software, toolchains will become a tempting target for injecting exploits. This talk will demonstrate why reproducibility matters, common issues and fixes, and tools used to identify and troubleshoot issues, moving towards reproducibility as a set of best practices when developing and improving software. Without further ado, I give you Vagrant. Hello. Welcome. Sorry for the late start. I got here as early as I thought was reasonable. So, this is an introduction to reproducible builds, from Vagrant Cascadian. I work with the Debian project. I'm also working on reproducible builds, which is really more of an upstream endeavor. So this talk will mostly be focused on Debian, but in theory all of this should apply to any project, really. So, we'll start off with some goals. We're aiming to get to a situation where, when you build software and then you build it again, it comes out the same. This just kind of gets under my skin sometimes. I mean, it seems wrong that we have to work at this, but apparently we do. And so we've identified some of the issues that lead to this situation. So, source code. How many people here know what source code is? Come on, yeah, get those hands moving. Great. So, source code is generally what developers work with. Computers don't generally run source code, though; they run binary code. And how can we be sure that the binary code we're running is actually produced from the source code that was written? Or, if not sure, at least more confident. So, reproducibility in a scientific sense. I really put some stress on the fact that we can independently verify things. A lot of projects have something kind of similar to reproducible builds, where they have this complicated build environment that sanitizes everything and then builds it. But the average general user isn't going to be able to verify that build. They're not going to set up an entire build infrastructure just to verify that the source code and binary are descended from one another. And how many of you identify as, or maybe went to school for, computer science? All right, yeah, a handful. So, where's the science in computing if we can't reproduce a binary from some source code? So, here's a simple little program I wrote. It outputs two. We can verify that it produces the correct output using a simple checksum. I rebuilt this program and could verify that I'm getting the same result. So, checksums are integral to reproducible builds.
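A toy version of that "outputs two" demonstration, as a hedged shell sketch; the file names are made up, and on most setups two back-to-back runs of the same compiler on the same machine will agree for a program this trivial:

```sh
cat > two.c <<'EOF'
#include <stdio.h>
int main(void) { printf("%d\n", 1 + 1); return 0; }
EOF
gcc -o two.1 two.c && gcc -o two.2 two.c
sha256sum two.1 two.2   # bit-for-bit identical builds yield identical checksums
```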
And, you know, you might get some number that's kind of close to two, but it's going to come up with a different checksum. So, when we're talking checksums, that really gets into bit-for-bit identical: it is exactly the same, and that's what we're really striving for. So, your typical software build is a little more like this. You've got some source code. You may pass some arguments to the build tools. You've got the toolchain itself. And then there's all this other stuff that always seems to make its way into the binaries: the time you built it, the running operating system, maybe the kernel version. All this stuff typically doesn't need to be in your binaries, and maybe you didn't even realize it was getting in there, but it actually ends up changing the results of the build. So, we've identified a lot of these kinds of issues, and who knows what you get out of it when you don't have control over your input. Your project needs a liker, okay? Okay, okay, a linker, okay. Anyway, our goal here is: when you build a given source code with a given build environment and you pass it the same instructions, you get a bit-by-bit identical copy. So it doesn't matter if you're building in your home dir or somewhere else. We defined what we mean at reproducible-builds.org, and there's a link to it there. We just developed that definition in December, at a meeting where a bunch of developers from all sorts of projects got together. That seemed like an important thing to do, because many people will give a talk and say their thing is reproducible, but we're trying to define reproducibility expressly as bit-by-bit identical, because that way you don't need to do any parsing of the artifacts you're trying to verify. You can just run a checksum on them. That's the goal. So, a little background. There were mentions of this on some Debian mailing lists as early as 2007. I've seen references to projects going back even to the early 90s, unrelated to Debian, of course, but it really didn't gain much traction until more recently. I'm guessing it has something to do with the leaking of certain documents, where people suddenly realized: oh, security, that's an interesting thing we should be thinking about. So, automated rebuilding of Debian's 25,000 source packages began in late 2014. We've continually grown those networks; we build faster and build more. We're currently building on four different architectures, in the ballpark of, what did I write there, 1,600 to 2,000 packages per day, depending on which day you look at: amd64, i386, arm64, and armhf. My main involvement is that I run about 20-some build machines running the armhf architecture, and I'll get into that a little more in a moment. So, this is a look at the history of reproducibility in Debian on our longest-running testing infrastructure, using amd64. We're currently at about 19% of Debian unstable being unreproducible. You'll notice a big drop. The green stuff is the good stuff; that's when we actually got to the point where we can verify reproducibility for the vast majority of Debian. But at some point, I forget exactly when, we decided: oh, we're doing too well at this, let's make it harder. So we started adding build path variations. That was one of the things we had previously decided was just too hard: we'll document the build path, and then people can use that.
But we had come a long way with this stuff, and we decided, okay, let's make this harder. So, in testing: I don't know if you know much about the Debian infrastructure, but unstable is where packages initially land, and testing is where they migrate to once they've had a period of no major bugs filed against them. So we decided to use testing as stable ground, and in testing only about 5% of Debian is unreproducible, if we ignore the whole build path issue. Which, at 5% of 25,000, is 1,300-ish packages. So that's a lot of software out there, and most other projects probably carry a lot of this software too. We have historically patched a bunch of things in a custom repository, but I believe at this point we no longer have toolchains that are modified or patched. They're all integrated into Debian, and we've tried to push most of those patches into the upstream projects, like GCC and others I'm blanking on. Basically, anywhere we can fix something in the toolchain, we fix it for all of the packages that use that toolchain to build. So GCC is a really important target for that, as are some of the other major compilers. Why does reproducibility matter? This is a security track, so people coming here are probably mostly interested in the security aspect; maybe you have a hunch why it would matter. Back in 2002, there was a single bit or byte difference in the OpenSSH binary, due to a bug, that caused a remote root exploit. So tiny changes in the binary can have huge effects on the results. More recently, which I find really interesting, in 2015 the XcodeGhost exploit was a malicious toolchain produced for Apple's development kit, and it infected over 4,000 apps in the Apple App Store. If you recall, we started this big rebuilding of all of Debian in 2014. At that time it was like: well, in theory, somebody could release a malicious compiler out there in the world and people might use it, and we want to be ahead of the game. But then along comes XcodeGhost and, surprise, it's actually a real-world problem, not just a theoretical one. So hopefully GCC hasn't been compromised somewhere along the way, but we have proof out there that this is a real-world problem. A lot of this goes all the way back to 1984, with Ken Thompson's talk "Reflections on Trusting Trust", in which he demonstrated that it was possible to build a compiler that injects a backdoor into any compiler you build with it, which then injects backdoors into anything else it builds, progressively, in perpetuity. It's hard to maintain over time, but it's a really difficult attack to actually address. And there's been very little research into how to address this attack until fairly recently. In 2005 and 2009, David A. Wheeler came up with papers and projects on diverse double compilation, whereby you basically take a second, independent compiler, build the compiler under test with both, build again with the results, and compare. The problem is you can't do diverse double compilation without reproducible builds. So it's basically a precondition to addressing this issue that was raised very publicly in 1984, and actually I think the issue was reported in the early 70s by some military researchers, if I'm not mistaken. So we're just getting around to addressing some of these longstanding issues.
So now, into the meat of it. I kind of wanted this to be: here are some things you can actually do as developers. I want to point out the most common issues. These are by no means all of the kinds of issues we encounter, but: timestamps, time zone, file sort order, locale. Addressing timestamps in your build can address nearly 80% of the issues. So that's a huge one, a really simple one to target, with a huge impact on the reproducibility of most software. Timestamps are actually kind of what got me standing on this stage today. I maintain U-Boot in Debian, which is a small boot loader, and it was marked as reproducible, and I scratched my head, because I know every single time it boots it prints the build time. So I looked at it: oh, of course, on amd64 it only builds some tools and not any actual U-Boot binaries. It was wrongly marked reproducible, so I started with a few ARM boards and built up a network to start testing on other architectures, where we actually build different things from the same packages. So I'm here because of timestamps. Who would have thought? Really, there's no timestamps like no timestamps. Ideally, if you can, convince the other developers on your project: let's just not embed the build timestamp in the build. That's the best way, by far. But if you've got some grumpy old people who just don't see it your way, who say we need some sort of timestamp in this binary somewhere, we've got a solution for that. If you really must, you can use the SOURCE_DATE_EPOCH environment variable, which specifies a fixed timestamp that you can take from a revision control history timestamp, a changelog, just about any timestamp other than the current time. You could use the most recent file in the build tree, any number of things. Basically, whatever makes sense: you specify the seconds since 1970, and then various projects will look and say, oh, SOURCE_DATE_EPOCH is set; rather than injecting the current timestamp, I'll use the one specified. So it's a reasonably good alternative that makes timestamps basically a non-issue. A related one is time zones. Sometimes when you're injecting timestamps, and this gets back to why it's easier just to exclude them, okay, you've got SOURCE_DATE_EPOCH, but then you might embed the time zone that you're building in. So if you really do need to include timestamps, please make sure that you specify what time zone you're building in, probably UTC, as a simple, straightforward standard. Now, what may be surprising is that when you get a list of files in a directory, depending on all sorts of factors, they don't necessarily come out in the same order. Different types of file systems will read them in a different order. Locale settings can sometimes have an effect on that. So in, say, a Makefile, you just want to make sure that any time you're using a wildcard globbing thing that looks at a directory, you sort the output. And this is going to be a common theme throughout a lot of the reproducible builds stuff. Yes, you've got a question? So the question was: why shouldn't GNU Make just handle that? That's actually a good proposal. There may be some technical reasons why the Make authors haven't done it, or maybe there's a specification for how Make is supposed to behave.
But definitely, that's touching on the more ideal way to fix things: fix it in the toolchain rather than in each individual piece of software. In the meanwhile, Make doesn't do that yet, so please sort your inputs. I'm guessing most of you at least understand or speak English fairly well, you're at this conference. Linguists would have a hard time identifying the difference between the C locale and English as spoken in the USA. But surprise, there are differences between these two languages. I don't know if you can see it clearly, but the sort order for capital A, lowercase a, capital B, lowercase b actually comes out differently. In C, it comes out with the capital letters first and then the lowercase letters. And in English as spoken in the USA, rendered in UTF-8, it interleaves them: lowercase, capital, lowercase, capital. So the locale can have surprising effects on packages. Again, in U-Boot, I had the most interesting bug, where U-Boot was failing to build reproducibly on ARM, and it had nothing to do with the fact that it was building on ARM. We narrowed it down to the fact that Italian had a translated string for the output of LD, translated as "ld di GNU", while every other language didn't bother to translate it and just said "GNU ld". And so in U-Boot for a while there were two languages: Italian, and everything else. I tried everything. I tried Korean, I tried Japanese, I tried Arabic languages, and they were all coming out the same. What is going on here? Why is Italian so special? And it turns out they translated a string in LD that nobody else bothered to translate. So U-Boot has been a surprising source of all sorts of unreproducibility issues that really baffled me. So, getting on to the hardest stuff, the most difficult stuff we really have to struggle with: the build path. So I'm building it in /home/happy-gnu-user, or you're building in /home/im-a-smart-person, whatever. That frequently gets embedded into the binaries or into the debugging symbols, and that's been a trickier one to solve. There's been ongoing work by Ximin Luo to get patches accepted into GCC, similar to the SOURCE_DATE_EPOCH specification, where you specify: map this source path prefix to this other string, so that you still have a consistent way of referencing the debug strings and things like that. But it's still in progress, and we're still sorting out exactly what we even want to propose. And build paths, typically, we work around by normalizing the build environment. That's the easiest workaround, if you have control of your build environment. So it's been one of our compromises; until recently we were kind of putting it on the back burner, but now the time has come to bring it back to the forefront. We already got some patches accepted into GCC where you could specify the mapping on the command line, but then the binaries were embedding the command line that GCC used to build them, and so that's why we're using an environment variable. So we thought we had it, we were so close, but no. So that's kind of the last mile on one of the major issues we need to resolve. One of the other major toolchain issues is that most things that generate PDF embed all sorts of random data into it, and we haven't come up with good patches to fix those. So a surprising number of packages are reproducible except for the documentation, which is, oh well. But we'll continue to work on those.
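To make those three fixes concrete, here are hedged shell sketches; the file names and the particular git invocation are illustrative choices, not anything prescribed by the talk:

```sh
# 1. Timestamps: derive SOURCE_DATE_EPOCH from version control
#    instead of the current time (one common choice):
SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) make

# 2. Locale: the C locale and en_US.UTF-8 really do sort differently:
printf 'a\nB\nA\nb\n' | LC_ALL=C sort            # A B a b
printf 'a\nB\nA\nb\n' | LC_ALL=en_US.UTF-8 sort  # a A b B
# (In a Makefile, the file-ordering fix is wrapping globs in sort:
#  SRCS = $(sort $(wildcard src/*.c)) )

# 3. Build paths: map the build directory to a fixed string in the
#    debug info, in the spirit of the GCC work mentioned above:
gcc -g -fdebug-prefix-map="$PWD"=/usr/src/pkg -o prog prog.c
```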
So those are some of the major common issues that are more or less straightforward to understand. There are lots more. But just as some general practices, I'd like to encourage you to write your code intentionally. Make sure that everything going into your binary is something you want in your binary. So: remove unintended inputs, sort inputs, don't embed time strings, please. That can lead to verifiable build results that other people can check, and you can gain confidence that your builds are actually the result of your source code. I mean, we're dealing with free and open source software, but can you prove it? You've got the freedom to use, modify, and so on, but what if your compiler is doing something other than what you told it to do? So basically, we want to get to the point where multiple independent parties can verify a build and submit those results somewhere, and then, long term, maybe we can even integrate it into the package management system, saying: I want only builds that were verified by n builders, or by these various trusted notaries, or something along those lines. We're moving in that direction, eventually. So another important thing we do is generate a buildinfo file, which lists the various information you would need in order to reproduce the build. If you build with a different version of GCC, it might come out the same, but it might not, say if it has some new optimization or improvement. So we're not expecting reproducible builds to mean reproducible regardless of what toolchain you use. That might happen, and it's cool when it does, but for the most part we want to document the toolchain you used to build it, and a number of other factors. And we've been uploading buildinfo files to buildinfo.debian.net as part of our continual rebuild of the Debian archive for some time now. And pretty soon we might actually be able to expose those through the Debian archive itself, which is pretty exciting. And there's a specification right now; the best reference I had was how the Debian package manager implements it, with Debian buildinfo files, but at some point we would like to move that to a distribution-agnostic source. So I'll just give you a little example of what a buildinfo file looks like. Can people read that fairly well? Fairly understandable, okay. So basically, a buildinfo file has a number of name and value settings. We specify what source you're building, what version of the source, a checksum for the artifact you produced, in this case a .deb file, then what architecture you built it on, the build date, the build path (just in case you're not doing one of those fancy new GCC things; someday we might be able to just drop that part), and then the installed build dependencies. This is kind of an idealized version, if you'll notice: the checksum is a little short, and it's a very small toolchain we're documenting here, so it fits on the slide. Typically these will be a bit longer. And we also document various options, things like the level of parallelism that was used for the build. That can sometimes affect outputs, which is kind of terrifying. But hey, at least if you document it, then you can try to reproduce it with the same level of parallelism and hopefully everything goes okay. And at the very bottom is our very own environment variable, SOURCE_DATE_EPOCH. We're gradually approaching 1.5 billion seconds since 1970.
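For readers following along without the slide, here is an illustrative snippet modeled on the Debian .buildinfo format just described; the package, checksum, dates, and dependency versions are all made up:

```
Format: 1.0
Source: hello
Binary: hello
Architecture: amd64
Version: 2.10-1
Checksums-Sha256:
 3a2d... 8640 hello_2.10-1_amd64.deb
Build-Architecture: amd64
Build-Date: Thu, 16 Mar 2017 10:00:00 +0000
Build-Path: /build/hello-2.10
Installed-Build-Depends:
 gcc (= 4:6.3.0-1),
 make (= 4.1-9)
Environment:
 SOURCE_DATE_EPOCH="1489658400"
```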
So, down at the bottom, that's basically what the SOURCE_DATE_EPOCH environment variable looks like. It's a bunch of numbers. Some build tools we have: reprotest basically takes a command, and then you pass it an argument of the artifacts you expect that command to produce. What it will do is create an environment, run the build command, then change a bunch of things, build it again, and compare the build results. So it automates some of this stuff for you. It's a really useful test. A Debian-specific one, similar in nature, is debrepro. It's much simpler and varies less, but it's a good, useful, easy-to-use tool. And the tool that sometimes people get very excited about is diffoscope. Diffoscope is one of the most clever diff tools you've ever seen. If you give it, say, a Java jar file and compare it against another Java jar, it will actually realize: oh, this is a Java jar file, that's basically a zip file, why don't I unpack that for you, and then unpack the other one, and then compare the results. And it can do archives of archives of archives of archives. It basically digs down and tries to get you the most human-readable output it knows how to produce at the moment, and it's generally improving all the time. A lot of projects have seen this tool and found it really useful outside the context of reproducible builds, and that's great; love it. So diffoscope is a really awesome tool. It's written in Python. And if we still have time, I'll show you some output. try.diffoscope.org is diffoscope as a service. It's a website you can go to, submit some files, and it'll give you the output. Because diffoscope tries to handle everything under the sun, its dependency chain is large and vast. I think when I installed it on my laptop, it took up about two gigabytes of extra space in added packages. So it's provided as a service as well, if you don't want to install an extra two gigabytes of packages. But maybe you already have some of those installed on your laptop, I don't know. And there's also a trydiffoscope client, for those of you who love the command line. It uses the try.diffoscope.org service but uploads the files, and otherwise behaves just like normal diffoscope. I've mostly talked about Debian, mostly because that's where my home is, but there are numerous other projects working on a lot of this stuff. Bitcoin is one of the ones that really started doing this, for presumably obvious reasons: people run a Bitcoin binary, it's handling who knows how much virtual cash, and people want confidence that it's really the binary the Bitcoin developers intended. NixOS and GNU Guix do some really interesting things with their builds that address some of the issues of reproducibility by design, where a given package install is basically a checksum of the toolchain and the source version; I'm not sure of the extensive details. But those are really cool projects that have found me at two in the morning running a live image just to see how cool it was. Fedora and openSUSE have started getting on board with this. We had a big meeting in December with a whole bunch of projects, and I was really happy to hang out with the RPM people, which, I don't know, for some reason I was attracted to this crowd.
It's kind of like: okay, we need you guys on board, because we've got deb, we've got RPM, we'll take over the world, right? We've even got FreeBSD working on it, and I think it was NetBSD that recently started testing for reproducibility. I think we've had some work on Arch Linux, although I haven't heard too much lately. And Tails, if you've ever heard of it, I think they are, if they haven't already, on the verge of being able to produce a reproducible image. Everything I've been talking about so far has mostly been about packaging, but they're actually trying to build an image out of Debian packages and make that reproducible, which is awesome. Coreboot, Tor Browser, and there are countless other projects. I think there's even a project to produce reproducible Windows binaries. Cool. Yeah, we're really hoping to make this much broader than Debian; it already is, but don't let my primary experience color your view of how this works out. Yeah, a word from our sponsors. The Core Infrastructure Initiative has sponsored numerous reproducible builds developers, and I've recently been fortunate enough to be among that pool of people getting sponsored to do some of this work. ProfitBricks has hosted most of our Jenkins infrastructure and a lot of our build machines, and Codethink has recently donated a bunch of ARM64 machines that are really fast. It's crazy. And there are tons of awesome reproducible builds folks, some of whom are in the audience today, who have been really encouraging to me to get involved in this stuff, keep working on it, and stretch the limits of my understanding, so it's been great. Yeah, so let's see, how much time do I have? I'd like to open up the floor for some questions, and then I might want to demonstrate some of the tools. You were saying that with bit-for-bit identical output, the bar is much lower? The bar of verifying your build output. So when we're talking about reproducible builds, if it's not bit-for-bit identical, we basically don't consider it reproducible. I mean, it's great if it's mostly reproducible, that's obviously better. But you used the phrase "lowered bar", and I was surprised, because I would have thought it would be completely infeasible. Perhaps I misheard; I'm not understanding what you were talking about. Okay, all right. Yeah, there are some projects, I think like F-Droid, and one of the big challenges with RPM, where they embed cryptographic signatures to verify the results. And that's kind of a tricky issue because, well, by nature, you can't reproduce the cryptographic signatures. Or, if you can, that's a whole different world of security problems they would need to address. So some of those projects are reproducible if you extract out that part of the binary, and they have a systematic way of doing it, so that's almost reproducible. But yeah, there are some tricky things we need to sort out there. You mentioned that, was it, 95% of the Debian archive can be built reproducibly? Is the other 5% important or trivial? If I said I want a reproducible system, could my system actually boot? Right. I think some of the core systems aren't there yet. If you go to tests.reproducible-builds.org, there are some views that cover specific package sets, like GNOME or the base system or things like that. So we're almost there. Oh, sure. Okay.
All right, so apparently any TrueType font built by FontForge is not going to be reproducible at the moment, okay. Sure, okay, work in progress, yeah. What is the story right now with cross compilation, and specifically, could I get a cross compiler and a native compiler to mutually reproduce? As far as I know, we haven't put much effort into that. Debian mostly does native builds, but I'm very interested in cross compilation. That's definitely a goal on our eventual agenda. And if you're interested in helping us test that, that'd be awesome. But at the moment we haven't gone too far in that direction. There is somebody, Helmut Grohne, who has been working on automating the bootstrapping of a new architecture, for which a cross compiler is pretty essential for building the initial base system. So there is some work in that direction, but I haven't heard much about it recently. But that's definitely a goal. If you can cross-build it and native-build it and they come out the same, that's pretty awesome, and we definitely want to verify that. And with all this stuff, we're always like: once we reach a certain level, okay, how can we make this harder? What more variations can we add to this environment to test for reproducibility? And cross compilation is definitely one of them. On a somewhat tangential note, I have verified a number of architecture-independent packages that actually build the same on amd64, i386, armhf, and arm64. So we do have a number of packages where at least some parts of the package build reproducibly across architectures, which is pretty exciting. Question. Yeah, I decided to come to a lecture about which I know nothing, so I'm going to ask a stupid question. Excellent. It seems to me that a lot of the information that you're saying gets into the executable files is disposable. So why not file headers? Why hasn't anybody come up with a universal compiler specification that says: all this identifying information that's going to be different, we put it in a header somewhere, and then we just ignore that when we do a file comparison? Yeah, there are various projects working in that direction. There are so many projects out there that embed things through various macros, like the C __FILE__ and __DATE__ macros, which just embed the build path or the current time. And it's very hard to convince toolchain maintainers to change that, especially for a well-established toolchain like GCC. So what we need is to implement workarounds like SOURCE_DATE_EPOCH, the source prefix maps, those sorts of things. Okay, I understand. Because we need to get the upstream developers to accept it, and in some cases that means we need to compromise rather than just strip everything out. But if the upstream developers are amenable, by all means, those are the best patches from a reproducibility perspective. Did that answer your question? It seems like nobody has thought of it as, like, getting a standards board for making compilers, to make sure that they are all capable of actually doing that. Because, of course, I'm a developer, so we're always having to do workarounds of some sort to make sure things get accomplished today, because people are not cooperative. Yeah, I definitely think that's a good effort.
We're trying to get to a real-world reproducibility scenario before we define some sort of universal specification, though I think that is an excellent goal to strive for, and we do try to propose standards and specifications that we believe are achievable and adoptable. But that does seem like a very long-term view, which is good. Yeah. So, as a guy who builds a lot of stuff on Ubuntu, what's the minimum set of patches I need to apply to the Ubuntu toolchains to start doing this stuff there? I don't know. My impression has been that Ubuntu is basically just waiting for Debian to solve it, because then they just inherit it, which is in many ways good. I kind of wish we had a little more help on their side, but I don't know what the exact patches are. Most of our patches have been going upstream, as far as I'm aware, so it's just a matter of time, but I don't know if there are any major outstanding patches in the Debian archive that haven't already propagated to recent versions of Ubuntu. I remember back in the dark ages, when GCC wasn't a standard part of your system, that you had to download the code and build it, then build it again with the result, and then build it again and compare, to make sure that your compiler actually worked. How much of what you're doing was inspired by that state of affairs? I gather that's a little difficult to do with current GCCs, but I haven't had to build it from scratch in a while. Yeah, I don't know how much of the historical inspiration comes from that. I think the current generation of people working on reproducible builds are mostly people who didn't live through that, for better or worse. So I'm not sure I'm qualified to answer your question. Thanks. When you were showing the buildinfo before, and it had the dependency tree: the turtles-all-the-way-down problem, in terms of dependencies of dependencies of dependencies. Are there tools to help understand that with the buildinfo, or is it all rolled up when I see my installed build-depends? Right, so in Debian, what we incorporate in there is what the package specifies it needs installed, plus Debian's concept of a base build toolchain, build-essential is what Debian calls it. So it's a base set of things you need to build Debian packages, which oftentimes includes a bunch of stuff you probably don't actually need for your particular build; if you're building Python stuff, it pulls in a C compiler. But there are things that are assumed to be there if you're building a Debian package, and then plus all the other stuff. And we document the versions of those. We would love to get to the point of actually documenting the checksums of those packages, but that's a technical problem, one of many we have to solve. Did that answer it? Okay. Do people want to see some of these tools in action, or at least some examples? Okay. Unfortunately, I didn't have, off the top of my head, a good example of a package that fails to build reproducibly. I thought this one would fail, but it didn't. Is that legible? So here, I'm in U-Boot, and I've talked about U-Boot plenty, so we may as well actually build it. I'm running reprotest, telling it to run dpkg-buildpackage, and don't sign anything, because that just gets complicated, and to build the binaries and then compare the results. So here, it's showing you what it's varying.
Here, it's showing you the things it decided to vary. So it varied some things in the environment; some binaries have historically embedded the entire environment into the binary, which is pretty uninteresting. It's varying the file ordering, the home directory, the kernel, so on and so forth. So it lists a bunch of things it can vary. I believe there's a way to say don't vary this, or don't vary specific things, and there may be some optional things that it doesn't vary by default. So, let's see. As you can see at the bottom, it didn't find any differences. I should really have a better example, but I'll show you something with diffoscope later. Anyway, it basically built the package, changed a bunch of things the first time, then built it again, not changing things or changing them in a different way, and then compared the differences using diffoscope. The build was reproducible. Anybody see anything? Do people want to see some diffoscope? Which one? lsof, let's see here. Is it roughly alphabetical? No. So this is part of our test infrastructure. We have notes that document the build issues we've captured, with elaborate explanations of what they mean. And then over on the side, we have some examples of diff output, generated using diffoscope. And this one, here it's showing you the objects compared, and they've got differing checksums. It shows that there are some differences in the sizes of the files. So, obvious differences, but then it starts to unpack things. And here we have the build ID apparently varying; that's one issue. And hopefully, if the notes are right, it'll show us, ah yeah. So if you can see there, it's showing the path in one is /build/1st/lsof, blah blah blah, and in the other one it's /build/lsof, blah blah blah. So it's gone through, extracted the information out of the binary, done a binary comparison on it, and tried to find the thing that most looks like a string that might actually suggest what the issue is. Let's see if we can find another good example. Worst case scenario, it falls back to a binary hexdump, which usually isn't tremendously useful, but once in a while, if you really know the program, it might actually be useful. And then it's just going through various other packages. So let's see, I'll pick another one at random here. CLC-what? This one went straight to the diff. Excellent. So here, this one may not have been documented. A lot of binary data; not really showing you how awesome diffoscope is just yet. This one has a number of documented issues: fonts and TDS files (I think fonts were brought up earlier), and stuff generated by the TeX/DVI toolchain. So this is actually one of the ones we haven't really developed good solutions for. The issue is there. Ah, it embeds the date. Classic. No timestamps like no timestamps. So, one of the most exciting things we're doing on our builders is that we're actually building more than a year into the future. As you'll see, this build was done on March 25th, 2018. Yeah, and that way we're also proactively checking for millennium bugs. Yeah, so PDFs embed all sorts of crazy stuff. And then try.diffoscope.org, it's pretty straightforward, that's not what I typed, try.diffoscope.org. You upload two files, it compares them. I don't know that I have any great files to compare, but it'll produce some HTML-ish output, or you might even be able to get it to give you text-only output.
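The demo above boils down to roughly the following commands, sketched here under the assumption of a standard Debian source tree; the package names and paths are placeholders:

```sh
# Build a Debian package twice under varied conditions (time, locale,
# file ordering, home directory, ...) and compare the artifacts:
reprotest 'dpkg-buildpackage -us -uc -b' '../*.deb'

# Compare two builds directly; diffoscope recursively unpacks archives
# and renders the most readable diff it can:
diffoscope build1/foo_1.0_amd64.deb build2/foo_1.0_amd64.deb \
    --html foo-diff.html
```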
Any other questions? Or, I think I touched on all the tools we worked on. To follow up on the tools question: so you're saying that dpkg is fully up to date in which Debian? Is it testing? Debian unstable and testing. Testing as well? They have a dpkg that can build reproducible packages. Okay, thank you. And on a related note, testing is currently in freeze and about to release soon, so fairly soon we will have a Debian stable release that will actually be able to build some reproducible packages. So, have I understood correctly that dpkg is going to be able to supply this buildinfo file from a command line query? That's a really good question, if I understand it. I'd just like to be able to say dpkg --buildinfo some-package and have it spit that stuff out. Right, so dpkg by default in unstable and testing produces buildinfo files now. We don't yet have the tooling to just take a buildinfo file and then try to reproduce the build. It would just be nice if our build system could read that from upstream packages and then try to use it to build our system. Yeah, currently the Debian archive is not exposing the buildinfo files. It's storing them away for some future date, at which point we will be able to expose them alongside the actual debs we're distributing. But we also have buildinfo.debian.net, where we're collecting a bunch of buildinfo files; you could wget one from there based on the package name. At this point it's still in a proof-of-concept phase, but we're tidying that up, and it's improving as we go. So, we build all our packages with Clang and GCC. It would be interesting to automate some kind of test and see how close the binaries come out, because right now we're just running different tests with them; we're not comparing the binaries to see how close they are. That would be really interesting. That would be really awesome. That's definitely on our long to-do list, to do things like that. The last question made me wonder: what's keeping you from just putting the buildinfo into the control file of the package? If you embed it in the control file, suddenly the buildinfo is varying the very build it describes. My head's exploding. So it needs to be a separate file that you can cross-reference against the other artifacts in the upload. Yeah, it has often been suggested, and at one point we discussed whether buildinfo files should be bit-for-bit identical as well, and that just turned out to be really impractical, and also undesirable in a lot of ways. We actually would love to see: hey, you built it with two different versions of GCC and it still came out reproducible. That's really awesome information; we want to know. We may not have the human power to process all that information, but at some point somebody might make an automation project out of going through all these buildinfo files and doing comparisons like that. So, yeah. Well, if there are no further questions, I guess we'll wrap the session up. Yeah. Thank you. Test one, two. Okay, we're going to get started. Our next talk is essential web security. Our speaker is Justin Mayer, from monetorial.com. They are the makers of a security monitoring SaaS product. Justin's talk is going to cover automated TLS certificate provisioning, Content Security Policy, and a host of other web security enhancements. So without further ado, I give you Justin. You're too kind.
So, last year I gave a talk on essential Linux security, and part of that talk focused a little bit, or significantly, on web security. So what I wanted to do this time around was take that and expand it a bit, because it's really deserving of its own full treatment. There'll be a little repetition if you were there last year, and if you were at AppSec this year there'll also be some considerable repetition, but hopefully everyone will find something. This is going to be an overview of a bunch of different topics. Just a little bit about me: as my introducer so kindly mentioned, I'm the founder of monetorial.com. We are a software-as-a-service solution for security monitoring. In the little spare time I have, I'm the maintainer of the Pelican static site generator. And privacy is a thing that I care about, and I will be having little non sequiturs here and there throughout the talk that allude to it. So if it seems like some things are a little all over the place, that would be why. So, speaking of why, I think the first thing I'd like to cover when I talk about web security is: why are we here? Why is this a thing? Why are we talking about web security in particular? I'm a dinosaur, so I actually remember using this browser. Vividly; I have very, very fond memories. And those were much, much simpler times. But the idea is that the web itself is a series of layers. You have your web server environment, you have your transport network, you have all the assets running in the browser: HTML, CSS, JavaScript. We're all aware of the various pieces, but it's a considerable number of pieces. And the web was popularized approximately 1994-ish, depending on when you came to it yourself, and a lot of crust has accumulated in that time. It's one layer of standards after another, one layer of different browser implementations, and that all results in a lot of opportunities for folks who want to do not-so-great things. So this slide used to be a series of logos for Heartbleed and FREAK and GHOST and POODLE and Shellshock and all the other cute little logos and mascots that they come up with for their little security vulnerabilities. But now all I have is this; I'm just using this now. And anytime you feel compelled, not that any of you in this room would ever do this, because you seem like smart people and you wouldn't, but anytime someone that perhaps you work for or with decides: hey, you know what we really need? We really need a guy in a hoodie with little Matrix stuff coming down from the top. Why? Why is it always a hoodie? I don't understand. So if you're going to use a hoodie, go with the cat. I actually saw one of these that had a guy with a crowbar, with a hoodie of course, trying to pry open a notebook. It didn't make any sense, but that doesn't seem to stop the marketing folks. Sense is not something they seem too concerned about. I think that when you look at security in general, and since we're talking about web security in particular, it's worth thinking about the stakes, because most of us, when we're working on our web applications, don't really think of them as being super important. Some of us work on things that are important, in the sense of being critical, and some of us don't.
But I think it's important to think about the overall stakes that are possible. I was reading an article about some things that were going on in Ukraine. They, as you know, have a bit of a turf battle with their neighbors. And one of the things they do is use an Android app to help with targeting their mortar shells. There were upwards of 9,000 Ukrainian artillery soldiers using this Android application to target their artillery. And unfortunately, what a lot of them didn't realize when they downloaded this app is that the app they actually got was a Trojan app that the Russians had put out there. And the Russians were using it to geolocate where these people were, so that they, in turn, could focus their own artillery on those artillery units. So, obviously, most of us are not doing things that involve that level of life-or-death situation. But I think it is important to think about what is possible, because we don't really think about people dying when we work on our code. But it is actually happening. And to me it is incredible that in this day and age, whether or not someone is going to die as a result of some software bug is something we potentially have to think about. So there's a tendency, I think, for folks to say: well, it's just a web application. I'm not developing something that Ukrainian artillery soldiers are going to be using for targeting. But there are other good reasons to lock things down. There's the privacy and security of the users who use something that you might be building or interacting with. Just to bring up one particular thing that has come up more recently: passwords over HTTP. Starting with Firefox 51, and very recently Chrome 56, there will now be warnings anytime you try to serve a form that collects a password over a non-secured connection. And this is a really important and exciting step for those of us in this field, so I was really pleased to see it. Just one of my little sidebars here, pardon me. Speaking of passwords, I don't know what it is about password-paste blocking. This is something that web developers apparently do. I'm sure no one in this room has ever done this; if you have, please don't ever do it again. I was on a Southwest flight not too long ago, trying to pay for the Wi-Fi, and I went to log into my Southwest account and tried to paste my password from my password manager. Nope, wouldn't let me. And, you know, to me it's like: et tu, Southwest? You're my company. You can do better than this. And on a related side note, if you use Firefox, you can go into about:config, copy and paste this setting, and it should mitigate the password-paste-blocking silliness. So, speaking of passwords, I wanted to have another little related sidebar on the idea of not having them. One of the things we do at Monetorial is we don't have any passwords at all. I'm sure some of you have seen this type of authentication before: you enter your email address, it sends you a link, you tap on it, you're logged in. That's how we handle logins for Monetorial, and it's something that is fairly easy to implement.
Most people will say, well, you know, I've heard certain arguments against this, and without going into a lot of what those are, what I will inevitably point out is: okay, on your site, do you have a password reset function? Yes? Okay, well then you're already doing this. You're just doing it along with a password that someone is inevitably going to forget, which is going to create a lot of support requests because they can't find the password reset feature. We're just cutting out the password part and essentially doing the reset. So that's something you can look into. If you happen to be a Python person or use Django, there's an interesting app called django-nopassword that will let you implement this fairly easily. Another option comes from Dan Callahan, who works at Mozilla. You may have heard of Persona, which was something Mozilla was pushing a couple of years ago. They couldn't quite get the adoption they wanted, so Dan decided to take some of those basic concepts and carve them out into a project unrelated to Mozilla, called Portier. Portier, Portier, I'm not really sure how to pronounce it, but you should check it out, because it might help simplify handling logins, again without passwords. So, getting back on track: there are many reasons for wanting to look out for web security that don't relate to people living or dying. There's a search engine ranking benefit, as you may know, to serving over TLS, not to mention, again, the idea that you're looking out for the data integrity of your users. So when I think about data integrity, one of the things I think about is: what data is most important to me? Specifically, the data that's hosted outside of my control. And when I try to think of what that data is, it's my banks, my health insurance company, and the place I shop at most frequently. So I went through each of my four financial institutions, my health insurance company, and a certain unnamed Amazon e-commerce company, and I tried to analyze them, very back-of-the-envelope, just looking at basic security headers: things you can add that don't take a lot of time, aren't super hard to implement, and, as far as I know, have very few negative implications. And I've done this before, so in preparation for my talk I said: okay, I'm going to go take another look, because I'm sure it's changed; at least a year has gone by since the last time I looked. Nope, still terrible. And it's the sort of thing where these are banks, this is my health insurance, my medical data. So if you have anything to do with serving information via the web, try to make it your personal mission to do better than this, yes. Yes, and this is by no means authoritative, right? These are my grades. But essentially, what this boils down to is things like Strict-Transport-Security. Let's see, I'm trying to think of some of the other things I looked at. I wouldn't ding someone for not implementing public key pinning, but there are a few other things, as I'll get into, like some of the TLS-related topics.
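For the curious, the kind of back-of-the-envelope header check described above can be sketched with curl; the domain is a placeholder, and the nginx line in the comment is one common way to add the header, not the speaker's specific setup:

```sh
# Inspect the basic security headers a site sends:
curl -sI https://example.com | grep -iE \
    'strict-transport-security|content-security-policy|x-frame-options'

# The corresponding one-liner in an nginx server block (Apache and
# others have equivalents):
#   add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
```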
And a lot of these folks are missing the easy ones. The harder ones I wouldn't give someone such a hard time about, but when it's literally a one-line header that you can add to your web server config, I feel like, if you're a Fortune 100 company, that's something you could probably do. So, yes, I will get into those. Again, this is just sort of where I'm gonna make my plea, excuse me: take a look around and see if there's something you can do to make it a little bit better. Someone once said that having a site that's unprotected with TLS is like saying, I don't care about the privacy and integrity of my users. And I'm starting to feel the same way about basic TLS hardening, which is within the reach of nearly any organization or individual. So in any case, think about your friends, your family, the people you care about when you're working on stuff. Fight for the user. All right, so, okay, Justin, thanks for the pep talk. Let's talk a little bit about TLS cert provisioning. I don't know when the last time was that any of you had to provision a TLS cert manually. Thankfully for me, it's been about a year or so, but it's a super menial task. It involves going onto some site and putting in your credit card information, and then they email you some kind of challenge, and then you have to be like, okay, well, it went to administrator@domain.org, and actually, I don't even have an alias set up for that, so you have to go create one. The whole thing is just a mess: the copy-pasting, the web server side of things. As programmers, we should hate anything that menial, and yet it has persisted for decades. So thankfully, Let's Encrypt has thoroughly solved this problem, as I'm sure many of you, if not all of you at this point, are well aware. And they've handled it really, really well. It is glorious to set up an automated process where certificates are not only provisioned but automatically renewed without you doing anything at all. It is really a thing of beauty. There are a few different authentication methods. You can use Apache or Nginx plugins. There's a standalone mode. You can use a file that will appear in your web root. There's a much more manual process if you're trying to provision the cert without involving your web server in any way. And the most recent entry is a DNS-based challenge. I was really excited about this, because there are times when none of the other ones are a great fit, depending on what you're doing. Unfortunately, in my experience, the DNS one required that you change the DNS entry for each renewal. So I'm like, well, if I have to go in there and add a new DNS entry every three months, or somehow automate this DNS entry via an API, that's not really a win for me personally. Maybe I misunderstood how it's actually implemented, and if so, feel free to raise your hand at some point and correct me later. But I think for most people, either the standalone or webroot method will tend to work pretty well. The tool that's used for this is called Certbot. If you haven't used it, it's super easy to install: you can pip install it, or you can just clone the source and run certbot-auto. It's really easy.
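To give a flavor of just how un-menial this is now, here's a minimal sketch; the domain and webroot path are placeholders, and the exact flags you want will depend on your setup:

```
# Webroot method: Certbot drops a challenge file under
# /.well-known/acme-challenge/ in a directory the web server already serves.
sudo certbot certonly --webroot -w /var/www/example -d example.com

# Standalone method: Certbot binds port 80 itself, so stop any running
# web server first.
sudo certbot certonly --standalone -d example.com

# Renewal is a single idempotent command, which makes it trivial to cron.
sudo certbot renew --quiet
```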
By the way, I'll have all of these slides available, so don't feel like you need to take any photos or remember everything; I'm gonna kind of whiz through a lot of them, and I'll make sure you have all of this. The provisioning step is also very easy, with lots of useful command line flags to provision the cert. On a related note, I don't know if you've ever used Caddy. It's a Go-based web server. I recently tested it on a hobby project, and it's impressively simple. If you've ever wrangled with Apache or Nginx, they're super powerful tools, but they also have lots of knobs and levers that you don't always need, particularly for something that's just very basic, and I found that Caddy is a useful thing for that. So that's something you might wanna check out. The reason I mention it is that it has automated Let's Encrypt certificate provisioning to the point where you don't even have to do anything; when you add a site to your Caddy configuration, it just assumes that you want TLS and does it. You actually have to opt out of TLS with Caddy, which is awesome (there's a Caddyfile sketch right after this). So I originally considered talking about the differences between the various clients you can use with Let's Encrypt. Then I realized the magnitude of that task: last time I checked, there were over 50 clients listed on the Let's Encrypt site alone, and I'm pretty sure there are even more out in the wild. So I decided to skip that entirely. The other thing with Let's Encrypt that many people don't realize is that it's also free for people who want to misuse it. The menial nature of certificate provisioning and the nominal $10 to $50 cost also served a purpose: they kept some of the ne'er-do-wells from doing things that we would all rather they didn't. This makes it a little easier for them to do not-so-nice things. I'm speaking at the moment specifically of phishing sites. This can facilitate phishing attacks by letting you create a long, convoluted domain that somewhere in there includes, say, PayPal. So if someone sees the little green lock, or however it appears in the browser, and they see the word PayPal, they think they're in the right place and that everything's secure, when in fact it's some attacker who's looking to empty their PayPal account. And the thing people don't realize is that you can stand up a site like this and make a good amount of money doing it in only about an hour. There are some other tools that will help fight against this: I'll talk about Must-Staple at some point, which allows for near real-time revocation, and that will help fight this downside. The other thing that's important about TLS is that it not only improves your security, it also improves speed, and that's something a lot of people don't fully appreciate or realize. Moreover, in addition to the speed aspect, it provides an improvement in privacy by making TLS fingerprinting a little bit more difficult. I'm speaking specifically about HTTP/2 here; that's really what I'm referring to: there's a speed increase associated with it, and it makes TLS fingerprinting more challenging. Dirk Wetter, who writes a very cool open source tool that I'll mention later, gave a talk on this subject where he went into depth as to the types of privacy enhancements that HTTP/2 can offer.
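Coming back to Caddy for a second, here's roughly what a complete Caddyfile of that era looks like; the domain and paths are placeholders. Notice there's no TLS stanza at all: naming a real hostname is what opts you in.

```
# Caddyfile: certificates are obtained from Let's Encrypt and renewed
# automatically just because a real hostname is configured.
example.com {
    root /var/www/example
    gzip
    log /var/log/caddy/access.log
}
```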
Quick show of hands: who here uses some kind of configuration management to manage that? That could be Ansible, Puppet, Chef, that sort of thing. Okay, yeah, a good number of people, that's fantastic. It's very handy to use tools like that, particularly when you're dealing with server security, because otherwise what happens is you manually SSH into that thing, you make some change there, and you forget to do it in some other place. This really helps you centralize, and it also helps you keep things versioned. So you know: oh yeah, I used to have this set of ciphers and then I changed it. Why did I change it? Oh, that's right, it's right there in the commit message from when I committed this configuration code; I changed it for this reason. And speaking of cipher selections, there are a couple of useful groups of cipher selections that Mozilla offers; they have a site that talks about server-side TLS, and they have three major groupings. These groupings are a trade-off between better security and backwards compatibility with older browsers. At the top we have modern, which is Firefox 27 and greater, Safari 9 and greater, et cetera, et cetera. These are the more modern ones, and if you don't need to support IE 9 on Windows 7, that will probably do you just fine. But if you do, then intermediate, the next step down, which supports everything greater than Firefox 1, Safari 1, Chrome 1, and IE 7, is probably a better fit. Then there's old, and if you have to support old, I feel really sorry for you, I really do. That's terrible; no one should have to do that, because you're dealing with browser standards so old that you don't get to use anything fun. So if that's you, you have my condolences. But these are the different profiles you can use to gauge how you want to choose your ciphers. As it relates to forward secrecy: that's where the client and the server negotiate an ephemeral key that never hits the wire and is destroyed at the end of the session, preventing an attacker from decrypting past communications. The recommendation for a while was to generate your own DH primes. That advice has now been somewhat deprecated, and the consensus at the moment seems to be to choose from the predefined groups, because those are regularly audited by folks the industry seems to trust. You can write a script, if you like, to periodically compare your cipher selections with the ones provided in this feed that Mozilla offers, and this is actually really handy. The reason I mention it is that this stuff changes much faster than you think. I am always surprised: I'll be preparing a talk, so I'll check it and say, hmm, when was the last time I ran this against this particular system? It doesn't seem like it was that long ago, and sure enough, it has changed by a significant amount. I'm not talking about one cipher removed or another one added; a third of the ciphers have been either removed or added. It's amazing how fast this stuff changes, so that's something you might wanna consider as a security best practice, so that you don't have to remember to go check those things.
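Here's a rough sketch of that kind of periodic check. To be clear, the feed URL and the JSON layout below are assumptions on my part; check Mozilla's server-side TLS page for wherever the machine-readable guidelines currently live.

```
#!/bin/sh
# Assumed location of Mozilla's machine-readable TLS guidelines.
FEED="https://statics.tls.security.mozilla.org/server-side-tls-conf.json"

# Cipher string Mozilla currently recommends for the intermediate profile
# (assumes jq is installed and the feed keeps this shape).
curl -s "$FEED" \
  | jq -r '.configurations.intermediate.ciphersuites | join(":")' \
  > /tmp/recommended-ciphers

# Compare against the cipher string you actually deploy (kept in its own
# file here purely to make the diff easy).
diff /tmp/recommended-ciphers /etc/nginx/snippets/ciphers.txt \
  || echo "Cipher recommendations have drifted: time to review."
```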
So, speaking of one of the items that these financial institutions and other companies didn't do very well: Strict Transport Security tells browsers, we only do TLS here, we don't do non-TLS connections. The examples I'll show here, and I apologize if they're small or too faint to read, are for Nginx, since that's the web server configuration I'm most familiar with. Adding a Strict-Transport-Security header is very straightforward. You simply choose how long you want it to be valid, how long you want the browser to remember it, and whether or not you want subdomains included, and you're off to the races. You can also have this preloaded in browsers, so the browser doesn't have to actually fetch the header in order for it to be recognized. That helps mitigate attacks that use redirects over HTTP, so it's a good idea if you're so inclined. You can go to this site and actually register your site to be included; it'll take some time, but eventually it will be bundled along with other sites when they're added to Chrome and Firefox and other browsers. In the interest of time I'm not going to go into the details of some of these other headers, but they're also examples of easy things you can add. They're one-line additions, each is separate, and they add a significant amount of security. Another easy win shown here: making sure that the SSL/TLS protocol versions you accept don't include things that are already considered insecure. It used to be, okay, it's time to kick SSLv2 off the list, and then it was time to kick SSLv3 off the list. At some point it'll be time for TLSv1 to be booted. These things change, and it's good to keep them in mind. OCSP stapling I'm going to go over probably the quickest, mainly because I don't like it very much, but I at least thought it was worthy of mention. The main purpose of OCSP is to deal with fraudulently issued certificates. You may recall the DigiNotar disaster, where this CA, certificate authority, was compromised and certificates were issued that should not have been. OCSP is supposed to help resolve that, and this is the way you enable it. It's fairly easy to do; like I said, it's just a few configuration lines in your web server config. The problem is that browser OCSP revocation checks fail about 15% of the time, and even when they don't fail, they take about 350 milliseconds, which is a long amount of latency to add to the round trip. As a result, browser vendors unsurprisingly soft-fail to account for that 15% failure rate, and some folks, I would argue correctly, have put forth that this renders OCSP basically useless. OCSP Must-Staple is supposed to help rectify that. In many ways it does, and you can look up and learn a little more about how exactly it works, but in theory it can achieve near real-time revocation and eliminate a lot of the flaws that OCSP has. That said, it has its own issues. I listened to a talk given by one of the Google Chrome security engineers about a recent initiative of theirs called Expect-Staple; the idea is that it's supposed to help you transition to a Must-Staple world by letting you do some reporting and understand a little more about the OCSP response cycle.
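Since the slide contents aren't reproduced here, this is roughly what those one-liners look like in an Nginx server block. Treat it as a sketch: the max-age value, the protocol list, and the file paths are illustrative choices, not a recommendation.

```
# HSTS: browsers remember to speak only TLS to us (two years, subdomains
# included, eligible for the preload list).
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;

# Two more of the easy one-line headers in the same spirit.
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "SAMEORIGIN" always;

# Keep protocol versions that are already considered insecure off the list.
ssl_protocols TLSv1.1 TLSv1.2;

# OCSP stapling: the server fetches the OCSP response and staples it to the
# handshake, so the browser doesn't have to ask the CA itself.
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/nginx/chain.pem;  # CA chain used to verify the staple
resolver 8.8.8.8 valid=300s;
```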
So that's something else you could look into if you're interested in implementing OCSP stapling. Content Security Policy I have issues with too, but far fewer. The threat model here is cross-site scripting and other code injection attacks. The thing I'll mention up front, if you've never used it: CSP is hard, and the reason is that you're having to whitelist essentially anything that is loaded in the browser. Scripts, CSS, images, fonts: you essentially have to whitelist all of them, and if you have forgotten something, it's not gonna load and your site's gonna be busted. That also goes as far as inline scripts. So if you have any inline scripts or inline CSS, and you're like, well, it's just one line, there's no point moving this into a separate file, I'm just gonna inline it into the HTML, which is a completely reasonable thing to want to do: you can't. If you do that, that thing will not load. That's the point: to move everything into external files. And if you have a lot of inline stuff, that's gonna get old really fast. A colleague of mine once said that for a site of any complexity at all, you need to build for CSP from the very beginning, and to some extent I agree. If you have a site with a lot of these tags everywhere and a lot of inline stuff, it can be daunting to try to move all of it to external files and make sure you've whitelisted every possible external domain. You know: oh, I forgot Cloudflare. Well, great, now all my externally loaded Cloudflare assets aren't loading and the site's busted. But I think for a lot of folks, if you're starting from scratch, if you're building a new project tomorrow, it's fantastic, because what you're really doing is building compile-time-style security checking into your process. You add something and, why isn't this loading? Oh, I forgot to add a CSP whitelist entry for that. Okay, so you add it. If you continue this cycle from the beginning all the way through, you won't have missed anything, because right at the start you'll know: why isn't the feature I just added appearing? Well, you'll know why. And that's an important part, I think, of deciding when it's a good fit for you. The configuration syntax is pretty easy. You simply list which scripts, which images, which fonts are allowed; you can say anything coming from the origin should be allowed. There's a decent amount of flexibility, but there are some things that are not quite so flexible. If you use things that dynamically inject CSS or JavaScript into your pages, like Adobe Typekit for example, this will not work very well. This is a problem we ran into really early, and if you go to Adobe's site, they'll basically say, yeah, don't use CSP. So you have to decide: what's more important, do I want to use CSP, or do I want to use Adobe's product? You can use unsafe-inline to sort of route around that and say, okay, for this particular asset class it's okay to inline these things. But the problem is, it's a real blunt instrument. You can't apply it to a single domain. You can't just say, okay, for Typekit.org or whatever their domain is, unsafe-inline is fine, but everything else shouldn't be. It would be really nice if CSP were more flexible in that regard. They're currently on level two, and I've looked through the CSP level three drafts.
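As a rough illustration of that syntax, here's a hedged starting point; the CDN hostname is a placeholder, and any real policy would be tuned to what the site actually loads.

```
# Everything defaults to the page's own origin; scripts may additionally
# come from one hypothetical CDN. Anything not listed, including inline
# scripts and styles, is refused.
add_header Content-Security-Policy "default-src 'self'; script-src 'self' https://cdn.example.com; style-src 'self'; img-src 'self' data:" always;

# The Report-Only variant is handy while retrofitting an existing site:
# violations are reported to the endpoint but nothing is blocked yet.
add_header Content-Security-Policy-Report-Only "default-src 'self'; report-uri /csp-report" always;
```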
I didn't see anything there that would help eliminate this particular downside, but who knows, maybe that will change in time. So for me, if this is an issue, the best solution is: don't use services that dynamically inject CSS or JavaScript into your site. There are some really useful tools for drafting, validating, and reporting on the various Content Security Policy types. There's a whole bunch of them that I've put up here, so you can check these out and see if they work for you. Subresource Integrity helps prevent a compromised CDN, content delivery network, from sending malicious JavaScript to your users. Generally you trust the JavaScript that you're serving, but if it's coming from some external source, you can't always have that same level of trust, and it's useful to apply some kind of test against those resources. Just as a way of understanding this: if you already run Google Analytics on your pages, you probably don't care that you're loading yet another black-box JS file from Google. But for sites that load absolutely nothing from external sources, you should understand the privacy and security implications of doing this. And sort of my pet peeve related to this sort of thing is AMP. I don't know if you're familiar with AMP, but it's a Google project that's supposed to help serve mobile versions of web content. The problem with AMP is that all pages have to load a script from Google's CDN. You have to; otherwise your page won't work as AMP. Like I said before, if you're fine with Google Analytics running on your pages, you probably don't care about this. But if you do, then don't use AMP; that would be my solution. Just to be on my little soapbox: I'm not a big fan. Here's a little Tweetbot filter, if you like, that will help you filter out any tweet that contains an AMP link. It's my personal voting-with-my-feet thing, and I'm not just being petty: there are already reports of hackers hiding phishing links in AMP URLs. So this is an actual, real thing; it's not just me being silly and pedantic. As far as implementing Subresource Integrity goes, it's really easy. You can download the JS file yourself; actually, yeah, sorry, I forgot I was using wget here: you just wget the full URL path to the JS file, pipe it into a couple of OpenSSL commands, and it will spit out a hash. The way you use that hash is by tagging the resource in the script tag: right after the src-equals-URL part, you add a portion that says integrity="sha384-" followed by your hash. When the browser loads that particular asset, it hashes it, and if it comes up with a different value, it will not load it. The problem is that there's a good chance you're gonna miss something in that process, right? Just like CSP, it's quite conceivable that you will miss one of these external CDNs that you might be using to load particular assets, and there's a directive that ensures you don't miss anything: require-sri-for has been supported since Firefox 49, and it just went live in Chrome, I think about a month ago, when version 56 was released. The implementation of require-sri-for is pretty easy: you just decide whether you want it to apply to scripts, to styles, or to both JavaScript and CSS. Similar to some of the other things, I'm gonna gloss over public key pinning, mainly, again, because I don't care for it.
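Putting that together, a minimal sketch with a hypothetical CDN URL; the digest shown is a truncated stand-in for whatever the commands actually print.

```
# Fetch the exact asset you intend to reference, then hash it.
wget -q https://cdn.example.com/lib.js
openssl dgst -sha384 -binary lib.js | openssl base64 -A
# prints the base64 SHA-384 digest, e.g. oqVuAfXRKa...
```

The digest then goes into the script tag; the crossorigin attribute is needed on cross-origin assets so the browser is permitted to verify the hash.

```
<script src="https://cdn.example.com/lib.js"
        integrity="sha384-oqVuAfXRKa..."
        crossorigin="anonymous"></script>
```

And the directive that makes SRI mandatory, so a forgotten asset fails loudly instead of silently, is a CSP line along these lines:

```
Content-Security-Policy: require-sri-for script style;
```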
The threat model here is also compromised or rogue certificate authorities, the DigiNotar example again being front of mind. It protects against certificate authority breaches, where someone could impersonate your supposedly secure site: a compromised root key can be used to generate a certificate for any domain. The implementation, sorry for the wall of code, is fairly easy: you run a bunch of OpenSSL commands, you take hashes of your keys, you store them and back them up, and then you add the two hashes as a header called Public-Key-Pins in your web server configuration. One of the issues with public key pinning is that it doesn't fit all that well into a Let's Encrypt world. Let's Encrypt regenerates keys on each renewal; it's not just the certificates that it regenerates. That means the hashing you just did in the prior step now has to be done again, so you have to somehow automate the generation of these hashes and the backup of the keys, because if you lose these hashes, or if you lose the keys, I should say, you can be in real big trouble. There have been several high-profile sites that have gone down for several days, not hours, days, due to public key pinning misconfiguration. So it has sharp edges, it should be handled with care, and you should fully understand what you're getting from it and what you're risking by using it. Certificate Transparency is sort of another way of looking at this problem and saying, okay, well, the goal is worthwhile, but we (as in whoever it was that came up with Certificate Transparency, and by that I mean Google) don't really feel like this other thing works very well. The idea here is that, again, you're defending against forged certificates. Certificates are submitted to public logs, and the response you get back is called the Signed Certificate Timestamp, the SCT. Certificate Transparency is gonna be required by Chrome and Chromium, presumably in October of 2017. I shouldn't say allegedly, because that's what Google has announced, but it'll be interesting to see whether they back off of that date, only because a lot of the tooling is just not in place, as I'll mention in a moment. There are several different SCT delivery methods. It can be done via the OCSP stapling request/response cycle. It can be done via a TLS extension; there's a separately compiled Nginx module called nginx-ct that can be used to serve the SCTs. And the other method is embedding it as an X.509v3 certificate extension. The latter is the approach I find the most favorable, and thankfully Let's Encrypt seems to agree, because they have already talked about supporting it by embedding the SCTs directly in certs well before October arrives. The link here is truncated a bit, but from this slide you can get to the issue where they talk about implementing this in that timeframe. It's gonna be interesting to see whether or not it happens: it's already March and Let's Encrypt doesn't have it yet, so I sort of doubt that everyone's gonna get all their ducks in a row, and I think the date is gonna have to be pushed back, but we'll see.
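For reference, the "wall of code" from the pinning slide boils down to something like this; the key filenames are placeholders, it assumes RSA keys, and the two base64 digests the commands print are what go into the header.

```
# Hash the public key from the live private key...
openssl rsa -in live.key -pubout -outform der | openssl dgst -sha256 -binary | openssl base64 -A

# ...and from a backup key that is generated and stored offline.
openssl rsa -in backup.key -pubout -outform der | openssl dgst -sha256 -binary | openssl base64 -A
```

Those digests then go into the Public-Key-Pins header, in Nginx syntax again:

```
add_header Public-Key-Pins 'pin-sha256="<live-key-digest>"; pin-sha256="<backup-key-digest>"; max-age=5184000; includeSubDomains' always;
```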
Given that there's a lot of focus on open source here this weekend, as there should be, I wanted to talk a little bit about some of the open source tooling. Let's focus on the first three for the moment; I'll get to the last one in a moment. The first three are really useful open source tools for taking a look at a TLS configuration and telling you a little bit about what that particular tool thinks you could be doing better. The first one, testssl.sh, is obviously a shell script, but it's a very well-written shell script by the person I mentioned before, Dirk Wetter, who's very well regarded in the field, and it's a really useful tool. You get some nice colorized output where you can see things in green, green, green, yellow, red-address-this-right-now. So it's something you should check out. SSLyze, actually both SSLyze and the other one that I can't pronounce, are tools written in Python. They perform similar functions, but they're a little bit different in terms of scope. The third one, pshtt, "pushed" I guess, or however that's pronounced, was done as part of work for the United States government: they were trying to ensure that the .gov sites were as secure as possible, and that's how that one came about. The last one I mentioned is badssl.com. I think this is a Google project; jeez, Google, whom I've mentioned many more times in this talk than I ever intended. Anyway, badssl is useful because they have a bunch of subdomains on that domain; you can go there to see a list of them. For example, there's mozilla-old.badssl.com. Remember I mentioned the three different cipher profiles, modern, intermediate, and old? Well, that subdomain is set up like the old one. So you can run, say, testssl.sh against mozilla-old.badssl.com and see an example of the different cipher warnings you will inevitably see as a result. You might want to check these out; I think you'll find them useful. So, just to talk a little bit about what's on the horizon: TLS 1.2 was released eight years ago, which is eons in the web world in general and web security in particular. So there's a lot in the upcoming release, TLS 1.3, given that it's been eight years. Just to highlight: in addition to increasing speed by reducing some network round trips, they also made things simpler. They pulled out a bunch of old protocol features, things that just weren't relevant anymore, and being simpler makes it more secure. CAA is probably the newest thing I've talked about today; it stands for Certification Authority Authorization. Historically, any certificate authority is allowed to issue TLS certificates for any domain, and in retrospect this seems just unwise. I think it probably made sense in the beginning, when we had two or three certificate authorities, but today it seems increasingly silly, when I know, for example, that there is no certificate authority on the planet that should be allowed to issue certificates for anything I do other than Let's Encrypt. So if you're in a similar situation, it would be useful to be able to say: hey, I don't want any other certificate authority to be able to do that, because if they get compromised, they can issue a cert in my name, and that's bad for me. That's what this DNS record is designed to do.
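For what it's worth, once your DNS host does support it, a CAA record is a one-liner in standard zone-file syntax; the domain and contact address here are placeholders.

```
; Only Let's Encrypt may issue certificates for this domain; violations
; get reported to the iodef contact.
example.com.  IN  CAA  0 issue "letsencrypt.org"
example.com.  IN  CAA  0 iodef "mailto:security@example.com"
```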
And it's something that is not really available yet in any widespread form. It's gonna take some time for certificate authorities and for wherever you make your DNS entries; it's not even an option for a lot of folks. For example, if you're using Namecheap to register your domains and you also happen to use them for DNS, or say you're using Linode for your hosting and you use their DNS manager for your DNS management, there's no way to enter these records at this time. I keep bugging them, like, hey, come on guys, this is a cool thing, this is really gonna help the web, and so far, crickets. But hopefully that'll change soon; something to look forward to. A couple of random asides. There's an individual who tweets at the @pinboard account that you may be familiar with, and one of his things is scrubbing your logs: looking out for your users' data not just on the website itself, but also in the stuff that just happens to get collected on the side. In a world in which people are increasingly twitchy about how information can be exfiltrated, not just illegally but also via quasi-legal channels, the most effective way to protect against some secret FISA court order, one might argue, is to not collect the data in the first place, or at least not have it sit around. So that's something I would encourage everyone to do if you can: look and see what kind of data you're collecting, and try to get rid of it if you can. For my last digression, I wanna mention two tools that you might wanna look into, because they're useful on the web consumption side of security. I've been talking mostly about the web server side of things, but it's also useful, from an operational security standpoint, to protect what it is that you're doing. You generally have the keys to the castle on your notebooks, on your workstations, and to the extent that you can protect those, you're also protecting your infrastructure as well. So, the first is a VPN. Most people, when they think VPNs, think of some provider in the Netherlands somewhere, and there's no shortage of articles out there as to why you should never use those VPNs, and why you're essentially putting a lot of trust into an entity that really has not earned it. This tool called Algo lets you use, say, DigitalOcean or Linode, and for $5 a month you can run your own VPN server, so you get a lot of the protections afforded to you without the downside of trusting some random company that really doesn't have your best interests in mind. It's a set of Ansible scripts, by the way; that's what Algo is. You just kind of point it at your instance and it'll set it all up for you. It has really cool touches, like being able to download a provisioning certificate and just tap it, say, on an iPhone, and it will automatically set up all of the VPN connection information on the phone. They've done a really good job of making it both secure and convenient, which is, as you know, a rare combination. The second is a drum I've been beating for a while: people tend to really ignore the DNS side of the equation as it relates to security.
You may have experienced this yourself: you type something in your address bar, you hit return, and the next thing you know, you're looking at some bizarre page that looks like it's coming from Time Warner or Comcast, covered in ads, because your ISP is hijacking the lookup: hey, this thing didn't resolve, here's an ad page instead. This is what happens when the DNS providers you're using aren't really looking out for your best interests. And so you may think, okay, well, I could just use Google; they have 8.8.8.8 and whatnot. But that's not to say that Google is necessarily using your data only in ways you want or authorize. In any case, you can encrypt your DNS lookups and get around this entirely, and when you do that, you can choose resolvers on the other end that at least make a promise not to log those requests, not to sell the data to advertisers, and not to use it for purposes you never intended or authorized. So take a look at this tool; it's called DNSCrypt. The individual who wrote it is, again, a widely respected person in the field, and it's fairly easy to implement. I had a pull request merged fairly recently where I made some changes to how it's packaged on Homebrew, if you happen to be a macOS user, so you can use a configuration file instead of having to put all of your options on the command line invocation. It's a useful thing, and once you have it set up, you can go to dnsleaktest.com and tell whether or not it's working. And by all means, go there now, before you set this up, and see who is actually being exposed to your DNS requests; you can do a before-and-after comparison. I found it very enlightening. So, all the digressions and sidebars aside, thank you very much, and I'd be happy to entertain any questions you might have. You mentioned earlier Caddy and how it automatically integrates with Let's Encrypt. Is it possible to run that as a reverse proxy, so you could still maintain a web server that has all the other bells and whistles you may need for whatever reason? When someone uses the mic, should I repeat the question? Okay, just checking. So, yes, as far as I'm aware, you can run Caddy on any port that you choose. So you could run it on a non-privileged port and then use Nginx to reverse proxy to it. In that particular case, what I would probably do is not have Caddy do the TLS termination; if it were me, I would feel more comfortable having Nginx terminate TLS and then pass the traffic on to Caddy at that point. It's possible you could do it the other way; I just don't know how it would operate. Hey, thank you, that was a great talk, I learned a ton. A question about DNS providers: you piqued my interest when you said there were some that promise they don't keep logs around. Can you give an example or two? That's a good question. If you go to the DNSCrypt site, I believe there's a list; it's actually embedded in the project. There's both a list and a list generator, though I don't know exactly where it queries and retrieves the data from. But yeah, this information is actually bundled as part of the project. I'm blanking on the one I currently use.
I could probably pull it up, but it would take me a moment. The one I'm using will actually randomly pick a particular resolver in a geographic region. But yeah, there's a wide range of choices available on the project site. So, regarding that: you mentioned that the reason Algo is such a useful tool is that it sets up infrastructure that you run and you own. But it sounds like with DNSCrypt, you're relying on other people to make promises, the same way somebody might have relied on a VPN service. So would a good idea be something like Algo for your own DNS, if you care enough about your DNS information leaking? So when you say Algo for your own DNS provider, I wanna make sure I understand. Something like Algo, where you run the command and it sets up a remote recursive resolver for you that guarantees it doesn't log your queries, something like that. Got it, got it. So, your own personal recursive resolver, right? Yes, that is a fantastic idea. As a matter of fact, Kyle Rankin, who gave a talk on Qubes earlier today, has written an article for Linux Journal where he makes that exact argument. His argument is: if you can, you should be running your own recursive resolver, so that you are fully in control of this information. And yeah, I think that is a good idea. It is a bit more work, so it's really a question of: do you feel comfortable trusting a third party, given that you have a choice and you know the traffic is encrypted from your end all the way to the other end? Is that where you land on convenience versus security, or do you wanna take the time to set up your own recursive resolver? But that is a great suggestion, and if you have the time and the wherewithal to do it, I think it's great. Is it fair to trust a distributed DNS service? Sorry, so the question is: is it secure to trust another DNS resolver or provider that's distributed? Let's see. I suppose that's gonna depend on one's perspective. There's always some element of trust involved in computing in general, so everyone has to decide where on that spectrum, that continuum, they feel comfortable. For me personally, I'm okay with having my DNS queries encrypted and resolved by a third party, as long as I get to choose who that third party is and they've made what I feel is a reasonable promise. That said, as the other gentleman suggested, if you can resolve it yourself and not have to rely on that third party, then you have much less to worry about; that is more secure. Your comment about VPN services was illuminating. Of course, there are dozens of them, and some of them are free, and you figure you're taking risks with those. I kind of use the Opera one that came out. What would you suggest instead? I think you started to touch on that, but you weren't specific enough for a novice like myself. What would I choose instead of these VPN services? I thought you were speaking in general, that VPNs aren't all they're cracked up to be. Oh, I see. No, just to clarify, I wasn't suggesting that VPNs themselves aren't. It's the way they're commonly mentioned: a lot of the articles and things you'll see talk about using VPNs as a kind of anonymity tool, a way of bypassing region restrictions so that you can watch a particular streaming video that's locked to a particular continent.
The concept of a VPN is sound. The problem is, one, they're generally mentioned in the context of companies or organizations that provide the service to make money; they're not really doing it to look out for you. That's the biggest problem. And the other problem is that a lot of them have issues even from a practical standpoint. For example, OpenVPN: it's open source, that's wonderful, and it's a common tool used for this purpose, but it also requires client-side software. Running OpenVPN on phones can be challenging, and trying to get an organization to use OpenVPN requires that the software be installed, even when it is available on a variety of different devices, which can be cumbersome or undesirable. Whereas I mentioned Algo specifically because it uses protocols that have been well tested and that are already built into almost every operating system and phone in common use. Not a question this time, actually, just a comment about DNSCrypt, because I've used it for a little over a year now. If you're running it, I've noticed that if you ever click on a Google ad result from your search term, it won't resolve; Google will deny it, because it goes first through googleadservices.com. So for anyone looking to use it, just a heads-up: that's normal behavior, a consequence of Google not being able to register the transaction to then charge their client. But you can temporarily disable DNSCrypt, or select a different link that's not an ad result. Right, and that's a good point. As a follow-up to that: the configuration file allows you to do all kinds of things that you'll probably want to do. I don't really cover that here, just for time, but you can whitelist particular domains. I've had times, like at a hotel where a captive portal pops up and you have to do something in order to actually connect to their Wi-Fi, where that portal is rendered inoperable by DNSCrypt, so you can whitelist particular domains to get around that issue. One of the other things I'll mention, just as an aside, is that DNSCrypt doesn't do any DNS query caching, so it makes its query, with the inherent latency involved, every time, which is not great; you're going to notice the slowdown compared with your normal setup. So I use it in conjunction with a caching resolver: you can use dnsmasq, which is what I'm currently using, or you can use Unbound, which is another DNS resolver. Those will cache the queries, so you should return to a normal DNS query speed and lose the latency you'd have if you were just using DNSCrypt alone (there's a sketch of the dnsmasq setup at the end). On your Mac, why don't you just run a local resolver and get rid of the problem altogether, and not have to go off-machine? Yeah, that's a good question. I guess part of the reason is I've just never desired to set up BIND on my machines. And yes, there are less complicated resolvers I could use if I didn't really feel like using BIND; I could theoretically use a number of other tools for that purpose. But I like that idea; maybe that'll be a good subject for a future article: why DNSCrypt versus a local resolver. It's a good one. Yeah, I mean, on my Mac I have that, because I resolve my VPN-internal networks with my own resolver, and if I'm not on the VPN, it goes off to the outside.
So I run it for a tactical, operational reason, but it works for security too, because I'm never going off to Google or my ISP or whatever. And in hotels it just works, because it goes off and gets the hotel's data; I'm not doing anything funny. Right, that's a good point, and definitely worthy of consideration. Okay, I think we're on time. Thank you very much. Thanks, everyone.
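As a footnote to the caching question above, here's a hedged sketch of the dnsmasq-in-front-of-DNSCrypt arrangement mentioned in the Q&A; the port number is an assumption and just has to match wherever your dnscrypt-proxy instance is listening.

```
# /etc/dnsmasq.conf
# Answer DNS queries from this machine only.
listen-address=127.0.0.1
# Ignore any upstream servers from /etc/resolv.conf.
no-resolv
# Forward everything to dnscrypt-proxy (assumed to listen on port 5300).
server=127.0.0.1#5300
# Cache answers to hide DNSCrypt's per-query latency.
cache-size=10000
```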