Hello, everyone. In this talk, I will share with you our journey of exploring trusted SBOMs. SBOMs are commonly used nowadays, and our goal is to argue for the critical importance of verifying them. My name is Haoxiang Zhang. I'm a researcher from the Centre for Software Excellence at Huawei Canada. My co-speaker is Professor Ahmed Hassan from Queen's University in Canada.

First is a disclaimer statement. I'm not going to read it word by word, but I will pause for a few seconds so that people can read through it.

Now, the software bill of materials, also known as SBOM, is a concept widely adopted in the software engineering industry. I list several definitions of SBOM from different sources so that people can have a common ground on what an SBOM is. For example, the Linux Foundation defines an SBOM as formal, machine-readable metadata for identifying a software package's contents. NTIA states that an SBOM is a formal and machine-readable inventory of software components and dependencies. It highlights that an SBOM should be comprehensive; in other words, an SBOM should explicitly state what a piece of software contains or does not contain. CISA also describes an SBOM as a nested inventory of the components that make up the software. A recent initiative by OpenSSF is a $3.2 million "SBOM Everywhere" project.

In many ways, an SBOM is like the list of ingredients on consumer products such as food, where each ingredient is listed, and the potential allergies that might be caused by such ingredients are also disclosed.

Here, we show why it matters to have a trusted SBOM. It can be used for accurate tracing of vulnerabilities. With a low-quality SBOM, there could be two types of inaccurate tracing. First is a missed alarm, which happens when we miss the warning about a used package that is vulnerable. This is shown with the gray arrow that represents an undeclared but used dependency.
The second case is a false alarm, which happens when we warn about an unused package that is vulnerable, as shown with the dotted blue arrow. In this case, the dependency is declared but never used. A high-quality SBOM needs to mitigate both issues. The quality of an SBOM is also essential for license management as well as vulnerability management.

Right now, there are many methods to generate an SBOM to communicate the contents of a piece of software. An SBOM can be generated manually, semi-automatically, or in a manifest-based fashion. Google Open Source Insights also provides dependency information that can be used to create an SBOM, and build-tracing methods can be used as well. For all these types of SBOM generators that take a piece of software as input, we can ask the same question: how accurate is this SBOM? It is of great importance to verify these SBOMs so that different approaches can be validated and issues in SBOM correctness and completeness can be avoided.

What do we mean by SBOM correctness and completeness? These are two metrics that can be used to define the quality of an SBOM. First, I will explain how they can be computed. The correctness of an SBOM measures the number of declared packages that are actually found in the software, out of all the packages declared in the SBOM. It indicates how correct the information provided in an SBOM is. The completeness of an SBOM measures the number of files in the software whose provenance can be determined based on the SBOM, divided by the total number of files in the software's binary.

Next, I will work through examples to show how SBOM correctness and completeness are measured. In this example, a piece of software contains five files: F1, F3, F5, F7, and M1. The associated SBOM for the software lists packages A, B, C, and D. What we can do is examine the contents of these packages and check whether any file in the software can be matched to files in the packages themselves.
In this case, F1 and F3 from package A can be identified in the software, and F7 from package C can be identified in the software. Therefore, the correctness of the SBOM is 2 verified packages divided by 4 declared packages, that is, 0.5.

To apply the idea of SBOM correctness in the real world, we pick a piece of software called Hikube version 1.0.5. Google OSI declares that this software has 11 dependencies. We compute the correctness of this SBOM to be 0.27; that is, 3 out of the 11 packages can be verified. A correct example is commons-logging version 1.2, where 31 out of its 33 open-source files are verified to exist in the Hikube binary. An inaccurate example is log4j version 1.2.17, where none of log4j's files or function signatures can be found in the Hikube binary. In this example, the concrete impact is that OSI has shown vulnerability advisories to alert about log4j; however, these could be false alarms.

For SBOM completeness, we use the same example to illustrate how it can be computed. For each file in the software, we can identify F1 and F3 from package A and F7 from package C. M1 can be identified as the software's own file. However, F5 cannot be matched to any file listed as the content of the declared packages. Therefore, the completeness is 4 identified files divided by 5 files in the software, that is, 0.8.

A concrete case of SBOM completeness is junit-platform-console version 1.8.2. Its completeness value is 0.97, where 1553 files in the software can be verified: 285 files are verified from its 6 dependencies, and 1268 files are verified from the software itself by matching the software name in their file paths. 48 files remain unknown. We look into these cases and find that 45 of the unknown files are Java source files with fully qualified names starting with org.hamcrest, suggesting that they originate from the open-source project hamcrest-core. However, hamcrest-core is not declared as a dependency by OSI.
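The worked example above can be sketched in a few lines of code. This is a minimal illustration rather than our actual tooling: it assumes files are matched by name alone, and the package contents and file list are hard-coded from the example.

```python
# Hard-coded from the worked example: the software's files and the SBOM's packages.
software_files = {"F1", "F3", "F5", "F7", "M1"}
self_files = {"M1"}  # files known to belong to the software itself
sbom_packages = {
    "A": {"F1", "F3"},
    "B": {"F2"},
    "C": {"F7"},
    "D": {"F9"},
}

def correctness(packages, files):
    """Fraction of declared packages with at least one file found in the software."""
    verified = [name for name, contents in packages.items() if contents & files]
    return len(verified) / len(packages)

def completeness(packages, files, own_files):
    """Fraction of files whose provenance is a declared package or the software itself."""
    external = set().union(*packages.values())
    traced = [f for f in files if f in external or f in own_files]
    return len(traced) / len(files)

print(correctness(sbom_packages, software_files))               # 2/4 = 0.5
print(completeness(sbom_packages, software_files, self_files))  # 4/5 = 0.8
```

Packages A and C are verified (their files appear in the software), giving correctness 0.5; four of the five files (all but F5) can be traced, giving completeness 0.8, matching the numbers above.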
The concrete impact of this is that these source files need to be verified manually to avoid untraceable vulnerabilities. In the worst-case scenario, if such files contain vulnerable code or functions, they could potentially harm the software without notice. Even worse, these types of vulnerabilities would be hard to spot in a supply chain attack.

After working through SBOM correctness and completeness, I would like to raise some open questions for our community to think deeply about regarding the quality of SBOMs. Number one: how do we compute the SBOM metrics? Number two: how do we know the content of each package? Number three: how do we map files to packages? And number four: how do we tell the software's own files apart from unknown files whose origin we are unable to trace? Our goal is not to answer these questions today, but if our community can try to answer them together, that will push the effort towards improving the quality of SBOMs further.

Now, we show our approach to verify SBOM correctness. It requires the list of packages in the SBOM and the binary of the software as input. This automated verification approach first reads the package list in the SBOM. Then we extract the file contents from these packages. For each file in a package, we check if it exists in the software binary, and finally we calculate the percentage of packages that can be verified. Note that a package is verified if at least one of its files exists in the software.

The approach to verify SBOM completeness requires the list of packages in the SBOM and the list of files in the software binary as input. This is a semi-automated verification approach. First, we extract the list of files in the binary. Then, provenance analysis is done by checking whether each file in the binary is an internal file, using a heuristic that matches the application keyword in the file's fully qualified name, or whether the file exists as an external file from a declared package.
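The two verification pipelines described above can be sketched as follows. This is a simplified, hypothetical sketch, not our implementation: the binary's file list is assumed to be already extracted, and the internal-file heuristic is reduced to a plain keyword match on the file path.

```python
def verify_package(package_files, binary_files):
    """A package is verified if at least one of its files exists in the binary."""
    return any(f in binary_files for f in package_files)

def sbom_correctness(sbom, binary_files):
    """Automated check: fraction of declared packages that can be verified."""
    verified = sum(verify_package(files, binary_files) for files in sbom.values())
    return verified / len(sbom)

def classify_file(path, sbom, app_keyword):
    """Semi-automated provenance analysis for one file in the binary."""
    if app_keyword in path:              # heuristic: internal (self) file
        return "internal"
    for pkg, files in sbom.items():      # external file from a declared package
        if path in files:
            return pkg
    return "unknown"                     # must be escalated to a human expert

# Hypothetical usage with made-up file paths:
sbom = {"commons-logging": {"org/apache/commons/logging/Log.class"}}
binary = {"org/apache/commons/logging/Log.class",
          "com/example/app/Main.class",
          "org/hamcrest/Matcher.java"}
print(sbom_correctness(sbom, binary))                                # 1.0
print(classify_file("com/example/app/Main.class", sbom, "example"))  # internal
print(classify_file("org/hamcrest/Matcher.java", sbom, "example"))   # unknown
```

The last call mirrors the junit-platform-console case: a file that matches neither the application keyword nor any declared package falls into the "unknown" bucket that needs human review.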
This provenance analysis is semi-automated due to the use of heuristics. An unknown file needs to be verified by a human expert. However, this is hard to do at scale, since the origins of unknown files are challenging to trace. To solve this problem, large-scale code-provenance databases such as World of Code and Software Heritage can be used. These databases provide useful infrastructure for solving many software supply chain challenges. For more details, you can refer to their websites.

Now, we share our investigation of SBOM quality based on 100 publicly available Java software packages. We collected them from three main sources to try to understand, quantitatively, what SBOM correctness and completeness look like in the real world. Our results show that SBOM correctness is low in the real world. This table shows 20 software packages with their correctness ranging from 0 to 1. The median SBOM correctness is 0.31. When we verify SBOM correctness with Java class files only, the median correctness drops further to 0.07. Similarly, we compute SBOM completeness and find that the median value is 0.65. This gives us an idea of the proportion of files in a software binary whose provenance can be traced. As a community, we must work harder to improve the trustworthiness of SBOMs.

Before I finish my talk, I would like to share our current plan. We are trying to conduct a more comprehensive study based on 100 publicly available open-source software packages to assess the quality, that is, the correctness and completeness, of the generated SBOMs. We also encourage our whole community to put effort into trusted SBOMs. For example, how can we define SBOM quality metrics beyond correctness and completeness, with others proposed by the community? We also need tools to measure the quality of SBOMs, not just tools to generate SBOMs. Last but not least, our community should work together to form organizations that provide stamps of SBOM approval.
That is similar to how farms are verified to be organic. That is the end of my talk. In summary, we introduced the concepts of SBOM correctness and completeness by working through examples, and we showed real-world measurements. We also discussed why the SBOM completeness measure is semi-automated and how World of Code and Software Heritage can be utilized to improve the tracing of unknown files for SBOM quality improvement. Last, we shared our study to verify SBOMs and proposed community efforts to improve SBOM quality. If you have any questions, we can discuss them in this session, or I can be reached at this email address. Thank you.