Enhancing DLC Security: SBOM & Vulnerability Scanning

Oct 10, 2025 by ADMIN

This article discusses a proposal to harden Deep Learning Containers (DLC) by generating a Software Bill of Materials (SBOM) for each image and gating Continuous Integration (CI) on critical vulnerabilities. The approach aims to strengthen supply chain security, align with industry standards, and give users greater transparency into container contents.

The Motivation Behind SBOM and Vulnerability Scanning

Supply chain security is now a baseline expectation for container images. The proposal has two parts: generate an SBOM during the DLC build process, and fail CI when a critical CVE (Common Vulnerabilities and Exposures) entry is detected, so that known-vulnerable images are stopped before they reach production. This matters particularly for deep learning, where models and data are often sensitive and valuable. The initiative also aligns with Executive Order 14028 requirements and meets customer expectations for transparent container contents.

This approach leaves runtime images untouched: all changes happen within the build pipeline, so no new risk or complexity is added to running containers. The goal is to shift security left, addressing vulnerabilities early in the development lifecycle rather than fixing them in production, which is more efficient and reduces the risk of costly incidents. Embedding these checks in the build process is not about ticking boxes; it is about maintaining user trust and the long-term security of the platform.

Current Progress: A Proof of Concept

Currently, a fork branch, feature/sbom-vuln-scan, serves as a proof of concept for this initiative. This branch incorporates several key features:

- It adds Trivy installation steps to the relevant buildspecs, including those for PR, release, BJS (AWS's code for the China (Beijing) region), and extended release processes. Trivy, an open-source security scanner, is the cornerstone of this approach.
- It hooks scanning and SBOM generation into DockerImage.build via scripts/security/scan_image.sh, so every Docker image built goes through a security scan and SBOM generation.
- It uploads sbom/*.sbom.json artifacts and documents local validation in docs/local_scanning.md, so developers can check the SBOM and vulnerability scan results locally before pushing their code.
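
To make the flow above concrete, here is a minimal sketch of the two Trivy invocations a wrapper of this kind might issue per image. It is not the contents of scripts/security/scan_image.sh, which this article does not show. The function names, the file-naming scheme, and the example image reference are assumptions made for illustration; the Trivy flags used (--format spdx-json, --output, --severity, --exit-code) are standard options of the trivy image command.

```python
from typing import List


def sbom_path(image: str, sbom_dir: str = "sbom") -> str:
    """Map an image reference to a file name of the form sbom/*.sbom.json.

    The naming scheme is hypothetical; the PoC may name files differently.
    """
    safe_name = image.replace("/", "_").replace(":", "_")
    return f"{sbom_dir}/{safe_name}.sbom.json"


def sbom_command(image: str, output_path: str, trivy_bin: str = "trivy") -> List[str]:
    """Build the argv that writes an SPDX JSON SBOM for the image."""
    return [trivy_bin, "image", "--format", "spdx-json", "--output", output_path, image]


def vuln_scan_command(image: str, severity: str = "CRITICAL", trivy_bin: str = "trivy") -> List[str]:
    """Build the argv for a scan that exits with status 1 when findings match."""
    return [trivy_bin, "image", "--severity", severity, "--exit-code", "1", image]


if __name__ == "__main__":
    # Hypothetical image reference, used only to show the resulting commands.
    image = "example-registry/pytorch-training:latest"
    print(" ".join(sbom_command(image, sbom_path(image))))
    print(" ".join(vuln_scan_command(image)))
```

A CI step could pass each list to subprocess.run(cmd, check=True), in which case a non-zero exit from the vulnerability scan fails the build.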

This PoC demonstrates that SBOM generation and vulnerability scanning can be integrated into the DLC build pipeline. Because both steps run as part of every build, security becomes a core component of the DLC development lifecycle rather than an afterthought, and the branch provides a foundation for further refinement.

Design Choices and Configuration Options

Several design choices have been made to ensure the flexibility and effectiveness of this security initiative. These choices are not set in stone and can be adjusted based on community feedback and evolving security needs.

Scanner: Trivy is the chosen scanner for both SBOM generation and vulnerability scanning. It ships as a single binary, which makes it easy to install and use. The wrapper script allows the scanner binary to be overridden through the TRIVY_BIN environment variable.
Policy Defaults: A set of policy defaults has been established, configurable via environment variables. These defaults include:
- VULN_SEVERITY=CRITICAL: This sets the default vulnerability severity threshold to CRITICAL, meaning that only critical vulnerabilities will cause the build to fail.
- VULN_FAIL_ON=true: This ensures that the CI build will fail if any vulnerabilities meeting the severity threshold are found.
- GENERATE_SBOM=true: This enables SBOM generation by default.
- SBOM_DIR=sbom: This specifies the directory where SBOM artifacts will be stored.
- SKIP_VULN_SCAN=false: This ensures that vulnerability scanning is enabled by default.
Wrapper Script: The wrapper script reads the same configuration options for local runs and CI, so developer machines and the build pipeline behave consistently.
SBOM Format: PR artifacts currently use SPDX JSON; the format is configurable.
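
As an illustration of how these settings could be resolved in one place, the following sketch reads the environment variables listed above and falls back to the stated defaults. The ScanConfig class, the load_config function, the accepted boolean spellings, and the "trivy" fallback for TRIVY_BIN are assumptions for this example, not a description of how the PoC wrapper parses its settings.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; the real script's parsing may differ."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ScanConfig:
    trivy_bin: str
    vuln_severity: str
    vuln_fail_on: bool
    generate_sbom: bool
    sbom_dir: str
    skip_vuln_scan: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> ScanConfig:
    """Resolve settings from the environment, using the defaults listed above."""
    env = os.environ if env is None else env
    return ScanConfig(
        trivy_bin=env.get("TRIVY_BIN", "trivy"),  # assumed fallback: trivy on PATH
        vuln_severity=env.get("VULN_SEVERITY", "CRITICAL"),
        vuln_fail_on=_as_bool(env.get("VULN_FAIL_ON", "true")),
        generate_sbom=_as_bool(env.get("GENERATE_SBOM", "true")),
        sbom_dir=env.get("SBOM_DIR", "sbom"),
        skip_vuln_scan=_as_bool(env.get("SKIP_VULN_SCAN", "false")),
    )


if __name__ == "__main__":
    print(load_config())
```

Accepting the environment as a parameter means the same function serves local runs and CI, and it can be exercised in tests with a plain dictionary.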

Together, these choices balance security with flexibility, and they will be revisited as community feedback and security best practices evolve.

Seeking Guidance from Maintainers

To ensure the successful implementation and long-term sustainability of this initiative, we are actively seeking guidance from maintainers on several key aspects:

Preferred Scanner: Trivy is used today, but we are open to alternatives. Do maintainers prefer a different scanner, such as a Syft/Grype combination or Amazon Inspector sbomgen? Each tool has strengths and weaknesses, and we would like to weigh them before settling on one.
Desired Severity Threshold: The threshold is currently CRITICAL. Should it be widened to HIGH and above? Should the scan also pass --ignore-unfixed, so that vulnerabilities with no available fix do not block a build? The right setting depends on how much build friction is acceptable in exchange for stricter gating.
Long-Term SBOM Retention: What are the expectations for long-term SBOM retention? Should SBOMs be uploaded to S3, ECR, or another storage solution? Auditing and compliance both depend on a clear retention plan, so experience with strategies that have worked elsewhere is welcome.
Integration with Existing Security Review Workflows: Are there existing security review workflows this work should plug into? We want SBOM generation and vulnerability scanning to fit the current security process with minimal disruption, so pointers to those workflows would help.
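
To show what the threshold question means in practice, here is a small sketch that filters a Trivy JSON report by severity and, optionally, drops findings that have no fixed version. This is an assumed post-processing step, not part of the PoC; Trivy can apply the same policy itself through --severity and --ignore-unfixed. The function names and the placeholder CVE identifiers in the usage example are made up. The report field names (Results, Vulnerabilities, VulnerabilityID, Severity, FixedVersion) follow Trivy's JSON output.

```python
from typing import Any, Dict, List

# Trivy severity levels, lowest to highest.
SEVERITY_ORDER = ["UNKNOWN", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def severities_at_or_above(threshold: str) -> List[str]:
    """Expand a threshold such as HIGH into every level it covers."""
    return SEVERITY_ORDER[SEVERITY_ORDER.index(threshold.upper()):]


def blocking_findings(
    report: Dict[str, Any],
    threshold: str = "CRITICAL",
    ignore_unfixed: bool = False,
) -> List[str]:
    """Return the vulnerability IDs that would fail the build under this policy."""
    allowed = set(severities_at_or_above(threshold))
    findings = []
    for result in report.get("Results") or []:
        # A result can lack the key or carry null when nothing was found.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") not in allowed:
                continue
            if ignore_unfixed and not vuln.get("FixedVersion"):
                continue
            findings.append(vuln.get("VulnerabilityID", "unknown"))
    return findings


if __name__ == "__main__":
    # Placeholder report with made-up identifiers, for illustration only.
    sample = {
        "Results": [
            {
                "Target": "example-image",
                "Vulnerabilities": [
                    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL", "FixedVersion": "1.2.3"},
                    {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH"},
                ],
            }
        ]
    }
    print(blocking_findings(sample))
    print(blocking_findings(sample, threshold="HIGH", ignore_unfixed=True))
```

Joining the result of severities_at_or_above("HIGH") with commas gives HIGH,CRITICAL, the comma-separated form that the --severity flag accepts.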

Your feedback will shape this initiative. We are happy to adjust the implementation and the pull request (PR) based on your guidance.

Conclusion

Implementing SBOM generation and CI gating on critical vulnerabilities is a significant step towards more secure and transparent Deep Learning Containers. The initiative aligns with industry best practices and regulatory requirements, and catching vulnerabilities early in the development lifecycle reduces the risk of security incidents. We look forward to working with maintainers and the broader community to refine and implement this proposal.