HSER1 Versioning & Integrity: Phase G Explained

by ADMIN

This article walks through Phase G of the HSER1 project, which covers versioning, integrity checking with CRC32 (a 32-bit cyclic redundancy check), and backward compatibility. The phase builds on the determinism and reproducibility work from Phases C through F. The sections below cover its goal, scope, acceptance criteria, and deliverables.

Goal: Introducing HSER1 v1.1 with Integrity Protection

The main goal of Phase G is to introduce HSER1 v1.1, which adds integrity protection while still allowing existing v1 artifacts to load without issues. A v1.1 reader validates a CRC32 checksum when one is present, and a v1 reader can still read the data, just without the integrity check. This keeps the transition smooth and avoids breaking existing workflows.

Key improvements in HSER1 v1.1 include:

  • Integrity Protection: CRC32 checksums are used to verify data integrity, ensuring that data hasn't been corrupted during storage or transmission.
  • Backward Compatibility: Existing v1 artifacts can still be loaded, preventing disruptions to existing systems and workflows.
  • Versioning: Introducing a versioning system allows for future updates and improvements without breaking compatibility with older versions.
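The article identifies the checksum only as CRC32 with the IEEE polynomial. As a minimal sketch of what such a checksum computes, assuming the common reflected variant (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), a bitwise version could look like the following. The name `crc32_ieee` and its signature are illustrative and not taken from the HSER1 sources; a production implementation would more likely be table-driven.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 using the reflected IEEE polynomial 0xEDB88320.
// Illustrative helper: name and signature are assumptions of this sketch.
inline std::uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;  // conventional initial value
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint32_t>(data[i]);
        for (int bit = 0; bit < 8; ++bit) {
            // Shift out one bit; fold in the polynomial if that bit was set.
            const bool lsb_set = (crc & 1u) != 0u;
            crc >>= 1;
            if (lsb_set) {
                crc ^= 0xEDB88320u;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;  // conventional final XOR
}
```

For the nine ASCII bytes "123456789", this variant yields the well-known check value 0xCBF43926, which makes a convenient unit test for any reimplementation.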

Both properties matter in practice. A CRC32 checksum lets the loader detect most accidental corruption instead of silently accepting damaged data. Backward compatibility allows gradual adoption: users are not forced to migrate all of their data and systems at once.

Furthermore, the introduction of versioning itself is a critical step for the project's future. As HSER1 evolves, the ability to differentiate between versions becomes essential for managing changes and ensuring compatibility across different systems and deployments. This allows for the introduction of new features and improvements while maintaining a stable and predictable environment for existing users.

Scope: Tests, CI, and Documentation First

The scope of Phase G is carefully defined to minimize external dependencies and prioritize testing and documentation. This "tests/CI/docs-first" approach ensures that the core functionality is robust and well-understood before being widely deployed. The focus is on the following areas:

  • Versioning and Integrity: Adding versioning and integrity to HSER1 without breaking v1 readers.
  • Testing: Extending existing tests to cover v1.1 round-trip, backward compatibility, and corruption detection.
  • Documentation: Updating documentation to reflect the new versioning policy, trailer format, CRC32 details, and guarantees.
  • Continuous Integration (CI): Planning CI to exercise serialization tests in both default (SIMD) and scalar modes across different platforms.

Let's delve deeper into each of these areas:

  • Versioning and Integrity Implementation: The core of this phase is modifying the HSER1 writer to default to v1.1, which means appending a trailer tag and a CRC32 checksum after the payload. The reader supports both v1 (no trailer) and v1.1 (validates the CRC if a trailer is present). A key design decision is to keep the header size unchanged: integrity information lives in the trailer, so v1 loaders are not broken. A compile-time escape hatch allows emitting v1 (no trailer) if needed.

  • Comprehensive Testing: The tests cover three scenarios. Round-trip tests for v1.1 check that data can be saved and loaded correctly. Backward-compatibility tests load existing v1 goldens (reference artifacts) and verify their contents. Corruption-detection tests flip a bit in the payload of a v1.1 artifact and expect the loader to fail with a clear error message.

  • Detailed Documentation: The documentation is updated to describe the HSER1 versioning policy (v1 vs. v1.1), the trailer format and tag, the CRC32 details (IEEE polynomial), and the resulting guarantees. Backward-compatibility guarantees and migration notes are also included, so users have what they need to adopt the new version.

  • Robust CI Planning: Continuous Integration (CI) is essential for ensuring code quality and stability. The CI plan involves exercising serialization tests in both default (SIMD) and -DHYPERSTREAM_FORCE_SCALAR=ON modes across different platforms, including ubuntu-latest, windows-2022, and macOS-14. This broad testing coverage helps to identify and fix issues early in the development process.
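The writer-side change described in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual code in serialization.hpp: the tag "HSX1" is the example value mentioned later in the article, the little-endian byte order of the stored CRC is assumed, the macro `HSER1_SKETCH_EMIT_V1` is a hypothetical stand-in for the compile-time escape hatch, and the header that precedes the payload in a real file is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Bitwise CRC-32 (reflected IEEE polynomial); illustrative helper.
inline std::uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint32_t>(data[i]);
        for (int bit = 0; bit < 8; ++bit) {
            const bool lsb_set = (crc & 1u) != 0u;
            crc >>= 1;
            if (lsb_set) crc ^= 0xEDB88320u;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Writes the payload, then (by default) an 8-byte trailer: tag + CRC32.
inline void write_payload(std::ostream& os, const std::string& payload) {
    os.write(payload.data(), static_cast<std::streamsize>(payload.size()));
#ifndef HSER1_SKETCH_EMIT_V1  // hypothetical escape hatch: define to emit v1
    const std::uint32_t crc = crc32_ieee(
        reinterpret_cast<const unsigned char*>(payload.data()), payload.size());
    const unsigned char trailer[8] = {
        'H', 'S', 'X', '1',  // example tag from the article
        // CRC stored least-significant byte first (assumed byte order).
        static_cast<unsigned char>(crc & 0xFFu),
        static_cast<unsigned char>((crc >> 8) & 0xFFu),
        static_cast<unsigned char>((crc >> 16) & 0xFFu),
        static_cast<unsigned char>((crc >> 24) & 0xFFu)};
    os.write(reinterpret_cast<const char*>(trailer),
             static_cast<std::streamsize>(sizeof(trailer)));
#endif
}

// Convenience wrapper for in-memory use and tests.
inline std::string to_bytes(const std::string& payload) {
    std::ostringstream os;
    write_payload(os, payload);
    return os.str();
}
```

Because the trailer comes strictly after the payload, nothing before it changes, which is what lets a v1 loader that stops at the end of the payload keep working.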

Acceptance Criteria: What Defines Success?

To ensure that Phase G is successful, specific acceptance criteria have been defined. These criteria provide a clear and measurable way to determine whether the goals of the phase have been achieved. The acceptance criteria are:

  • v1.1 with CRC32 Validation: CRC32 validation must be implemented, including the trailer tag and CRC32 checksum over the payload.
  • Backward Compatibility: v1 goldens must be loaded without changes.
  • Corruption Detection: In the v1.1 corruption detection tests, loading a corrupted artifact must fail deterministically.
  • CI Green: CI must be green across ubuntu-latest, windows-2022, and macOS-14 (both default SIMD and force-scalar).
  • Zero External Dependencies: The implementation must have zero external dependencies.
  • Strict Warnings-as-Errors: The code must compile with strict warnings-as-errors.
  • High Code Quality: The code must meet staff+ code quality standards.

These acceptance criteria are designed to ensure that the new version of HSER1 is robust, reliable, and easy to use. The emphasis on backward compatibility and corruption detection highlights the commitment to data integrity and user experience. The criteria related to CI, external dependencies, and code quality ensure that the implementation is well-tested, maintainable, and aligns with the project's standards.

Deliverables: The Tangible Outcomes

The deliverables of Phase G represent the tangible outcomes of the work. These deliverables are the specific files and documentation that are updated or created as part of the phase. The deliverables are:

  • include/hyperstream/io/serialization.hpp: Minimal changes to add v1.1 trailer write and optional CRC validation on read (seekable stream path).
  • tests/serialization_tests.cc: New tests for v1.1 round-trip, v1 backward-compat load, and corruption detection.
  • Docs/Serialization.md: Versioning, integrity, and migration policy documentation.

Let's examine each deliverable in more detail:

  • include/hyperstream/io/serialization.hpp: This file contains the core serialization logic for HSER1. The changes in this file are focused on adding the v1.1 trailer write functionality and the optional CRC validation on read. The aim is to keep the changes minimal to reduce the risk of introducing new issues while effectively implementing the new features. The CRC validation is implemented specifically for seekable streams, as this allows for efficient reading of the trailer at the end of the data.

  • tests/serialization_tests.cc: This file contains the tests for the serialization functionality. The new tests added in this phase cover the key aspects of v1.1, including round-trip testing, backward compatibility testing, and corruption detection. These tests are crucial for verifying the correctness and reliability of the new implementation.

  • Docs/Serialization.md: This documentation file is updated to reflect the new versioning policy, integrity features, and migration policy. This includes detailed information about the trailer format, CRC32 algorithm, and any considerations for migrating from v1 to v1.1. Clear and comprehensive documentation is essential for users to understand and use the new features effectively.
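To make the corruption-detection test concrete, here is a self-contained sketch that works on in-memory byte strings rather than the real HSER1 API. The helpers `make_v1_1` and `load_v1_1`, the error messages, the "HSX1" tag, and the little-endian CRC are all assumptions made for illustration. The check flips every payload bit in turn, relying on the fact that a CRC detects any single-bit error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Bitwise CRC-32 (reflected IEEE polynomial) over a byte string.
inline std::uint32_t crc32_ieee(const std::string& bytes) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (const char c : bytes) {
        crc ^= static_cast<std::uint32_t>(static_cast<unsigned char>(c));
        for (int bit = 0; bit < 8; ++bit) {
            const bool lsb_set = (crc & 1u) != 0u;
            crc >>= 1;
            if (lsb_set) crc ^= 0xEDB88320u;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Builds payload + "HSX1" + little-endian CRC32 (layout assumed here).
inline std::string make_v1_1(const std::string& payload) {
    std::string out = payload + "HSX1";
    const std::uint32_t crc = crc32_ieee(payload);
    for (unsigned shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<char>(
            static_cast<unsigned char>((crc >> shift) & 0xFFu)));
    }
    return out;
}

// Returns the payload, or throws with a descriptive message.
inline std::string load_v1_1(const std::string& bytes) {
    if (bytes.size() < 8 || bytes.compare(bytes.size() - 8, 4, "HSX1") != 0) {
        throw std::runtime_error("HSER1: missing v1.1 trailer");
    }
    const std::string payload = bytes.substr(0, bytes.size() - 8);
    std::uint32_t stored = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const unsigned char b =
            static_cast<unsigned char>(bytes[bytes.size() - 4 + i]);
        stored |= static_cast<std::uint32_t>(b) << (8 * i);
    }
    if (stored != crc32_ieee(payload)) {
        throw std::runtime_error("HSER1: CRC32 mismatch");
    }
    return payload;
}

// Flips each payload bit in turn; every corrupted copy must be rejected.
inline bool all_single_bit_flips_detected(const std::string& payload) {
    const std::string good = make_v1_1(payload);
    assert(load_v1_1(good) == payload);  // sanity: clean data round-trips
    for (std::size_t bit = 0; bit < payload.size() * 8; ++bit) {
        std::string bad = good;
        const unsigned flipped =
            static_cast<unsigned char>(bad[bit / 8]) ^ (1u << (bit % 8));
        bad[bit / 8] = static_cast<char>(static_cast<unsigned char>(flipped));
        try {
            (void)load_v1_1(bad);
            return false;  // corruption slipped through
        } catch (const std::runtime_error& e) {
            if (std::string(e.what()) != "HSER1: CRC32 mismatch") return false;
        }
    }
    return true;
}
```

Checking the exact error message, and not just that something was thrown, is one way to pin down the "clear error message" expectation so the test cannot pass for an unrelated failure.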

Notes: Trailer Detection Strategy

A key aspect of the implementation is the trailer detection strategy. After reading the payload, the reader checks for a 4-byte tag (e.g., "HSX1"). If this tag is present, the next 4 bytes are read as the CRC32 checksum, which is then validated. If the tag is not present, or if the stream is non-seekable, the reader falls back to v1 behavior, meaning no CRC validation is performed. This approach allows for seamless backward compatibility, as v1 files will not have the trailer tag and will be processed without CRC validation.
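The detection logic above can be sketched against a standard `std::istream`. This is an illustration under assumptions rather than the real loader: the payload size is taken as a parameter (presumably it comes from the header in the actual format), the tag is the example "HSX1", the CRC is assumed to be stored little-endian, and `LoadStatus` and `read_payload` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>  // std::istringstream, handy for exercising the reader
#include <string>

// Bitwise CRC-32 (reflected IEEE polynomial); illustrative helper.
inline std::uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint32_t>(data[i]);
        for (int bit = 0; bit < 8; ++bit) {
            const bool lsb_set = (crc & 1u) != 0u;
            crc >>= 1;
            if (lsb_set) crc ^= 0xEDB88320u;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

enum class LoadStatus { OkV1, OkV1_1, CrcMismatch, Truncated };

// Reads `payload_size` bytes, then probes for the optional trailer.
inline LoadStatus read_payload(std::istream& is, std::size_t payload_size,
                               std::string& payload) {
    payload.assign(payload_size, '\0');
    is.read(&payload[0], static_cast<std::streamsize>(payload_size));
    if (is.gcount() != static_cast<std::streamsize>(payload_size)) {
        return LoadStatus::Truncated;
    }
    const std::istream::pos_type after_payload = is.tellg();
    if (after_payload == std::istream::pos_type(-1)) {
        return LoadStatus::OkV1;  // non-seekable: skip the probe entirely
    }
    char tag[4] = {};
    is.read(tag, 4);
    if (is.gcount() != 4 || std::memcmp(tag, "HSX1", 4) != 0) {
        is.clear();               // a short read sets eofbit/failbit
        is.seekg(after_payload);  // leave the stream where a v1 read would
        return LoadStatus::OkV1;
    }
    unsigned char crc_le[4] = {};
    is.read(reinterpret_cast<char*>(crc_le), 4);
    if (is.gcount() != 4) {
        return LoadStatus::Truncated;  // tag present but checksum cut off
    }
    const std::uint32_t stored =
        static_cast<std::uint32_t>(crc_le[0]) |
        (static_cast<std::uint32_t>(crc_le[1]) << 8) |
        (static_cast<std::uint32_t>(crc_le[2]) << 16) |
        (static_cast<std::uint32_t>(crc_le[3]) << 24);
    const std::uint32_t computed = crc32_ieee(
        reinterpret_cast<const unsigned char*>(payload.data()), payload.size());
    return stored == computed ? LoadStatus::OkV1_1 : LoadStatus::CrcMismatch;
}
```

Seeking back after a failed probe is the reason the check is limited to seekable streams in this sketch: on a stream that cannot rewind, the probed bytes could not be returned to a caller that expects them to follow the payload.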

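The choice of the trailer tag is important for avoiding false positives. Because the reader looks for the tag only at the position immediately after the payload, a false positive would require a v1 stream to carry trailing bytes that happen to match the tag. The tag should therefore be a distinctive byte sequence that is unlikely to appear there by accident. A 4-byte tag provides a good balance between uniqueness and efficiency.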

Conclusion

Phase G introduces versioning and integrity protection to HSER1 while maintaining backward compatibility. Placing the tag and CRC32 in a trailer leaves the header unchanged, so v1 artifacts keep loading and v1 loaders are not broken. The tests, documentation, and CI plan are scoped so that the new behavior is verified and documented before it is widely deployed.

In summary, Phase G delivers a more robust and reliable HSER1 and lays the groundwork for future enhancements, with a smooth transition for existing users.