Dynamically Load CUDSS Via Pypi For Space Savings
Hey everyone! Let's dive into how we can optimize our PyTorch builds by dynamically loading the CUDSS library using pypi wheels. Currently, we're statically linking CUDSS, which, while convenient, is consuming a significant chunk of our precious pypi wheel space. Nvidia now provides official pypi wheels for CUDSS, and leveraging these could be a game-changer for us. This article will guide you through the process, benefits, and steps to achieve this optimization.
Why Dynamically Load CUDSS?
Let's face it, pypi wheel space is valuable real estate. Statically linking CUDSS means we're including the entire library in our PyTorch builds, even though we might only be using a fraction of its features. By switching to dynamic linking, we can significantly reduce the size of our wheels, making them easier to distribute and download. Plus, it aligns with modern dependency management practices.
The main motivation here is efficiency. We want to ensure that our users have the smallest possible download size when installing PyTorch. Smaller wheels translate to faster downloads and less storage consumption, which is a win-win for everyone. Furthermore, dynamically linking against Nvidia's official pypi wheels ensures that we're always using the latest and greatest version of CUDSS, with all the bug fixes and performance improvements that come with it.
Another compelling reason is maintainability. By relying on Nvidia's official wheels, we offload the responsibility of building and maintaining CUDSS to the experts. This frees up our team to focus on other critical aspects of PyTorch development. It also simplifies our build process, reducing the complexity and potential for errors. Using pre-built wheels reduces the chance of build failures due to version conflicts or other environmental issues.
Steps to Dynamically Link CUDSS
Here’s a breakdown of the steps we need to take to dynamically link CUDSS using pypi wheels:
1. Verify pypi Wheel Availability
Before we jump into implementation, we need to ensure that pypi wheels exist for all our supported targets. This is crucial because we don't want to break compatibility with certain platforms. Head over to https://pypi.org/project/nvidia-cudss-cu12/ and https://pypi.org/project/nvidia-cudss-cu13/ to check the available wheels. If we find any missing targets, we have a couple of options:
- Ping Nvidia: Reach out to Nvidia and ask them to add the missing wheels. They might be able to provide them relatively quickly.
- Statically Link for Specific Targets: If getting the wheels from Nvidia proves to be a hurdle, we can consider statically linking CUDSS only for the targets where pypi wheels are unavailable. This allows us to move forward with dynamic linking for the majority of our users while still supporting all platforms.
It's essential to document which targets are dynamically linked and which are statically linked. This will help with future maintenance and troubleshooting. It's also important to regularly check for updates on the availability of pypi wheels for the statically linked targets, so we can eventually switch them over to dynamic linking as well.
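If you'd rather script this check than click through the project pages, here's a small sketch against pypi's public JSON API (the endpoint is real; how you post-process the filenames is up to you):

```python
# Sketch: list the published wheel files for the cuDSS packages via pypi's
# JSON API, so missing platform targets are easy to spot.
import json
import urllib.request

for pkg in ("nvidia-cudss-cu12", "nvidia-cudss-cu13"):
    with urllib.request.urlopen(f"https://pypi.org/pypi/{pkg}/json") as resp:
        data = json.load(resp)
    print(pkg, data["info"]["version"])
    for f in data["urls"]:
        if f["filename"].endswith(".whl"):
            print("  ", f["filename"])
```

The platform tag at the end of each wheel filename (e.g. `manylinux2014_x86_64`, `win_amd64`) tells you which targets are covered.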
2. Update CMake for Dynamic Linking
CMake is the backbone of our build system, so we need to modify it to handle dynamic linking. This involves telling CMake to look for the CUDSS library as a shared library instead of a static library. Here's a general outline of the changes we'll need to make:
- Find CUDSS: Modify the `find_package` command for CUDSS to specify that we're looking for a shared library. This ensures that CMake searches for the `.so` files (or `.dll` files on Windows) instead of the `.a` files.
- Link Against CUDSS: Update the `target_link_libraries` command to link against the CUDSS shared library. This tells the linker to record CUDSS as a runtime dependency of our PyTorch binaries.
Here's a snippet you might find useful, though remember to adapt it to your specific CMake setup:
```cmake
find_package(CUDSS REQUIRED)
target_link_libraries(your_target PRIVATE CUDSS::CUDSS)
```
The key here is to ensure that CMake knows where to find the CUDSS shared library. This might involve setting the `CUDSS_DIR` variable to point to the directory containing the CUDSS CMake configuration files. You may also need to add the CUDSS library directory to the `LD_LIBRARY_PATH` environment variable (or the equivalent on Windows).
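As a quick sanity check that the wheel's library is discoverable at all, here's a hedged Python sketch that hunts for the installed cuDSS shared object under site-packages. The `nvidia/**/libcudss.so*` layout is an assumption based on how other Nvidia pypi wheels are laid out, so adjust the glob if the cuDSS wheel differs:

```python
# Hedged sketch: locate the cuDSS shared library installed by the pypi wheel.
# The nvidia/**/libcudss.so* layout is an assumption; adjust if the wheel differs.
import pathlib
import sysconfig

site_packages = pathlib.Path(sysconfig.get_paths()["purelib"])
hits = sorted(site_packages.glob("nvidia/**/libcudss.so*"))
if hits:
    print("cuDSS library directory:", hits[0].parent)
else:
    print("no libcudss found under", site_packages)
```

The printed directory is what you would feed to CMake here, and to the rpath step below.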
3. Add pypi Wheels to CUDA Requirements
We need to tell our build system that CUDSS is now a dependency that should be installed from pypi. This involves adding the `nvidia-cudss-cuXX` package (where `XX` is the CUDA version) to the list of CUDA requirements, which ensures that the CUDSS pypi wheel is installed alongside PyTorch.
This step ensures that when someone installs PyTorch, the correct version of the CUDSS library is also installed automatically. This simplifies the installation process and reduces the risk of version conflicts.
To do this, locate the file where CUDA dependencies are listed. This might be a `requirements.txt` file, a `setup.py` file, or a custom configuration file used by your build system. Add the appropriate `nvidia-cudss-cuXX` package to this list, making sure to specify the correct CUDA version. For example:
```
nvidia-cudss-cu12
```
Also, consider specifying version constraints to ensure compatibility. For example, you might want to specify a minimum version of CUDSS that is known to work well with your version of PyTorch:
```
nvidia-cudss-cu12>=1.2.0
```
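If the dependency list lives in `setup.py` rather than a requirements file, the equivalent declaration might look like the sketch below; the project name, version, and constraint here are illustrative, not PyTorch's actual packaging code:

```python
# Hedged sketch: declaring the cuDSS wheel as an install requirement in
# setup.py. The project name, version, and constraint are illustrative only.
from setuptools import setup

setup(
    name="your_package",
    version="0.1.0",
    install_requires=["nvidia-cudss-cu12>=1.2.0"],
)
```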
4. Add .so Locations to the rpath List
The `rpath` (run-time search path) tells the dynamic linker where to find shared libraries at runtime. We need to add the locations of the CUDSS `.so` files to the `rpath` list so that PyTorch can find the CUDSS library when it's running. (Windows has no `rpath`; the DLL search path plays the same role for the `.dll` files there.)
This step is crucial because it ensures that PyTorch can find the CUDSS library at runtime. Without the correct `rpath`, importing PyTorch might fail with a missing-library error, or it might crash the first time it calls into CUDSS.
The exact method for adding to the `rpath` list depends on your build system. In CMake, you can use the `set_target_properties` command with the `INSTALL_RPATH` property. For example:
```cmake
# get_target_property takes (variable, target, property) -- no PROPERTIES keyword.
get_target_property(current_rpath your_target INSTALL_RPATH)
set_target_properties(your_target PROPERTIES INSTALL_RPATH "${current_rpath};/path/to/cudss/lib")
```
Replace `/path/to/cudss/lib` with the actual path to the directory containing the CUDSS `.so` files. You might need to use a variable to represent this path, depending on how your build system is set up.
It's also important to consider the different types of `rpath` entries and their implications. For example, you can use `$ORIGIN` to specify a path relative to the location of the library itself, which keeps the wheel relocatable.
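Once the build is done, it's worth checking what actually got baked into the binary. Here's a hedged, Linux-only sketch using binutils' `readelf`; the `torch/lib/libtorch_cuda.so` path is illustrative, so point it at whichever library links CUDSS in your build:

```python
# Hedged sketch: print the RPATH/RUNPATH entries baked into a built library.
# Linux-only; requires binutils' readelf. The library path is illustrative.
import subprocess

out = subprocess.run(
    ["readelf", "-d", "torch/lib/libtorch_cuda.so"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.splitlines():
    if "RPATH" in line or "RUNPATH" in line:
        print(line.strip())
```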
5. Upload New Wheels to S3 pypi Index
Finally, we need to upload the newly built wheels to our S3 pypi index. This makes them available for nightly builds and for our users to download. This step typically requires special permissions and access to the S3 bucket where our pypi index is stored. Reach out to @atalman or someone else with S3 access to get this done.
Ensure the uploaded wheels are properly indexed so that `pip` can find and install them correctly. You may need to run a command to update the index after uploading the new wheels.
Before uploading, it's always a good idea to test the new wheels locally to ensure that they install correctly and that PyTorch can find and use the CUDSS library. This can save you from having to upload multiple versions of the wheels to S3.
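A minimal smoke test might look like the sketch below. That `torch.sparse.spsolve` exercises the cuDSS path is our assumption about this build, so swap in whichever op your tree actually routes through CUDSS:

```python
# Hedged smoke test: confirm the locally built wheel imports and that a sparse
# direct solve runs on CUDA. That spsolve dispatches to cuDSS here is an
# assumption; substitute the op that exercises CUDSS in your build.
import torch

A = torch.eye(4, device="cuda").to_sparse_csr()
b = torch.ones(4, device="cuda")
x = torch.sparse.spsolve(A, b)
assert torch.allclose(x, b), "identity solve should return b"
print("cuDSS-backed solve OK")
```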
Alternatives Considered
While dynamically loading CUDSS via pypi wheels is the preferred approach, we could also consider other alternatives. However, they generally have significant drawbacks:
- Statically Linking Everything: This is what we're currently doing, and it leads to large wheel sizes, which we're trying to avoid.
- Building CUDSS from Source: We could build CUDSS from source as part of our PyTorch build process. However, this would add significant complexity to our build system and would require us to maintain the CUDSS build scripts.
- Using System-Installed CUDSS: We could rely on users to install CUDSS on their systems. However, this would make it difficult to ensure that users have the correct version of CUDSS installed, and it would add complexity to the installation process.
Additional Context
Dynamically loading CUDSS helps save space in our PyTorch builds. By following these steps, we can reduce the size of our pypi wheels, improve the maintainability of our build system, and ensure that our users are always using the latest and greatest version of CUDSS. Let's work together to make this happen!