The HSA Foundation is announcing v1.1 of their spec today with some important changes. Since SemiAccurate first brought you the news of HSA years ago, the slow progress forward has been picking up pace quickly.
Just over a year from the March 2015 launch of v1.0 of the HSA spec comes the new revision. Since fully HSA1.0 compatible devices are just hitting the market now, there is a Dell box imminent with all the features enabled on a Carrrizo desktop, how long will we have to wait for v1.1 hardware? That is the beauty of the v1.1 spec, there are no hardware changes required so if you are HSA1.0 compatible you are HSA1.1 compatible.
In light of this it may not seem like v1.1 brings much to the table but that is not the case. Understanding those differences may take a pretty technical mind but most SemiAccurate readers will probably keep up. If you aren’t familiar with the current state of HSA, we won’t rehash it in detail but start here and here and here.
The basics don’t change with v1.1 which is no surprise in light of the hardware compatibility, you still compile your code to the HSAIL intermediate language and that still uses the hQ structure to pass data. Memory is still pinned rather than copied ad infinitum, and efficiency of data use is still a top priority. What the new spec offers are features that most wanted earlier and wider support.
First up is the spreading of HSA to a wider range of hardware. Currently it only supports a CPU and GPU as targets because, well, that’s all the hardware that was out there. Looking at the breadth of HSA Foundation members it is clear that support will add a wider class of devices and v1.1 delivers just that. Image processors, DSPs, NICs, and a whole host of other ISA are now officially supported, and nearly anything with a bunch of parallel processors can be added. Look for more to be added as soon as hardware from supporting vendors is official announced.
Software and drivers are a key point in this type of multi-vendor interaction and HSA1.1 supports that in a few new ways. There are transparent features now and mulit-vendor IP blocks are better integrated too. Also added are several key pieces of information that can be explicitly queried too so if you want to know where a page table resides you can actually get that answer directly. This was a big request in older versions, sometimes you need to find information that a generic abstraction can’t provide.
Speaking of generic abstractions in v1.1 we have a vendor neutral generic driver, something that makes sense for a virtual ISA like the one HSA provides. Think of this driver as the layer from the system to HSA, each hardware vendor still needs to provide a specific driver for their hardware. What this will hopefully do is simplify and standardize the coding process for the user and software writer. If you think about it, abstracting the hardware from the user perspective is a key enabler for multiple classes of hardware, and more direct polling of structures does most of the rest.
Another big gap in the 1.0 spec surrounds pausing of threads. As you might be aware the hQ structure and massively threaded programming does a lot of pausing of threads and HSA is no exception. HSA1.0 has mechanisms for pausing and restarting threads but 1.1 takes it a step further with multi-wait signals. A thread can now wait for more than one signal and apply some simple logic to the unpause process. This may seem basic but it wasn’t there and now is.
Multi-wait signaling can be combined with so called forward progress rules, also new. This does exactly what it sounds like, it guarantees a thread will make progress under certain conditions. Between the two new features you can implement QoS and service guarantees, a necessary step for many media applications especially real-time ones. With HSA1.1 you don’t have to write it all yourself, the primitives are there for you and work across multiple hardware blocks.
Memory has a lot of new features too starting with a formal memory model definition. If you think about it, HSA already has both big and little endian hardware supported and with new additions a formal model will make life a lot saner for programmers. Non-temporal memory accesses, basically cache eviction after use, and multi-agent sharing of images is also now official, the latter being effectively access to pinned memory spaces from multiple blocks.
That brings us to the tools and here we start with the finalizer. It now allows linkage to standard code objects and supports versioning. The first of these is pretty self explanatory, the second tends to be more familiar to the non-consumer space. Versioning effectively allows the code to specify which variant of the finalizer it needs and more importantly get it. This may seem trivial to non-programmers but you just need to look at modern shenanigans by Microsoft and forced march upgrades to realize how valuable flexible versioning is.
Moving farther down the tool chain there is a new profiling API that supports multiple agents. Massively threaded code running across multiple differing hardware blocks can be a tad problematic to optimize and that is where the new API comes in. Depending on the hardware it can work in realtime or just gather data for future examination. There is now direct access to hardware counters and timestamps as well. Gaming is an obvious example of where this will pay off but media work will see huge benefits as well. If nothing else this will save a lot of programmer hair from hitting the floor after being torn out in chunks.
Last up we have added language support with the headliner being Python and the Numba NumPy aware compiler. It is available on GitHub with direct HSA support and does automatic parallization of supported functions. Java, Javascript, OpenCL, and C++ were supported too but the new C++17 standard will probably include a standard template library for HSA as well. All the HSA1.0 tools and toolchains will work with v1.1 as well and most new features will likely be supported in the near future.
Those are the major points of the new HSA1.1 spec. It runs on the same hardware, adds a few languages and profiling tools, and brings a lot of new hardware possibilities into the mix. To support this there is a standard virtual ISA, explicit querying of some structures, and a much more robust signaling and wait model. To the user it should work better on more devices, to the programmer it should be simpler to support and easier to fix when things go wrong. For the hardware vendors it can only increase the available software for their offerings, what’s not to like?S|A