{"id":797,"date":"2026-01-18T12:30:09","date_gmt":"2026-01-18T16:30:09","guid":{"rendered":"https:\/\/www.fitzsim.org\/blog\/?p=797"},"modified":"2026-01-18T19:49:47","modified_gmt":"2026-01-18T23:49:47","slug":"gfx1201-on-power9","status":"publish","type":"post","link":"https:\/\/www.fitzsim.org\/blog\/?p=797","title":{"rendered":"gfx1201 on POWER9"},"content":{"rendered":"\n<p><em>AI GoF disclaimer: I don&#8217;t expect this blog post to contribute to frontier AI gain-of-function research or I would refrain from publishing it. Please consider supporting <a href=\"https:\/\/doomdebates.com\/\">Doom Debates<\/a> to improve the quality of discourse around the risks of frontier AI research, and <a href=\"https:\/\/intelligence.org\/\">MIRI<\/a> to try to <a href=\"https:\/\/ifanyonebuildsit.com\/treaty\">mitigate the risks<\/a>.<\/em><\/p>\n\n\n\n<p>I have been wanting to experiment with open weights language models on the <a href=\"https:\/\/www.fitzsim.org\/blog\/?p=350\"><em>Talos II<\/em><\/a>.<\/p>\n\n\n\n<p>I have a <code>gfx803<\/code> card that I always wanted to use for compute, but it is now out-of-support for <em>ROCm<\/em>. I have made progress getting a <code>gfx1201<\/code> card working on this machine and I wanted to write up all the interesting error messages for reference.<\/p>\n\n\n\n<p>I took a risk and bought a new GPU, the <em>AMD Radeon AI Pro R9700<\/em> (<a href=\"https:\/\/oc.asrock.com\/Graphics-Card\/AMD\/Radeon%20AI%20PRO%20R9700%20Creator%2032GB\/index.us.asp\">ASRock Creator 32GB<\/a>), without knowing if I could get it working with the <em>Talos II<\/em> mainboard, which is now seven years old.<\/p>\n\n\n\n<p>First I realized my existing power supply did not have enough free connectors; I needed a new &#8220;modular&#8221; power supply for the GPU&#8217;s new-style <code>12v-2x6<\/code> power connector (which is actually a 16 pin connector, with an array of 2 x 6 main big pins and 4 little pins at the top). That prerequisite project was nerve-racking but successful. Physically, the card fit fine in the mainboard and <em>EATX<\/em> chassis.<\/p>\n\n\n\n<p>With the latest <em>Debian Trixie<\/em> kernel driver, the card showed up as a <em>PCIe<\/em> device in <code>lspci<\/code> (validating the physical installation) but without displaying the card&#8217;s name.  I figured the driver was not new enough to recognize the card&#8217;s product identifier.  I read online that a <em>Debian<\/em>-derivative&#8217;s <code>6.17<\/code> kernel recognized the card on a different CPU architecture, so I temporarily enabled the <em>Debian<\/em> <code>testing<\/code> repository, installed <code>linux-image<\/code>, and rebooted. Now <code>lspci<\/code> displayed the card&#8217;s name, so that was progress.  But as a side effect of the kernel upgrade, my virtual machines failed to start up.  The <code>libvirtd<\/code> message was:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>qemu-system-ppc64: Can't support 64 kiB guest pages with 4 kiB host pages with this KVM implementation<\/code><\/pre>\n\n\n\n<p>It turned out <em>Debian<\/em> <code>ppc64le<\/code> had changed the default page size from <code>64KiB<\/code> to <code>4KiB<\/code>.  <em>Debian<\/em> though, with its characteristic flexibility, still provided a <code>64KiB<\/code> page-size <code>linux-image<\/code> variant.  With that the virtual machines worked again and the GPU continued to be recognized.<\/p>\n\n\n\n<p>Next I shifted to userspace; the <em>Debian<\/em>-packaged <code>rocminfo<\/code> segfaulted early during its initialization, so I looked upstream and found <a href=\"https:\/\/github.com\/ROCm\/TheRock\"><em>TheRoc<\/em>k<\/a>.<\/p>\n\n\n\n<p>I had lots of initial trouble with <em>TheRock<\/em>&#8216;s <code>CMake<\/code> monorepo\/subprojects; I am not yet sure what&#8217;s up with that, but I suspect it may be <code>ppc64le<\/code>-specific. That said, I was able to make progress by building individual subprojects one-by-one (this is probably a better approach anyway, at this stage of porting).<\/p>\n\n\n\n<p>Eventually I got <code>amd-llvm<\/code> bootstrapped, built with a minimal configuration with <em>Trixie<\/em>&#8216;s <code>gcc<\/code> <code>14.2.0<\/code>.  Then I built <code>amd-llvm<\/code> with itself, in the <em>TheRock<\/em>-recommended configuration, except for the <code>PowerPC<\/code> and <code>AMDGPU<\/code> targets.  Next I built <code>rocminfo<\/code>. It segfaulted in the same place as <em>Debian<\/em>&#8216;s package! Some debugging resulted in <a href=\"https:\/\/www.fitzsim.org\/patches\/0001-rocr-Fix-vDSO-detection-on-ppc64-architectures-in-os.patch\">a patch<\/a> to accommodate <code>ppc64<\/code>&#8216;s <code>vDSO<\/code> naming; that eliminated the segault.<\/p>\n\n\n\n<p>Then <code>rocminfo<\/code> ran and showed both the CPUs as &#8220;Agents&#8221; 0 and 1. But no sign of the GPU.<\/p>\n\n\n\n<p>I further debugged <code>rocminfo<\/code> and found it was traversing <code>sysfs<\/code>, and specifically the <em>AMD<\/em> <em>Kernel Fusion Driver<\/em> (<code>kfd<\/code>) topology. The card did not have an entry there.<\/p>\n\n\n\n<p>I looked at <code>dmesg<\/code> and noticed:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;...] amdgpu 0033:03:00.0: amdgpu: Error parsing VCRAT\n&#91;...] kfd kfd: amdgpu: Error adding device to topology\n&#91;...] kfd kfd: amdgpu: Error initializing KFD node\n&#91;...] kfd kfd: amdgpu: device 1002:7551 NOT added due to errors<\/code><\/pre>\n\n\n\n<p>First I tried building and updating a <code>.deb<\/code> of the <a href=\"https:\/\/gitlab.com\/kernel-firmware\/linux-firmware\"><code>linux-firmware<\/code> from its <code>Git<\/code> repository<\/a>, to rule out the parsing error being caused by an outdated binary-only firmware blob. (This is my one disappointment with the <em>ROCm<\/em> stack; it would be great if the firmware and firmware toolchains were free software.) Rebooting with the new firmware produced the same result.<\/p>\n\n\n\n<p>I looked at the kernel source for that driver, and noticed extra debug <code>printk<\/code>s. <em>Debian<\/em> helpfully enables the <code>CONFIG_DYNAMIC_DEBUG<\/code> kernel option.  I tried dynamically reloading the <code>amdgpu<\/code> driver and various <em>PCIe<\/em> and GPU reset approaches, but I could not get the card back to its after-boot state.  I would have to reboot to test each change.<\/p>\n\n\n\n<p>I added <code>amdgpu.dyndbg=\"+p\"<\/code> to the kernel command line, and that gave me some extra <code>kfd<\/code> messages; with those I narrowed down the failure to the <code>IO link<\/code> entry of the <em>Virtual Component Resource Association Table<\/em> (<code>VCRAT<\/code>).<\/p>\n\n\n\n<p>I re-reviewed <code>dmesg<\/code> and, earlier than the parsing error, there was another clue:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;...] amdgpu: IO link not available for non x86 platforms<\/code><\/pre>\n\n\n\n<p>That message was printed during the creation of the CPU <code>VCRAT<\/code> (in <code>kfd_create_vcrat_image_cpu<\/code>). That was the <code>#else<\/code> branch of a platform-specific <code>#ifdef<\/code>. <code>kfd_create_vcrat_image_gpu<\/code> <mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-light-gray-color\">which<\/mark> did not have a corresponding <code>#ifdef<\/code>; &#8220;this could explain the subsequent parsing failure on the <code>VCRAT<\/code> IO link entry, on <code>ppc64le<\/code>, a non-<code>x86<\/code> platform&#8221;, I thought.<\/p>\n\n\n\n<p>It was time to recompile the <em>Linux<\/em> kernel. <em>Debian<\/em> makes this surprisingly easy; I followed the <a href=\"https:\/\/kernel-team.pages.debian.net\/kernel-handbook\/ch-common-tasks.html\">official instructions<\/a> to build a custom kernel <code>.deb<\/code> with <a href=\"https:\/\/www.fitzsim.org\/patches\/0001-drm-amdgpu-fix-non-x86-GPU-VCRAT-parsing.patch\">my attempted fix<\/a> applied to the <code>amdgpu.ko<\/code> module. Another reboot and no more <code>VCRAT<\/code> parsing failure message in <code>dmesg<\/code>. That seemed like more progress. (Perhaps a more proper solution would be to add <code>IO link<\/code> support to <code>ppc64le<\/code> upstream; I don&#8217;t know if there is an equivalent <em>POWER9<\/em> capability, hardware-wise. For my purposes, I have not yet needed an <code>IO link<\/code>.)<\/p>\n\n\n\n<p><code>rocminfo<\/code> still failed though, albeit in a new way:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>hsa api call failure at: \/TheRock\/rocm-systems\/projects\/rocminfo\/rocminfo.cc:1329\nCall returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.<\/code><\/pre>\n\n\n\n<p>The co-timed <code>dmesg<\/code> messages were:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;...] amdgpu 0033:03:00.0: amdgpu: bo 00000000bdd46d97 va 0x0ffffffbfe-0x0ffffffc1d conflict with 0x0ffffffc00-0x0ffffffe00\n&#91;...] amdgpu: Failed to map VA 0xffffffbfe000 in vm. ret -22\n&#91;...] amdgpu: Failed to map bo to gpuvm<\/code><\/pre>\n\n\n\n<p>I analyzed the section of kernel driver code that generated those messages and noticed the use of <code>AMDGPU_GPU_PAGE_SIZE<\/code> in range calculations. It is hard-coded to <code>4096<\/code>.<\/p>\n\n\n\n<p>I had a hunch that the driver needed the kernel&#8217;s page size to match.  I did a quick side quest to <a href=\"https:\/\/wiki.raptorcs.com\/wiki\/Virtualization\">change all my virtual machines to use <code>4KiB<\/code> pages<\/a>, reconfigured my custom <em>Debian<\/em> kernel for <code>4KiB<\/code> pages, and rebooted again.<\/p>\n\n\n\n<p>Now the virtual machines loaded, and finally <code>rocminfo<\/code> <a href=\"https:\/\/www.fitzsim.org\/screenshots\/ppc64le-amd-radeon-ai-pro-r9700-rocminfo.txt\">showed the card&#8217;s information<\/a>!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;...]\n*******                  \nAgent 3                  \n*******                  \n  Name:                    gfx1201                            \n  Uuid:                    GPU-6413e1798933ffe0               \n  Marketing Name:          AMD Radeon Graphics\n&#91;...]<\/code><\/pre>\n\n\n\n<p>I think <em>Debian<\/em>&#8216;s decision to use <code>4KiB<\/code> pages is sensible, likewise <code>amdgpu<\/code>&#8216;s assuming <code>4KiB<\/code> pages, so I&#8217;m happy to have done this reconfiguration. I was only using <code>64KiB<\/code> pages because it was the default when I first installed the operating system on the <em>Talos II<\/em>.<\/p>\n\n\n\n<p>The rest of the process was a grind through <em>TheRock<\/em> subprojects with a bunch of build failure workarounds. The hardest one was fixing <code>static_assert<\/code> failures about <code>__bf16<\/code>, reported by <code>clang<\/code>, when building <code>hipblaslt<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>In file included from \/TheRock\/rocm-libraries\/projects\/hipblaslt\/tensilelite\/include\/Tensile\/DataTypes.hpp:42:\nIn file included from \/opt\/rocm\/include\/hip\/hip_fp8.h:30:\nIn file included from \/opt\/rocm\/include\/hip\/amd_detail\/amd_hip_fp8.h:67:\n\/opt\/rocm\/include\/hip\/amd_detail\/amd_hip_bf16.h:155:15: error: static assertion failed due to\n      requirement 'sizeof(__bf16) == sizeof(unsigned short)'\n  155 | static_assert(sizeof(__bf16) == sizeof(unsigned short));\n      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\/opt\/rocm\/include\/hip\/amd_detail\/amd_hip_bf16.h:155:30: note: expression evaluates to\n      '0 == 2'\n  155 | static_assert(sizeof(__bf16) == sizeof(unsigned short));\n      |               ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~<\/code><\/pre>\n\n\n\n<p>Debugging led me to <a href=\"https:\/\/www.fitzsim.org\/patches\/clang-amdgpu-bfloat16.patch\">a workaround<\/a>; this must be a configuration issue with how I built <code>amd-llvm<\/code>. It needs further investigation, but building <code>amd-llvm<\/code> with that hack meant that I could successfully build <code>hipblaslt<\/code>.<\/p>\n\n\n\n<p>The other <em>TheRock<\/em> dependencies were easier to build; I continued until I had all the dependencies of <code>llama.cpp<\/code> as-configured for <em>ROCm<\/em>, per <a href=\"https:\/\/strixhalo.wiki\/AI\/llamacpp-with-ROCm\">this wiki entry<\/a>.<\/p>\n\n\n\n<p>At this point <code>llama.cpp<\/code> built successfully and ran, with good performance!<\/p>\n\n\n\n<p>Here is an <a href=\"https:\/\/www.fitzsim.org\/screenshots\/ppc64le-amd-radeon-ai-pro-r9700-llama-cpp-chatgpt-oss-20B-lisp-hello-world.txt\">example run with a simple prompt<\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ uname -srom\nLinux 6.17.13 ppc64le GNU\/Linux\n$ build\/bin\/llama-cli -fa 1 -m ~\/models\/chatgpt-oss-20b\/gpt-oss-20b-F16.gguf \nLoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory\nggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no\nggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no\nggml_cuda_init: found 1 ROCm devices:\n  Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32\n\nLoading model... \n\n\n?? ??\n?? ??\n?? ??  ???? ????????  ????    ????? ????? ?????\n?? ?? ????? ?? ?? ?? ?????    ??    ?? ?? ?? ??\n?? ?? ????? ?? ?? ?? ????? ?? ????? ????? ?????\n                                    ??    ??\n                                    ??    ??\n\nbuild      : b7533-c18428423\nmodel      : gpt-oss-20b-F16.gguf\nmodalities : text\n\navailable commands:\n  \/exit or Ctrl+C     stop or exit\n  \/regen              regenerate the last response\n  \/clear              clear the chat history\n  \/read               add a text file\n\n\n&gt; write hello world in common lisp\nwrite hello world in common lisp\n\n&lt;|channel|&gt;analysis&lt;|message|&gt;We need to answer: \"write hello world in common lisp\". The user wants a \"Hello, world\" program in Common Lisp. Likely they want code snippet. We should provide simple Common Lisp code: `(print \"Hello, World!\")` or `(format t \"Hello, World~%\")`. Provide explanation. Also maybe show how to run with SBCL. Let's produce a response.&lt;|end|&gt;&lt;|start|&gt;assistant&lt;|channel|&gt;final&lt;|message|&gt;Here\u2019s a minimal \u201cHello, World!\u201d program in **Common?Lisp**.  \nYou can copy it into a file (e.g. `hello.lisp`) and run it with any Common?Lisp implementation such as SBCL, CLisp, or CCL.\n\n```lisp\n;; hello.lisp\n;; Common Lisp \"Hello, World!\" example\n\n;; Print the string to the default output stream.\n(format t \"Hello, World!~%\")\n```\n\n### How to run it\n\n```bash\n# Using SBCL\nsbcl --script hello.lisp\n\n# Using CLisp\nclisp hello.lisp\n\n# Using CCL\nccl hello.lisp\n```\n\n### What each part does\n\n- `format` \u2013 a versatile printing function.\n- `t` \u2013 the target output stream (`*standard-output*`).\n- `\"Hello, World!~%\"` \u2013 the string to print, where `~%` inserts a newline.\n- `--script` (SBCL) or similar options let the interpreter treat the file as a script and exit automatically after execution.\n\nThat\u2019s all you need to see \u201cHello, World!\u201d on your terminal in Common?Lisp!\n\n&#91; Prompt: 8.7 t\/s | Generation: 103.1 t\/s ]\n\n&gt;<\/code><\/pre>\n\n\n\n<p>I am very happy with this result!  I will see how stable it is in the coming days.<\/p>\n\n\n\n<p>Now that I have a working proof-of-concept I will try to upstream some patches and ideally make the top-level <em>TheRock<\/em> build &#8220;just work&#8221; on <code>ppc64le<\/code> <em>Debian<\/em>.<\/p>\n\n\n\n<p><em>Thank yous:<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li> <em>ROCm and <code>amdgpu<\/code> teams for making TheRock and the Linux kernel drivers free software, portable and well-documented.<\/em><\/li>\n\n\n\n<li><em>Debian maintainers for a highly-adaptable operating system.<\/em><\/li>\n\n\n\n<li><em><a href=\"https:\/\/raptorcs.com\/\">Raptor Computer Systems<\/a> team for making <a href=\"https:\/\/raptorcs.com\/content\/base\/products.html\">Talos II<\/a> future-proof.<\/em><\/li>\n\n\n\n<li><em><code>#talos-workstation<\/code> and <code>#debian-ai<\/code> participants for support and feedback.<\/em><\/li>\n\n\n\n<li><em>Rene Cheng for power supply advice.<\/em><\/li>\n\n\n\n<li><em>Matthew Tegelberg for editing.<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI GoF disclaimer: I don&#8217;t expect this blog post to contribute to frontier AI gain-of-function research or I would refrain from publishing it. Please consider supporting Doom Debates to improve the quality of discourse around the risks of frontier AI research, and MIRI to try to mitigate the risks. I have been wanting to experiment &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.fitzsim.org\/blog\/?p=797\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;gfx1201 on POWER9&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-797","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/797","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=797"}],"version-history":[{"count":36,"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/797\/revisions"}],"predecessor-version":[{"id":835,"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/797\/revisions\/835"}],"wp:attachment":[{"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=797"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=797"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.fitzsim.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=797"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}