Notes on shutting down a Proxmox VE cluster via UPS

Rough notes on shutting down the servers in a Proxmox VE environment from a UPS.

Ceph makes this even more complicated, so it is out of scope here.

Pain points

・VMs with HA enabled come back up on their own even after being stopped

 → the VMs need to be taken out of HA management

・There is no such setting as "temporarily exclude a VM from HA"
 → the VM has to be removed from HA as a permanent configuration change
  & after the restart, the VM also has to be put back under HA

・Enabling maintenance mode on a PVE server while VMs are running on it makes all VM operations impossible
 → it does not automatically migrate the running VMs to other servers

・HA-managed VMs are configured in /etc/pve/ha/resources.cfg, but as of v8.3 no command/API is provided to change it
 → the text file has to be edited directly
  & /etc/pve is shared across the whole PVE cluster, so beware of edit conflicts

・The PVE cluster master is not elected immediately at boot
 → wait after boot until a master is elected? (how do you detect that a master has been elected?)
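For the last point, one approach (untested) would be to just poll `ha-manager status` until a master line appears. A minimal sketch; the 5-second interval and 300-second default timeout are arbitrary choices, and `HA_STATUS_CMD` is an override hook I added for testing, not a PVE feature:

```shell
#!/bin/bash
# Poll the cluster until an HA master has been elected.
# HA_STATUS_CMD is an override hook (my own assumption, not part of PVE);
# it defaults to the real "ha-manager status".
HA_STATUS_CMD=${HA_STATUS_CMD:-"ha-manager status"}

wait_for_ha_master() {
    local timeout=${1:-300} waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # once a master exists, the status output contains "master <node> (...)"
        if $HA_STATUS_CMD 2>/dev/null | grep -q '^master '; then
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    return 1
}
```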

Tentative implementation plan

The shutdown sequence is as follows:

1) On the PVE master, save the current /etc/pve/ha/resources.cfg
2) On the PVE master, rewrite /etc/pve/ha/resources.cfg
3) Stop the running VMs/containers
4) Switch the PVE servers to maintenance mode?
5) Execute the shutdown

The startup sequence is as follows:

1) Boot begins
2) On every PVE server, wait until the PVE cluster is running
3) On the PVE master, restore the backed-up /etc/pve/ha/resources.cfg
4) Release the PVE servers from maintenance mode?
5) VMs/containers start according to what resources.cfg says?

The UPSS-SDB04 shutdown box from the UPSS lineup has published Proxmox verification results; I wonder what their implementation looks like.

Unverified implementation example

An implementation sample, still unverified.

Processing to run at shutdown

About the shutdown processing

When no VMs are registered as HA resources, the `ha-manager status` output is only "quorum OK". This script therefore backs up the config file only when the `ha-manager status` output contains a "master <hostname> (<status>)" line, i.e. only when HA is actually configured.

When HA is configured, rewriting the state in resources.cfg to "stopped" causes the VMs to be shut down.
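For reference, a resources.cfg entry looks roughly like the fragment below (the VM ID and group name here are invented for illustration), and the sed rewrite in the script simply flips the state field:

```shell
# Illustrative resources.cfg entry; the VM ID and group are made up here.
cat <<'EOF' > /tmp/resources.cfg.example
vm: 100
        group mygroup
        state started
EOF
# Same rewrite as in the script: every "started" becomes "stopped"
sed -i 's/started/stopped/' /tmp/resources.cfg.example
grep state /tmp/resources.cfg.example
```

Note that this blunt pattern would also rewrite any other occurrence of "started" in the file (inside a comment, for instance); anchoring on the state line, e.g. `sed -i 's/^\(\s*state \)started/\1stopped/'`, would be safer.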

#!/bin/bash
# Function to check the VM status
check_vms_status() {
    # Get the IDs of running VMs (third column of "qm list" is the status)
    running_vms=$(qm list | awk '$3 == "running" {print $1}')
    if [ -z "$running_vms" ]; then
        echo "All virtual machines are stopped."
        return 0  # everything stopped
    else
        echo "The following virtual machines are running: $running_vms"
        return 1  # some VMs still running
    fi
}

# Get the PVE master name
mastername=$(ha-manager status | grep 'master' | awk '{print $2}')

# Processing to run only on the master
if [ "$mastername" == "$(hostname)" ]; then
   # Create a backup of resources.cfg
   cp /etc/pve/ha/resources.cfg /etc/pve/ha/resources.cfg.upstmp
   # This rewrite causes the HA-managed VMs to be shut down
   sed -i 's/started/stopped/' /etc/pve/ha/resources.cfg
   # Shut down the VMs not managed by HA
   for nodename in $(pvesh ls /nodes/ | awk '{ print $2 }')
   do
      for vmid in $(pvesh ls /nodes/$nodename/qemu/ | awk '{ print $2 }')
      do
         pvesh create /nodes/$nodename/qemu/$vmid/status/shutdown
      done
   done
else
   # Servers other than the master just delay
   sleep 10
fi

# Check whether the VMs have stopped
check_vms_status
status=$?
while [ $status -ne 0 ]
do
   echo "Virtual machines are still running. Checking again..."
   sleep 5
   check_vms_status
   status=$?
done

# Once all VMs are stopped, switch to maintenance mode
echo "Putting the Proxmox server into maintenance mode..."
ha-manager crm-command node-maintenance enable "$(hostname)"
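A boot-time counterpart (also untested) could restore the saved file on the master only. The sketch below takes the master name and the paths as parameters; the backup path matches the `.upstmp` name used in the shutdown script:

```shell
#!/bin/bash
# Restore the HA resource states saved before shutdown (untested sketch).
# $1 = elected master name, $2 = backup path, $3 = live resources.cfg path
restore_ha_resources() {
    local master=$1
    local backup=${2:-/etc/pve/ha/resources.cfg.upstmp}
    local target=${3:-/etc/pve/ha/resources.cfg}
    # Only the master touches the shared /etc/pve filesystem
    if [ "$master" = "$(hostname)" ] && [ -f "$backup" ]; then
        cp "$backup" "$target"
        rm -f "$backup"
    fi
}
# Intended call, once the cluster is up and a master is known:
# restore_ha_resources "$(ha-manager status | awk '/^master/ {print $2}')"
```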

HPE Morpheus VM Essentials / HVM now supports an Ubuntu 24.04 base, but it has to be built as a separate environment

The documentation was updated quietly, so it is easy to miss, but HPE Morpheus VM Essentials / HVM, formerly Ubuntu 22.04 LTS based, supports an Ubuntu 24.04 LTS base as of v8.0.5.

However, judging from the interface shown when registering nodes to a cluster, Ubuntu 22.04-based nodes apparently have to go into an HPE VM 1.1 cluster and Ubuntu 24.04-based nodes into a separate HPE VM 1.2 cluster. (The documentation as of late May 2025 does not mention cluster versions at all...)

There is no description of updating a currently operated Ubuntu 22.04-based HPE VM 1.1 cluster to an Ubuntu 24.04-based HPE VM 1.2 cluster, so I have some doubts about continuity of operation...

Can NVIDIA vGPU be used with HPE VM Essentials?

I tested whether NVIDIA vGPU works in an HPE VM Essentials environment and confirmed that a vGPU can be assigned to a virtual machine by operating libvirt directly.

Note, however, that the XML configuration HPE VM Essentials appears to keep for a VM and the XML obtained with virsh dumpxml are two different things; changing settings from the HPE Web UI throws away anything edited directly through virsh, so be careful.

Starting and stopping the VM is fine, but configuration changes are not.

The HPE VM Essentials GUI does offer a passthrough setting for assigning devices to a VM, but vGPU devices could not be selected among the passthrough-capable devices. In other words, the only usable approach there is to assign the entire GPU directly to a VM without installing the NVIDIA vGPU driver.

HPE VM Essentials side setup: groundwork

The IOMMU-related settings needed for passthrough were already in place in hpe-vm v8.0.5.1.

What was still needed was a blacklist.conf setting to stop the standard Linux nouveau driver from being used.
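The exact file contents aren't recorded above; what goes in is the standard nouveau blacklist pattern (the file name is just a common convention, and any *.conf under /etc/modprobe.d is read):

```
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

After changing this, rebuild the initramfs (`update-initramfs -u` on Ubuntu) and reboot so that nouveau is no longer bound at early boot.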

After that, I installed the Ubuntu build of the NVIDIA vGPU driver, nvidia-vgpu-ubuntu-570_570.133.10_amd64.deb, and that was it.

With Secure Boot enabled, it even handled enrolling the key into UEFI.

Then run /usr/lib/nvidia/sriov-manage -e ALL and check whether the GPU gets split into virtual functions:

pcuser@vgpuserver:~$ lspci -d 10de: -nnk
08:00.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:157e]
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
root@vgpuserver:~# /usr/lib/nvidia/sriov-manage -e ALL
Enabling VFs on 0000:08:00.0
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 148: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
/usr/lib/nvidia/sriov-manage: line 90: /sys/bus/pci/drivers/nvidia/bind: No such file or directory
root@vgpuserver:~# lspci -d 10de:
08:00.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:00.4 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:00.5 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:00.6 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:00.7 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.1 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.2 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.3 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.4 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.5 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.6 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:01.7 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:02.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:02.1 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:02.2 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
08:02.3 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
root@vgpuserver:~#


If the split works, create /usr/local/lib/systemd/system/nvidia-sriov.service so that this runs at boot, and start it:

root@vgpuserver:~# mkdir -p /usr/local/lib/systemd/system
root@vgpuserver:~# vi /usr/local/lib/systemd/system/nvidia-sriov.service
root@vgpuserver:~# cat /usr/local/lib/systemd/system/nvidia-sriov.service
[Unit]
Description=Enable NVIDIA SR-IOV
Before=pve-guests.service nvidia-vgpud.service nvidia-vgpu-mgr.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target
root@vgpuserver:~# systemctl  daemon-reload
root@vgpuserver:~# systemctl status nvidia-sriov.service
○ nvidia-sriov.service - Enable NVIDIA SR-IOV
     Loaded: loaded (/usr/local/lib/systemd/system/nvidia-sriov.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
root@vgpuserver:~# systemctl enable --now nvidia-sriov.service
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-sriov.service → /usr/local/lib/systemd/system/nvidia-sriov.service.
root@vgpuserver:~# systemctl status nvidia-sriov.service
○ nvidia-sriov.service - Enable NVIDIA SR-IOV
     Loaded: loaded (/usr/local/lib/systemd/system/nvidia-sriov.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2025-05-22 18:08:35 JST; 1s ago
    Process: 2853 ExecStart=/usr/lib/nvidia/sriov-manage -e ALL (code=exited, status=0/SUCCESS)
   Main PID: 2853 (code=exited, status=0/SUCCESS)
        CPU: 44ms

May 22 18:08:35 vgpuserver systemd[1]: Starting Enable NVIDIA SR-IOV...
May 22 18:08:35 vgpuserver sriov-manage[2857]: GPU at 0000:08:00.0 already has VFs enabled.
May 22 18:08:35 vgpuserver systemd[1]: nvidia-sriov.service: Deactivated successfully.
May 22 18:08:35 vgpuserver systemd[1]: Finished Enable NVIDIA SR-IOV.
root@vgpuserver:~#

Reboot and check that nvidia-smi works:


pcuser@vgpuserver:~$ nvidia-smi
Tue May 27 15:39:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.10             Driver Version: 570.133.10     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A2                      On  |   00000000:08:00.0 Off |                  Off |
|  0%   50C    P8              9W /   60W |       0MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
pcuser@vgpuserver:~$ 

HPE VM Essentials side setup: poking at libvirt

As of v8.0.5, vGPU assignment cannot be done through HVM, so the vGPU has to be attached to the VM by manipulating libvirt directly.

The useful reference document is "Configuring the vGPU Manager for a Linux with KVM Hypervisor".

First, run "mdevctl types" to list the device types this device supports, named in the form "nvidia-<three digits>", and at the same time note the PCI device address, such as "0000:08:00.4":

root@vgpuserver:~# mdevctl types
0000:08:00.4
  nvidia-742
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-1B
    Description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=16
  nvidia-743
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-2B
    Description: num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=8
  nvidia-744
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-1Q
    Description: num_heads=4, frl_config=60, framebuffer=1024M, max_resolution=5120x2880, max_instance=16
  nvidia-745
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-2Q
    Description: num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=7680x4320, max_instance=8
  nvidia-746
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-4Q
    Description: num_heads=4, frl_config=60, framebuffer=4096M, max_resolution=7680x4320, max_instance=4
  nvidia-747
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-8Q
    Description: num_heads=4, frl_config=60, framebuffer=8192M, max_resolution=7680x4320, max_instance=2
  nvidia-748
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-16Q
    Description: num_heads=4, frl_config=60, framebuffer=16384M, max_resolution=7680x4320, max_instance=1
  nvidia-749
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-1A
    Description: num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=16
  nvidia-750
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-2A
    Description: num_heads=1, frl_config=60, framebuffer=2048M, max_resolution=1280x1024, max_instance=8
  nvidia-751
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-4A
    Description: num_heads=1, frl_config=60, framebuffer=4096M, max_resolution=1280x1024, max_instance=4
  nvidia-752
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-8A
    Description: num_heads=1, frl_config=60, framebuffer=8192M, max_resolution=1280x1024, max_instance=2
  nvidia-753
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-16A
    Description: num_heads=1, frl_config=60, framebuffer=16384M, max_resolution=1280x1024, max_instance=1
<snip>
  nvidia-753
    Available instances: 1
    Device API: vfio-pci
    Name: NVIDIA A2-16A
    Description: num_heads=1, frl_config=60, framebuffer=16384M, max_resolution=1280x1024, max_instance=1
root@vgpuserver:~#

To use CUDA on a vGPU, pick one of the A2-?Q device types.

Here the definition is created with nvidia-745, the 2GB-RAM A2-2Q on 0000:08:00.4.

First, convert "0000:08:00.4" into the device ID libvirt uses, checked with the virsh nodedev-list command:

root@vgpuserver:~# virsh nodedev-list --cap pci|grep 08_00_4
pci_0000_08_00_4
root@vgpuserver:~# 

"pci_0000_08_00_4" together with "nvidia-745" are the values used in the definition file.

Each definition also needs a unique UUID, so run the "uuidgen" command, take the value, and create the file:

root@vgpuserver:~# uuidgen
73f353e8-7da8-4b76-8182-04c5a1415dec
root@vgpuserver:~# vi vgpu-test.xml
root@vgpuserver:~# cat vgpu-test.xml
<device>
    <parent>pci_0000_08_00_4</parent>
    <capability type="mdev">
        <type id="nvidia-745"/>
        <uuid>73f353e8-7da8-4b76-8182-04c5a1415dec</uuid>
    </capability>
</device>
root@vgpuserver:~#

Register the created file as a nodedev:

root@vgpuserver:~# virsh nodedev-define vgpu-test2.xml
Node device 'mdev_73f353e8_7da8_4b76_8182_04c5a1415dec_0000_08_00_4' defined from 'vgpu-test2.xml'

root@vgpuserver:~# virsh nodedev-list --cap mdev --inactive
mdev_73f353e8_7da8_4b76_8182_04c5a1415dec_0000_08_00_4

root@vgpuserver:~# virsh nodedev-list --cap mdev

root@vgpuserver:~#

Start the registered mdev:

root@vgpuserver:~# virsh nodedev-start mdev_73f353e8_7da8_4b76_8182_04c5a1415dec_0000_08_00_4
Device mdev_73f353e8_7da8_4b76_8182_04c5a1415dec_0000_08_00_4 started

root@vgpuserver:~# virsh nodedev-list --cap mdev
mdev_73f353e8_7da8_4b76_8182_04c5a1415dec_0000_08_00_4

root@vgpuserver:~# virsh nodedev-list --cap mdev --inactive

root@vgpuserver:~#

Attach the mdev device created this way directly to the virtual machine using the virsh command:

root@vgpuserver:~# virsh attach-device almalinux --persistent /dev/stdin <<EOF
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
  <source>
    <address uuid='73f353e8-7da8-4b76-8182-04c5a1415dec'/>
  </source>
</hostdev>
EOF
Device attached successfully

root@vgpuserver:~# 

With that configured, all that remained was to boot the VM and install the vGPU guest driver inside it, and it worked.

However, the mdev configuration is wiped whenever settings are changed from the HVM management UI, which makes it painful for real use.
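The nodedev definition itself survives (it was registered with `virsh nodedev-define`), so one workaround I would try, untested, is to mark it autostart and re-attach the hostdev after each UI-side change. `reattach_vgpu` is a helper name I made up, and it relies on `virsh nodedev-autostart` being available (libvirt 7.8 or later); the example values are the ones from the steps above:

```shell
#!/bin/bash
# Re-apply a vGPU mdev to a domain after the HVM UI has discarded it.
# Untested sketch; assumes virsh nodedev-autostart exists (libvirt >= 7.8).
reattach_vgpu() {
    local dom=$1 nodedev=$2 uuid=$3
    # Have libvirt recreate the mdev automatically from now on
    virsh nodedev-autostart "$nodedev"
    # Re-add the <hostdev> entry that the UI edit wiped from the domain XML
    virsh attach-device "$dom" --persistent /dev/stdin <<EOF
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on'>
  <source>
    <address uuid='$uuid'/>
  </source>
</hostdev>
EOF
}
# Example with the values used earlier:
# reattach_vgpu almalinux \
#     mdev_73f353e8_7da8_4b76_8182_04c5a1415dec_0000_08_00_4 \
#     73f353e8-7da8-4b76-8182-04c5a1415dec
```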

The confusing path to installing CUDA-enabled TensorFlow

Notes on being misled by the documentation: after building an environment where CUDA runs on an NVIDIA GPU under AlmaLinux 9, I figured I'd run TensorFlow as a quick functional test.

The first document I consulted was "Install TensorFlow 2".

It says both the CPU and GPU variants can be installed with "pip install tensorflow".

So, install:

[testuser@vgpu ~]$ pip list
Package         Version
--------------- --------
dbus-python     1.2.18
distlib         0.3.2
distro          1.5.0
filelock        3.7.1
gpg             1.15.1
libcomps        0.1.18
nftables        0.1
packaging       20.9
pip             21.3.1
platformdirs    2.5.4
pycairo         1.20.1
PyGObject       3.40.1
pyparsing       2.4.7
python-dateutil 2.8.1
PyYAML          5.4.1
rpm             4.16.1.3
selinux         3.6
sepolicy        3.6
setools         4.4.4
setuptools      53.0.0
six             1.15.0
systemd-python  234
virtualenv      20.21.1
[testuser@vgpu ~]$ pip install tensorflow
Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Downloading tensorflow-2.19.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (644.8 MB)
     |████████████████████████████████| 644.8 MB 18 kB/s
Collecting h5py>=3.11.0
  Downloading h5py-3.13.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
     |████████████████████████████████| 4.6 MB 92.7 MB/s
Collecting absl-py>=1.0.0
  Downloading absl_py-2.2.2-py3-none-any.whl (135 kB)
     |████████████████████████████████| 135 kB 85.3 MB/s
Collecting typing-extensions>=3.6.6
  Downloading typing_extensions-4.13.2-py3-none-any.whl (45 kB)
     |████████████████████████████████| 45 kB 9.4 MB/s
Collecting grpcio<2.0,>=1.24.3
  Downloading grpcio-1.71.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
     |████████████████████████████████| 5.9 MB 87.9 MB/s
Collecting tensorflow-io-gcs-filesystem>=0.23.1
  Downloading tensorflow_io_gcs_filesystem-0.37.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
     |████████████████████████████████| 5.1 MB 89.6 MB/s
Collecting google-pasta>=0.1.1
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
     |████████████████████████████████| 57 kB 27.5 MB/s
Requirement already satisfied: packaging in /usr/lib/python3.9/site-packages (from tensorflow) (20.9)
Collecting wrapt>=1.11.0
  Downloading wrapt-1.17.2-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (82 kB)
     |████████████████████████████████| 82 kB 5.8 MB/s
Collecting astunparse>=1.6.0
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers>=24.3.25
  Downloading flatbuffers-25.2.10-py2.py3-none-any.whl (30 kB)
Requirement already satisfied: six>=1.12.0 in /usr/lib/python3.9/site-packages (from tensorflow) (1.15.0)
Collecting requests<3,>=2.21.0
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 21.4 MB/s
Collecting termcolor>=1.1.0
  Downloading termcolor-3.1.0-py3-none-any.whl (7.7 kB)
Collecting ml-dtypes<1.0.0,>=0.5.1
  Downloading ml_dtypes-0.5.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
     |████████████████████████████████| 4.7 MB 98 kB/s
Collecting keras>=3.5.0
  Downloading keras-3.9.2-py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 62.7 MB/s
Requirement already satisfied: setuptools in /usr/lib/python3.9/site-packages (from tensorflow) (53.0.0)
Collecting numpy<2.2.0,>=1.26.0
  Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
     |████████████████████████████████| 19.5 MB 53 kB/s
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3
  Downloading protobuf-5.29.4-cp38-abi3-manylinux2014_x86_64.whl (319 kB)
     |████████████████████████████████| 319 kB 80.5 MB/s
Collecting opt-einsum>=2.3.2
  Downloading opt_einsum-3.4.0-py3-none-any.whl (71 kB)
     |████████████████████████████████| 71 kB 2.0 MB/s
Collecting tensorboard~=2.19.0
  Downloading tensorboard-2.19.0-py3-none-any.whl (5.5 MB)
     |████████████████████████████████| 5.5 MB 88.3 MB/s
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1
  Downloading gast-0.6.0-py3-none-any.whl (21 kB)
Collecting libclang>=13.0.0
  Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl (24.5 MB)
     |████████████████████████████████| 24.5 MB 55 kB/s
Collecting wheel<1.0,>=0.23.0
  Downloading wheel-0.45.1-py3-none-any.whl (72 kB)
     |████████████████████████████████| 72 kB 5.3 MB/s
Collecting optree
  Downloading optree-0.15.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (397 kB)
     |████████████████████████████████| 397 kB 74.3 MB/s
Collecting namex
  Downloading namex-0.0.9-py3-none-any.whl (5.8 kB)
Collecting rich
  Downloading rich-14.0.0-py3-none-any.whl (243 kB)
     |████████████████████████████████| 243 kB 78.5 MB/s
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (149 kB)
     |████████████████████████████████| 149 kB 79.2 MB/s
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.4.0-py3-none-any.whl (128 kB)
     |████████████████████████████████| 128 kB 74.9 MB/s
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     |████████████████████████████████| 70 kB 33.1 MB/s
Collecting certifi>=2017.4.17
  Downloading certifi-2025.4.26-py3-none-any.whl (159 kB)
     |████████████████████████████████| 159 kB 78.6 MB/s
Collecting werkzeug>=1.0.1
  Downloading werkzeug-3.1.3-py3-none-any.whl (224 kB)
     |████████████████████████████████| 224 kB 67.2 MB/s
Collecting markdown>=2.6.8
  Downloading markdown-3.8-py3-none-any.whl (106 kB)
     |████████████████████████████████| 106 kB 55.2 MB/s
Collecting tensorboard-data-server<0.8.0,>=0.7.0
  Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 61.3 MB/s
Requirement already satisfied: pyparsing>=2.0.2 in /usr/lib/python3.9/site-packages (from packaging->tensorflow) (2.4.7)
Collecting importlib-metadata>=4.4
  Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB)
Collecting MarkupSafe>=2.1.1
  Downloading MarkupSafe-3.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20 kB)
Collecting markdown-it-py>=2.2.0
  Downloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
     |████████████████████████████████| 87 kB 29.4 MB/s
Collecting pygments<3.0.0,>=2.13.0
  Downloading pygments-2.19.1-py3-none-any.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 77.3 MB/s
Collecting zipp>=3.20
  Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB)
Collecting mdurl~=0.1
  Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: zipp, mdurl, typing-extensions, pygments, numpy, MarkupSafe, markdown-it-py, importlib-metadata, wheel, werkzeug, urllib3, tensorboard-data-server, rich, protobuf, optree, namex, ml-dtypes, markdown, idna, h5py, grpcio, charset-normalizer, certifi, absl-py, wrapt, termcolor, tensorflow-io-gcs-filesystem, tensorboard, requests, opt-einsum, libclang, keras, google-pasta, gast, flatbuffers, astunparse, tensorflow
Successfully installed MarkupSafe-3.0.2 absl-py-2.2.2 astunparse-1.6.3 certifi-2025.4.26 charset-normalizer-3.4.2 flatbuffers-25.2.10 gast-0.6.0 google-pasta-0.2.0 grpcio-1.71.0 h5py-3.13.0 idna-3.10 importlib-metadata-8.7.0 keras-3.9.2 libclang-18.1.1 markdown-3.8 markdown-it-py-3.0.0 mdurl-0.1.2 ml-dtypes-0.5.1 namex-0.0.9 numpy-2.0.2 opt-einsum-3.4.0 optree-0.15.0 protobuf-5.29.4 pygments-2.19.1 requests-2.32.3 rich-14.0.0 tensorboard-2.19.0 tensorboard-data-server-0.7.2 tensorflow-2.19.0 tensorflow-io-gcs-filesystem-0.37.1 termcolor-3.1.0 typing-extensions-4.13.2 urllib3-2.4.0 werkzeug-3.1.3 wheel-0.45.1 wrapt-1.17.2 zipp-3.21.0
[testuser@vgpu ~]$

Then run the following as a test: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

[testuser@vgpu ~]$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2025-05-15 11:35:31.898614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747276531.922384    2049 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747276531.929813    2049 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1747276531.948919    2049 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747276531.948946    2049 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747276531.948951    2049 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747276531.948954    2049 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-15 11:35:31.954990: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1747276535.245548    2049 gpu_device.cc:2341] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
[testuser@vgpu ~]$

It says to see https://www.tensorflow.org/install/gpu for GPU support...

That page says to install with "pip install 'tensorflow[and-cuda]'":


[testuser@vgpu ~]$ pip list
Package                      Version
---------------------------- ---------
absl-py                      2.2.2
astunparse                   1.6.3
certifi                      2025.4.26
charset-normalizer           3.4.2
dbus-python                  1.2.18
distlib                      0.3.2
distro                       1.5.0
filelock                     3.7.1
flatbuffers                  25.2.10
gast                         0.6.0
google-pasta                 0.2.0
gpg                          1.15.1
grpcio                       1.71.0
h5py                         3.13.0
idna                         3.10
importlib_metadata           8.7.0
keras                        3.9.2
libclang                     18.1.1
libcomps                     0.1.18
Markdown                     3.8
markdown-it-py               3.0.0
MarkupSafe                   3.0.2
mdurl                        0.1.2
ml_dtypes                    0.5.1
namex                        0.0.9
nftables                     0.1
numpy                        2.0.2
opt_einsum                   3.4.0
optree                       0.15.0
packaging                    20.9
pip                          21.3.1
platformdirs                 2.5.4
protobuf                     5.29.4
pycairo                      1.20.1
Pygments                     2.19.1
PyGObject                    3.40.1
pyparsing                    2.4.7
python-dateutil              2.8.1
PyYAML                       5.4.1
requests                     2.32.3
rich                         14.0.0
rpm                          4.16.1.3
selinux                      3.6
sepolicy                     3.6
setools                      4.4.4
setuptools                   53.0.0
six                          1.15.0
systemd-python               234
tensorboard                  2.19.0
tensorboard-data-server      0.7.2
tensorflow                   2.19.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor                    3.1.0
typing_extensions            4.13.2
urllib3                      2.4.0
virtualenv                   20.21.1
Werkzeug                     3.1.3
wheel                        0.45.1
wrapt                        1.17.2
zipp                         3.21.0
[testuser@vgpu ~]$ pip install 'tensorflow[and-cuda]'
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: tensorflow[and-cuda] in ./.local/lib/python3.9/site-packages (2.19.0)
Requirement already satisfied: flatbuffers>=24.3.25 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (25.2.10)
Requirement already satisfied: google-pasta>=0.1.1 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (0.2.0)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (0.6.0)
Requirement already satisfied: termcolor>=1.1.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (3.1.0)
Requirement already satisfied: wrapt>=1.11.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (1.17.2)
Requirement already satisfied: opt-einsum>=2.3.2 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (3.4.0)
Requirement already satisfied: setuptools in /usr/lib/python3.9/site-packages (from tensorflow[and-cuda]) (53.0.0)
Requirement already satisfied: typing-extensions>=3.6.6 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (4.13.2)
Requirement already satisfied: six>=1.12.0 in /usr/lib/python3.9/site-packages (from tensorflow[and-cuda]) (1.15.0)
Requirement already satisfied: numpy<2.2.0,>=1.26.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (2.0.2)
Requirement already satisfied: absl-py>=1.0.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (2.2.2)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (5.29.4)
Requirement already satisfied: requests<3,>=2.21.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (2.32.3)
Requirement already satisfied: keras>=3.5.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (3.9.2)
Requirement already satisfied: h5py>=3.11.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (3.13.0)
Requirement already satisfied: astunparse>=1.6.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (1.6.3)
Requirement already satisfied: ml-dtypes<1.0.0,>=0.5.1 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (0.5.1)
Requirement already satisfied: tensorboard~=2.19.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (2.19.0)
Requirement already satisfied: libclang>=13.0.0 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (18.1.1)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (1.71.0)
Requirement already satisfied: packaging in /usr/lib/python3.9/site-packages (from tensorflow[and-cuda]) (20.9)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in ./.local/lib/python3.9/site-packages (from tensorflow[and-cuda]) (0.37.1)
Collecting nvidia-cufft-cu12==11.2.3.61
  Downloading nvidia_cufft_cu12-11.2.3.61-py3-none-manylinux2014_x86_64.whl (192.5 MB)
     |████████████████████████████████| 192.5 MB 52 kB/s
Collecting nvidia-cublas-cu12==12.5.3.2
  Downloading nvidia_cublas_cu12-12.5.3.2-py3-none-manylinux2014_x86_64.whl (363.3 MB)
     |████████████████████████████████| 363.3 MB 50 kB/s
Collecting nvidia-cudnn-cu12==9.3.0.75
  Downloading nvidia_cudnn_cu12-9.3.0.75-py3-none-manylinux2014_x86_64.whl (577.2 MB)
     |████████████████████████████████| 577.2 MB 65 kB/s
Collecting nvidia-nvjitlink-cu12==12.5.82
  Downloading nvidia_nvjitlink_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl (21.3 MB)
     |████████████████████████████████| 21.3 MB 67 kB/s
Collecting nvidia-cuda-runtime-cu12==12.5.82
  Downloading nvidia_cuda_runtime_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl (895 kB)
     |████████████████████████████████| 895 kB 38.5 MB/s
Collecting nvidia-nccl-cu12==2.23.4
  Downloading nvidia_nccl_cu12-2.23.4-py3-none-manylinux2014_x86_64.whl (199.0 MB)
     |████████████████████████████████| 199.0 MB 53 kB/s
Collecting nvidia-cusparse-cu12==12.5.1.3
  Downloading nvidia_cusparse_cu12-12.5.1.3-py3-none-manylinux2014_x86_64.whl (217.6 MB)
     |████████████████████████████████| 217.6 MB 51 kB/s
Collecting nvidia-cuda-nvcc-cu12==12.5.82
  Downloading nvidia_cuda_nvcc_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl (22.5 MB)
     |████████████████████████████████| 22.5 MB 56 kB/s
Collecting nvidia-cuda-cupti-cu12==12.5.82
  Downloading nvidia_cuda_cupti_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl (13.8 MB)
     |████████████████████████████████| 13.8 MB 47 kB/s
Collecting nvidia-cuda-nvrtc-cu12==12.5.82
  Downloading nvidia_cuda_nvrtc_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl (24.9 MB)
     |████████████████████████████████| 24.9 MB 52 kB/s
Collecting nvidia-curand-cu12==10.3.6.82
  Downloading nvidia_curand_cu12-10.3.6.82-py3-none-manylinux2014_x86_64.whl (56.3 MB)
     |████████████████████████████████| 56.3 MB 49 kB/s
Collecting nvidia-cusolver-cu12==11.6.3.83
  Downloading nvidia_cusolver_cu12-11.6.3.83-py3-none-manylinux2014_x86_64.whl (130.3 MB)
     |████████████████████████████████| 130.3 MB 51 kB/s
Requirement already satisfied: wheel<1.0,>=0.23.0 in ./.local/lib/python3.9/site-packages (from astunparse>=1.6.0->tensorflow[and-cuda]) (0.45.1)
Requirement already satisfied: rich in ./.local/lib/python3.9/site-packages (from keras>=3.5.0->tensorflow[and-cuda]) (14.0.0)
Requirement already satisfied: namex in ./.local/lib/python3.9/site-packages (from keras>=3.5.0->tensorflow[and-cuda]) (0.0.9)
Requirement already satisfied: optree in ./.local/lib/python3.9/site-packages (from keras>=3.5.0->tensorflow[and-cuda]) (0.15.0)
Requirement already satisfied: certifi>=2017.4.17 in ./.local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow[and-cuda]) (2025.4.26)
Requirement already satisfied: idna<4,>=2.5 in ./.local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow[and-cuda]) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./.local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow[and-cuda]) (2.4.0)
Requirement already satisfied: charset-normalizer<4,>=2 in ./.local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow[and-cuda]) (3.4.2)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in ./.local/lib/python3.9/site-packages (from tensorboard~=2.19.0->tensorflow[and-cuda]) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in ./.local/lib/python3.9/site-packages (from tensorboard~=2.19.0->tensorflow[and-cuda]) (3.1.3)
Requirement already satisfied: markdown>=2.6.8 in ./.local/lib/python3.9/site-packages (from tensorboard~=2.19.0->tensorflow[and-cuda]) (3.8)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/lib/python3.9/site-packages (from packaging->tensorflow[and-cuda]) (2.4.7)
Requirement already satisfied: importlib-metadata>=4.4 in ./.local/lib/python3.9/site-packages (from markdown>=2.6.8->tensorboard~=2.19.0->tensorflow[and-cuda]) (8.7.0)
Requirement already satisfied: MarkupSafe>=2.1.1 in ./.local/lib/python3.9/site-packages (from werkzeug>=1.0.1->tensorboard~=2.19.0->tensorflow[and-cuda]) (3.0.2)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in ./.local/lib/python3.9/site-packages (from rich->keras>=3.5.0->tensorflow[and-cuda]) (2.19.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in ./.local/lib/python3.9/site-packages (from rich->keras>=3.5.0->tensorflow[and-cuda]) (3.0.0)
Requirement already satisfied: zipp>=3.20 in ./.local/lib/python3.9/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard~=2.19.0->tensorflow[and-cuda]) (3.21.0)
Requirement already satisfied: mdurl~=0.1 in ./.local/lib/python3.9/site-packages (from markdown-it-py>=2.2.0->rich->keras>=3.5.0->tensorflow[and-cuda]) (0.1.2)
Installing collected packages: nvidia-nvjitlink-cu12, nvidia-cusparse-cu12, nvidia-cublas-cu12, nvidia-nccl-cu12, nvidia-cusolver-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-nvcc-cu12, nvidia-cuda-cupti-cu12
Successfully installed nvidia-cublas-cu12-12.5.3.2 nvidia-cuda-cupti-cu12-12.5.82 nvidia-cuda-nvcc-cu12-12.5.82 nvidia-cuda-nvrtc-cu12-12.5.82 nvidia-cuda-runtime-cu12-12.5.82 nvidia-cudnn-cu12-9.3.0.75 nvidia-cufft-cu12-11.2.3.61 nvidia-curand-cu12-10.3.6.82 nvidia-cusolver-cu12-11.6.3.83 nvidia-cusparse-cu12-12.5.1.3 nvidia-nccl-cu12-2.23.4 nvidia-nvjitlink-cu12-12.5.82
[testuser@vgpu ~]$
[testuser@vgpu ~]$ pip list
Package                      Version
---------------------------- ---------
absl-py                      2.2.2
astunparse                   1.6.3
certifi                      2025.4.26
charset-normalizer           3.4.2
dbus-python                  1.2.18
distlib                      0.3.2
distro                       1.5.0
filelock                     3.7.1
flatbuffers                  25.2.10
gast                         0.6.0
google-pasta                 0.2.0
gpg                          1.15.1
grpcio                       1.71.0
h5py                         3.13.0
idna                         3.10
importlib_metadata           8.7.0
keras                        3.9.2
libclang                     18.1.1
libcomps                     0.1.18
Markdown                     3.8
markdown-it-py               3.0.0
MarkupSafe                   3.0.2
mdurl                        0.1.2
ml_dtypes                    0.5.1
namex                        0.0.9
nftables                     0.1
numpy                        2.0.2
nvidia-cublas-cu12           12.5.3.2
nvidia-cuda-cupti-cu12       12.5.82
nvidia-cuda-nvcc-cu12        12.5.82
nvidia-cuda-nvrtc-cu12       12.5.82
nvidia-cuda-runtime-cu12     12.5.82
nvidia-cudnn-cu12            9.3.0.75
nvidia-cufft-cu12            11.2.3.61
nvidia-curand-cu12           10.3.6.82
nvidia-cusolver-cu12         11.6.3.83
nvidia-cusparse-cu12         12.5.1.3
nvidia-nccl-cu12             2.23.4
nvidia-nvjitlink-cu12        12.5.82
opt_einsum                   3.4.0
optree                       0.15.0
packaging                    20.9
pip                          21.3.1
platformdirs                 2.5.4
protobuf                     5.29.4
pycairo                      1.20.1
Pygments                     2.19.1
PyGObject                    3.40.1
pyparsing                    2.4.7
python-dateutil              2.8.1
PyYAML                       5.4.1
requests                     2.32.3
rich                         14.0.0
rpm                          4.16.1.3
selinux                      3.6
sepolicy                     3.6
setools                      4.4.4
setuptools                   53.0.0
six                          1.15.0
systemd-python               234
tensorboard                  2.19.0
tensorboard-data-server      0.7.2
tensorflow                   2.19.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor                    3.1.0
typing_extensions            4.13.2
urllib3                      2.4.0
virtualenv                   20.21.1
Werkzeug                     3.1.3
wheel                        0.45.1
wrapt                        1.17.2
zipp                         3.21.0
[testuser@vgpu ~]$

Quite a few differences from before.

Then run it:


[testuser@vgpu ~]$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2025-05-15 11:45:19.350181: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747277119.373769    2108 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747277119.381364    2108 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1747277119.400331    2108 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747277119.400358    2108 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747277119.400362    2108 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747277119.400365    2108 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-15 11:45:19.406358: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[testuser@vgpu ~]$

So the takeaway: when installing TensorFlow in an NVIDIA GPU environment, you need to do it with "pip install 'tensorflow[and-cuda]'".
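The takeaway above, condensed into a sketch (the quotes around the package spec matter, since `[...]` is a glob pattern in many shells):

```shell
# the [and-cuda] extra pulls the nvidia-* CUDA runtime wheels in alongside TensorFlow
pip install 'tensorflow[and-cuda]'
# quick GPU visibility check, as in the log above
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```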

Updating the firmware on the ATS-Mini DSP radio

I bought a cheap ATS-Mini derivative on AliExpress.

It was 3,222 yen including a loop antenna set.

On power-up, the firmware version was shown as 1.0.1.

Searching for a way to upgrade, I found https://github.com/esp32-si4732/ats-mini

(I found out later that the firmware originally installed came from the same-named but uppercase repository https://github.com/G8PTN/ATS_MINI/, whose source code is not public.)

As described in the documentation, I installed uv:

osakanataro@ubuntuserver:~$ curl -LsSf https://astral.sh/uv/install.sh | sh
downloading uv 0.7.2 x86_64-unknown-linux-gnu
no checksums to verify
installing to /home/osakanataro/.local/bin
  uv
  uvx
everything's installed!
osakanataro@ubuntuserver:~$

Connecting the ATS-Mini to a Linux server, the device is recognized:

osakanataro@ubuntuserver:~$ lsusb
<snip>
Bus 002 Device 006: ID 303a:1001 Espressif USB JTAG/serial debug unit
<snip>
osakanataro@ubuntuserver:~$ lsusb --tree
<snip>
/:  Bus 002.Port 001: Dev 001, Class=root_hub, Driver=xhci_hcd/11p, 480M
<snip>
    |__ Port 007: Dev 006, If 0, Class=Communications, Driver=cdc_acm, 12M
    |__ Port 007: Dev 006, If 1, Class=CDC Data, Driver=cdc_acm, 12M
    |__ Port 007: Dev 006, If 2, Class=Vendor Specific Class, Driver=[none], 12M
<snip>
osakanataro@ubuntuserver:~$

Check which serial port name it was assigned from the dmesg output:

osakanataro@ubuntuserver:~$ sudo dmesg|tail
[sudo] osakanataro のパスワード:
[1695575.346617] usbcore: registered new interface driver cdc_acm
[1695575.346621] cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
[1695776.000954] usb 2-7: USB disconnect, device number 5
[1695849.379292] usb 2-7: new full-speed USB device number 6 using xhci_hcd
[1695849.506664] usb 2-7: New USB device found, idVendor=303a, idProduct=1001, bcdDevice= 1.01
[1695849.506681] usb 2-7: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[1695849.506688] usb 2-7: Product: USB JTAG/serial debug unit
[1695849.506693] usb 2-7: Manufacturer: Espressif
[1695849.506697] usb 2-7: SerialNumber: FC:01:2C:CC:BD:08
[1695849.510971] cdc_acm 2-7:1.0: ttyACM0: USB ACM device
osakanataro@ubuntuserver:~$
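Instead of grepping dmesg by hand, the stable `/dev/serial/by-id/` symlinks (populated by udev on most distros; the path pattern is an assumption) can point at the right port:

```shell
# stable per-device symlinks; the Espressif unit appears here once cdc_acm binds
ls -l /dev/serial/by-id/ 2>/dev/null
# fallback: raw ACM device nodes
ls /dev/ttyACM* 2>/dev/null || echo "no ACM device found"
```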
osakanataro@ubuntuserver:~$ uvx --from esptool esptool.py --chip esp32s3 --port ttyACM0 --baud 921600 read_flash 0x0 ALL original-flash.bin
      Built esptool==4.8.1
Installed 13 packages in 30ms
esptool.py v4.8.1
Serial port SERIAL_PORT

A fatal error occurred: Could not open ttyACM0, the port is busy or doesn't exist.
([Errno 2] could not open port ttyACM0: [Errno 2] No such file or directory: 'ttyACM0')

Hint: Check if the port is correct and ESP connected

osakanataro@ubuntuserver:~$

Ah... it needs the full device path:

osakanataro@ubuntuserver:~$ uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 read_flash 0x0 ALL original-flash.bin
esptool.py v4.8.1
Serial port /dev/ttyACM0

A fatal error occurred: Could not open /dev/ttyACM0, the port is busy or doesn't exist.
([Errno 13] could not open port /dev/ttyACM0: [Errno 13] Permission denied: '/dev/ttyACM0')

Hint: Try to add user into dialout or uucp group.

osakanataro@ubuntuserver:~$

uv and uvx are installed under the user's home directory, so they can't be used with sudo...

osakanataro@ubuntuserver:~$ sudo uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 read_flash 0x0 ALL original-flash.bin
sudo: uvx: コマンドが見つかりません
osakanataro@ubuntuserver:~$

I added myself to both the uucp and dialout groups, but still got permission denied:

osakanataro@ubuntuserver:~$ sudo usermod -G uucp,dialout osakanataro
[sudo] osakanataro のパスワード:
osakanataro@ubuntuserver:~$ uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 read_flash 0x0 ALL original-flash.bin
esptool.py v4.8.1
Serial port /dev/ttyACM0

A fatal error occurred: Could not open /dev/ttyACM0, the port is busy or doesn't exist.
([Errno 13] could not open port /dev/ttyACM0: [Errno 13] Permission denied: '/dev/ttyACM0')

Hint: Try to add user into dialout or uucp group.

osakanataro@ubuntuserver:~$ ls -l /dev/ttyACM0
crw-rw---- 1 root dialout 166, 0  5月  6 15:54 /dev/ttyACM0
osakanataro@ubuntuserver:~$

Hmm???

Googling suggested this can happen when brltty is installed, so I uninstalled it and retried:

osakanataro@ubuntuserver:~$ dpkg -l|grep tty
ii  brltty                                         6.6-4ubuntu5                             amd64        Access software for a blind person using a braille display
ii  libjetty9-java                                 9.4.53-1                                 all          Java servlet engine and webserver -- core libraries
osakanataro@ubuntuserver:~$
osakanataro@ubuntuserver:~$ dpkg -l|grep brltty
ii  brltty                                         6.6-4ubuntu5                             amd64        Access software for a blind person using a braille display
osakanataro@ubuntuserver:~$ sudo apt remove brltty
パッケージリストを読み込んでいます... 完了
依存関係ツリーを作成しています... 完了
状態情報を読み取っています... 完了
以下のパッケージが自動でインストールされましたが、もう必要とされていません:
  libpcre2-32-0
これを削除するには 'sudo apt autoremove' を利用してください。
以下のパッケージは「削除」されます:
  brltty
アップグレード: 0 個、新規インストール: 0 個、削除: 1 個、保留: 7 個。
この操作後に 10.5 MB のディスク容量が解放されます。
続行しますか? [Y/n] y
(データベースを読み込んでいます ... 現在 278045 個のファイルとディレクトリがインストールされています。)
brltty (6.6-4ubuntu5) を削除しています ...
man-db (2.12.0-4build2) のトリガを処理しています ...
エラー: タイムアウトしました
osakanataro@ubuntuserver:~$ uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 read_flash 0x0 ALL original-flash.bin
esptool.py v4.8.1
Serial port /dev/ttyACM0

A fatal error occurred: Could not open /dev/ttyACM0, the port is busy or doesn't exist.
([Errno 13] could not open port /dev/ttyACM0: [Errno 13] Permission denied: '/dev/ttyACM0')

Hint: Try to add user into dialout or uucp group.

osakanataro@ubuntuserver:~$

Apparently that wasn't it.

After a reboot, it ran without a hitch:
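A likely explanation for why the reboot helped: supplementary group changes only take effect in new login sessions, and `usermod -G` (without `-a`) also replaces the existing group list rather than appending to it. A sketch of the safer sequence:

```shell
# Is dialout among the *current* session's supplementary groups?
id -nG | tr ' ' '\n' | grep -x dialout || echo "not effective in this session yet"

# Append (-a!) dialout without dropping the user's other groups:
#   sudo usermod -aG dialout "$USER"
# Then either log out and back in, or pick up the group in a subshell:
#   newgrp dialout
```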

osakanataro@ubuntuserver:~$ uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 read_flash 0x0 ALL original-flash.bin
esptool.py v4.8.1
Serial port /dev/ttyACM0
Connecting...
Chip is ESP32-S3 (QFN56) (revision v0.2)
Features: WiFi, BLE, Embedded PSRAM 8MB (AP_3v3)
Crystal is 40MHz
MAC: fc:01:2c:cc:bd:08
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 921600
Changed.
Configuring flash size...
Detected flash size: 16MB
16777216 (100 %)
16777216 (100 %)
Read 16777216 bytes at 0x00000000 in 273.8 seconds (490.2 kbit/s)...
Hard resetting via RTS pin...
osakanataro@ubuntuserver:~$ ls -l original-flash.bin
-rw-rw-r-- 1 osakanataro osakanataro 16777216  5月  6 17:26 original-flash.bin
osakanataro@ubuntuserver:~$
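For reference, a sketch (untested here) of sanity-checking the dump and later writing it back to restore the original firmware, under the same chip/port assumptions as above:

```shell
# Sanity-check the dump is the full 16 MB (expected size from the log: 16777216 bytes)
if [ -f original-flash.bin ]; then
    test "$(stat -c %s original-flash.bin)" -eq 16777216 && echo "dump size OK"
fi

# Restore: write the whole image back to offset 0x0 (erases the current firmware)
uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 \
    write_flash 0x0 original-flash.bin
```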

Next, write the firmware:

osakanataro@ubuntuserver:~/ats-mini$ ls ats-mini-v2.14
CHANGELOG.md  ats-mini.ino.bin  ats-mini.ino.bootloader.bin  ats-mini.ino.merged.bin  ats-mini.ino.partitions.bin
osakanataro@ubuntuserver:~/ats-mini$ ls -l ats-mini-v2.14
合計 8712
-rw-r--r-- 1 osakanataro osakanataro   11491  5月  6 12:46 CHANGELOG.md
-rw-r--r-- 1 osakanataro osakanataro  494048  5月  6 12:46 ats-mini.ino.bin
-rw-r--r-- 1 osakanataro osakanataro   20208  5月  6 12:46 ats-mini.ino.bootloader.bin
-rw-r--r-- 1 osakanataro osakanataro 8388608  5月  6 12:46 ats-mini.ino.merged.bin
-rw-r--r-- 1 osakanataro osakanataro    3072  5月  6 12:46 ats-mini.ino.partitions.bin
osakanataro@ubuntuserver:~/ats-mini$ uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 --before default_reset --after hard_reset write_flash  -z --flash_mode keep --flash_freq keep --flash_size keep 0x0 ats-mini-v2.14/ats-mini.ino.merged.bin
esptool.py v4.8.1
Serial port /dev/ttyACM0
Connecting...
Chip is ESP32-S3 (QFN56) (revision v0.2)
Features: WiFi, BLE, Embedded PSRAM 8MB (AP_3v3)
Crystal is 40MHz
MAC: fc:01:2c:cc:bd:08
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 921600
Changed.
Configuring flash size...
Flash will be erased from 0x00000000 to 0x007fffff...
Compressed 8388608 bytes to 299786...
Wrote 8388608 bytes (299786 compressed) at 0x00000000 in 20.2 seconds (effective 3315.7 kbit/s)...

A fatal error occurred: Packet content transfer stopped (received 0 bytes)
osakanataro@ubuntuserver:~/ats-mini$

Huh... a failure?
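A common esptool troubleshooting step for transfer errors on large images, untested here, is to retry the same write at a lower baud rate:

```shell
# same write as above, but at 115200 baud instead of 921600
uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 115200 \
    write_flash 0x0 ats-mini-v2.14/ats-mini.ino.merged.bin
```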

Retry using the other, split set of files instead:

osakanataro@ubuntuserver:~/ats-mini$ uvx --from esptool esptool.py --chip esp32s3 --port /dev/ttyACM0 --baud 921600 --before default_reset --after hard_reset write_flash  -z --flash_mode keep --flash_freq keep --flash_size keep 0x0 ats-mini-v2.14/ats-mini.ino.bootloader.bin 0x8000 ats-mini-v2.14/ats-mini.ino.partitions.bin 0x10000 ats-mini-v2.14/ats-mini.ino.bin
esptool.py v4.8.1
Serial port /dev/ttyACM0
Connecting...
Chip is ESP32-S3 (QFN56) (revision v0.2)
Features: WiFi, BLE, Embedded PSRAM 8MB (AP_3v3)
Crystal is 40MHz
MAC: fc:01:2c:cc:bd:08
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 921600
Changed.
Configuring flash size...
Flash will be erased from 0x00000000 to 0x00004fff...
Flash will be erased from 0x00008000 to 0x00008fff...
Flash will be erased from 0x00010000 to 0x00088fff...
Compressed 20208 bytes to 13058...
Wrote 20208 bytes (13058 compressed) at 0x00000000 in 0.2 seconds (effective 918.1 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 146...
Wrote 3072 bytes (146 compressed) at 0x00008000 in 0.0 seconds (effective 1196.3 kbit/s)...
Hash of data verified.
Compressed 494048 bytes to 277050...
Wrote 494048 bytes (277050 compressed) at 0x00010000 in 2.5 seconds (effective 1560.3 kbit/s)...
Hash of data verified.

Leaving...
Hard resetting via RTS pin...
osakanataro@ubuntuserver:~/ats-mini$

Write completed successfully.

Confirmed that it booted with FW 2.14.