- This is a general problem; I found it in different clusters (we use the Ocata release everywhere).
- Yes, I checked this point; load average and other resources are OK.
- Lol no :))) I really need to fix this problem.

By the way, I found a similar bugfix in libvirt: https://libvirt.org/news.html#v5-4-0-2019-06-03
Setting the scheduler for QEMU's main thread before QEMU had a chance to start up other threads was misleading as it would affect other threads (vCPU and I/O) as well. In some particular situations this could also lead to an error when the thread for vCPU #0 was being moved to its cpu,cpuacct cgroup. This was fixed so that the scheduler for the main thread is set after QEMU starts.
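To see which libvirt build the nova_libvirt container actually ships, a quick check is something like this (exact package names are a guess for a CentOS-based image):

$ docker exec nova_libvirt libvirtd --version
$ docker exec nova_libvirt rpm -q libvirt-daemon qemu-kvm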
I checked the OpenStack releases starting from Ocata and up: Ussuri uses CentOS 8 and libvirt 6.0.0, all other releases use CentOS 7 and libvirt 4.5.0. I plan to try updating the libvirt container and see if the problem is fixed.

On Tue, Nov 10, 2020 at 4:27 PM Khodayar Doustar <khodayard@gmail.com> wrote:
Hi Denis,
- Does this happen only on one server or is it a general problem among all compute nodes?
- Have you checked the load average, disk free and response time of your server? Sometimes these weird and intermittent problems happen when the server does not have enough disk space, process or memory resources.
- Have you tried putting this fabulous script of yours into a cron job, to be run e.g. every 4 hours? This may seem like a funny workaround but it can save a lot of time.
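For example, something like this in root's crontab (the script path is just a placeholder):

$ crontab -e
0 */4 * * * /usr/local/bin/fix-vm-cgroups.sh >> /var/log/fix-vm-cgroups.log 2>&1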
Good luck, Khodayar
On Mon, Nov 9, 2020 at 6:41 PM Denis Kadyshev <metajiji@gmail.com> wrote:
I have an OpenStack Ocata release deployed via Kolla.
Libvirtd runs inside the Docker container nova_libvirt, with the /sys/fs/cgroup and /run volumes mounted and privileged mode enabled.
Some guest VMs cannot provide cpu-stats.
Symptoms are:
$ docker exec -ti nova_libvirt virsh cpu-stats instance-000004cb
error: Failed to retrieve CPU statistics for domain 'instance-000004cb'
error: Requested operation is not valid: cgroup CPUACCT controller is not mounted
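The cpu,cpuacct hierarchy itself is presumably still mounted in the container (kolla bind-mounts /sys/fs/cgroup), which can be double-checked with something like:

$ docker exec nova_libvirt grep cpuacct /proc/mounts

so the error really points at the threads ending up in the wrong cgroup rather than a missing mount.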
To check the cgroups, first find all related PIDs:
$ ps fax | grep instance-000004cb
 8275 ?  Sl   4073:40 /usr/libexec/qemu-kvm -name guest=instance-000004cb
$ ps fax | grep 8275
 8346 ?  S      76:04  \_ [vhost-8275]
 8367 ?  S       0:00  \_ [kvm-pit/8275]
 8275 ?  Sl   4073:42 /usr/libexec/qemu-kvm
Cgroups for qemu-kvm (PID 8275):
$ cat /proc/8275/cgroup
11:blkio:/user.slice
10:devices:/user.slice
9:hugetlb:/docker/e5bef89178c1c3ae34fd2b4a9b86b299a6145c0b9f608a06e83f6f4ca4d897bd
8:cpuacct,cpu:/user.slice
7:perf_event:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope
6:net_prio,net_cls:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope
5:freezer:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope
4:memory:/user.slice
3:pids:/user.slice
2:cpuset:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope/emulator
1:name=systemd:/user.slice/user-0.slice/session-c1068.scope
For vhost-8275 (PID 8346):
$ cat /proc/8346/cgroup
11:blkio:/user.slice
10:devices:/user.slice
9:hugetlb:/docker/e5bef89178c1c3ae34fd2b4a9b86b299a6145c0b9f608a06e83f6f4ca4d897bd
8:cpuacct,cpu:/user.slice
7:perf_event:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope
6:net_prio,net_cls:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope
5:freezer:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope
4:memory:/user.slice
3:pids:/user.slice
2:cpuset:/machine.slice/machine-qemu\x2d25\x2dinstance\x2d000004cb.scope/emulator
1:name=systemd:/user.slice/user-0.slice/session-c1068.scope
For kvm-pit (PID 8367):
$ cat /proc/8367/cgroup
11:blkio:/user.slice
10:devices:/user.slice
9:hugetlb:/
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:net_prio,net_cls:/
5:freezer:/
4:memory:/user.slice
3:pids:/user.slice
2:cpuset:/
1:name=systemd:/user.slice/user-0.slice/session-c4807.scope
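A quicker way to see the cpu,cpuacct placement of every QEMU thread at once (8275 being the qemu-kvm PID from above) is something like:

$ for t in /proc/8275/task/*; do echo "$(basename $t): $(grep cpuacct $t/cgroup)"; done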
I tried to fix the cgroups with this script:
get_broken_vms() {
    # List VMs for which virsh cpu-stats fails
    docker exec nova_libvirt bash -c 'for vm in $(virsh list --name); do virsh cpu-stats $vm > /dev/null 2>&1 || echo $vm; done'
}

attach_vm_to_cgroup() {
    # Attach a process and all of its thread PIDs to the correct cgroup
    local vm_pid=$1; shift
    local vm_cgname=$1; shift

    echo "Fix cgroup for pid $vm_pid in cgroup $vm_cgname"

    for tpid in $(find /proc/$vm_pid/task/ -maxdepth 1 -mindepth 1 -type d -printf '%f\n'); do
        echo $tpid | tee /sys/fs/cgroup/{blkio,devices,perf_event,net_prio,net_cls,freezer,memory,pids,systemd}/machine.slice/$vm_cgname/tasks 1>/dev/null &
        echo $tpid | tee /sys/fs/cgroup/{cpu,cpuacct,cpuset}/machine.slice/$vm_cgname/emulator/tasks 1>/dev/null &
    done
}

for vm in $(get_broken_vms); do
    vm_pid=$(pgrep -f $vm)
    vm_vhost_pids=$(pgrep -x vhost-$vm_pid)
    vm_cgname=$(find /sys/fs/cgroup/systemd/machine.slice -maxdepth 1 -mindepth 1 -type d -name "machine-qemu\\\x2d*\\\x2d${vm/-/\\\\x2d}.scope" -printf '%f\n')

    echo "Working on vm: $vm pid: $vm_pid vhost_pid: $vm_vhost_pids cgroup_name: $vm_cgname"
    [ -z "$vm_pid" -o -z "$vm_cgname" ] || attach_vm_to_cgroup $vm_pid $vm_cgname

    # Fix vhost-NNNN kernel threads
    for vpid in $vm_vhost_pids; do
        [ -z "$vm_cgname" ] || attach_vm_to_cgroup $vpid $vm_cgname
    done
done
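The script is meant to run as root on the compute host itself (not inside the container), since it writes directly into the host's /sys/fs/cgroup, e.g. saved to a file and run as (filename arbitrary):

$ sudo bash ./fix-vm-cgroups.sh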
After the fix, all VMs successfully provided cpu-stats and other metrics, but after some hours the cgroups broke again.
Problems and symptoms:
- The cgroups are not broken on all VMs.
- I have not been able to find out what leads to this effect.
- If I restart a problem VM then, as expected, the cgroups are fixed, but after some hours they break again.
- If the cgroups are fixed by hand, cpu-stats works, but after some hours the cgroups break again.
What I have checked so far:
- logrotate - nothing
- cron - nothing
I also added audit rules for the cgroups:
auditctl -w '/sys/fs/cgroup/cpu,cpuacct/machine.slice' -p rwxa
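The matching events can then be pulled from the audit log with something like:

$ ausearch -f /sys/fs/cgroup/cpu,cpuacct/machine.slice -i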
And I found that only libvirtd processes write to the cgroups.
Any suggestions?