Bug #64997 (Open)
There is always an osd process that takes up high cpu
Updated by Radoslaw Zarzynski 2 months ago
- Status changed from New to Need More Info
Note from bugscrub: need a summary here.
Updated by cao yong 9 days ago
- File 11111.png added
- File clipboard-202405241516-pfz60.png added
- File clipboard-202405241517-qtxkf.png added
- File clipboard-202405241520-s5a5n.png added
- File clipboard-202405241521-1yzwa.png added
- File clipboard-202405241521-dxpc1.png added
Radoslaw Zarzynski wrote in #note-1:
Note from bugscrub: need a summary here.
Bug Report
For about 4 months now, there has always been one OSD process taking up high CPU at certain moments.
I found that the busy process is an admin_socket process belonging to the OSD pod (pid 2809092).
[root@sc-node-ceph-4 ~]# pstree -apscl 3360113
systemd,1 --switched-root --system --deserialize 31
└─containerd-shim,1229280 -namespace k8s.io -id 712f8293ab1ecd3d5cc0e576efad6ed9f4e943ccabf14834fd98241e3363a988 -address /run/containerd/containerd.sock
└─ceph-osd,2809092 --foreground --id 29 --fsid aa0e7bed-d3e3-49a7-a471-8e354bfe61f6 --setuser ceph --setgroup ceph --crush-location=root=default host=sc-node-ceph-4 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false
└─admin_socket,3360113 --foreground --id 29 --fsid aa0e7bed-d3e3-49a7-a471-8e354bfe61f6 --setuser ceph --setgroup ceph --crush-location=root=default host=sc-node-ceph-4 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false
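Since admin_socket appears under ceph-osd in the pstree output, it is presumably a thread of the OSD process rather than a separate binary. A minimal sketch of how one might confirm that, reusing the PIDs from above (3360113 = suspect, 2809092 = OSD):
[root@sc-node-ceph-4 ~]# cat /proc/3360113/comm                   # thread name; expected to print "admin_socket"
[root@sc-node-ceph-4 ~]# ls -d /proc/2809092/task/3360113         # exists only if 3360113 is a thread of the OSD
[root@sc-node-ceph-4 ~]# top -H -b -n 1 -p 2809092 | head -n 20   # per-thread CPU usage inside the OSD process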
After the OSD pod above is restarted, another OSD pod takes its place and occupies high CPU at some later moment; this happens over and over.
I also found that the admin_socket process belonging to the same OSD pod (pid 2809092) keeps restarting, i.e. its PID changes between observations.
[root@sc-node-ceph-4 ~]# pstree -apscl 3364029
systemd,1 --switched-root --system --deserialize 31
└─containerd-shim,1229280 -namespace k8s.io -id 712f8293ab1ecd3d5cc0e576efad6ed9f4e943ccabf14834fd98241e3363a988 -address /run/containerd/containerd.sock
└─ceph-osd,2809092 --foreground --id 29 --fsid aa0e7bed-d3e3-49a7-a471-8e354bfe61f6 --setuser ceph --setgroup ceph --crush-location=root=default host=sc-node-ceph-4 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false
└─admin_socket,3364029 --foreground --id 29 --fsid aa0e7bed-d3e3-49a7-a471-8e354bfe61f6 --setuser ceph --setgroup ceph --crush-location=root=default host=sc-node-ceph-4 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false
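To quantify how often the admin_socket thread is recreated, one could sample the OSD's threads over time. A sketch, assuming the sysstat package is available for pidstat:
[root@sc-node-ceph-4 ~]# pidstat -t -p 2809092 5                  # per-thread CPU every 5 s; watch the admin_socket TID
[root@sc-node-ceph-4 ~]# while true; do date; grep -l admin_socket /proc/2809092/task/*/comm; sleep 60; done   # log the admin_socket TID once a minute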
The status of the Ceph cluster is HEALTH_OK.
✘ ⚡ root@sc-master-1 ~/rooks kubectl -n rook-ceph exec -it rook-ceph-tools-7d4b5bb689-k5tvp -- /bin/bash
bash-4.4$ ceph -s
  cluster:
    id:     aa0e7bed-d3e3-49a7-a471-8e354bfe61f6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,e,j (age 7h)
    mgr: b(active, since 6w), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 36 osds: 36 up (since 7h), 36 in (since 3M)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 649 pgs
    objects: 37.40M objects, 16 TiB
    usage:   47 TiB used, 84 TiB / 131 TiB avail
    pgs:     647 active+clean
             2   active+clean+scrubbing+deep

  io:
    client: 7.9 KiB/s rd, 1.9 MiB/s wr, 3 op/s rd, 137 op/s wr
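Even with HEALTH_OK, the busy daemon can be interrogated directly. A sketch using standard Ceph commands from the toolbox pod (osd.29 is the daemon shown in the pstree output; which counters are worth watching is a guess):
bash-4.4$ ceph tell osd.29 perf dump | head -n 40     # runtime perf counters of the busy OSD
bash-4.4$ ceph tell osd.29 dump_historic_ops          # recent slow/expensive ops, if any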
Deviation from expected behavior:
There is always one OSD process that takes up high CPU and memory, and the affected OSD changes after each restart.
Expected behavior:
This strange admin_socket process should not occur, and it should not cause the OSD pod to keep restarting.
Environment:
OS (e.g. from /etc/os-release):
⚡ root@sc-master-1 ~/rooks/rook-1.12.10/deploy/examples cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"
Kernel (e.g. uname -a):
⚡ root@sc-master-1 ~/rooks/rook-1.12.10/deploy/examples uname -a
Linux sc-master-1 6.6.2-1.el9.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 20 12:18:26 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
Rook version (use rook version inside of a Rook Pod): v1.12.10
Storage backend version (e.g. for ceph do ceph -v):
bash-4.4$ ceph -v
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Kubernetes version (use kubectl version):
⚡ root@sc-master-1 ~/rooks/rook-1.12.10/deploy/examples kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.14", GitCommit:"a5967a3c4d0f33469b7e7798c9ee548f71455222", GitTreeState:"clean", BuildDate:"2023-09-13T09:12:09Z", GoVersion:"go1.20.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.14", GitCommit:"a5967a3c4d0f33469b7e7798c9ee548f71455222", GitTreeState:"clean", BuildDate:"2023-09-13T09:04:55Z", GoVersion:"go1.20.8", Compiler:"gc", Platform:"linux/amd64"}