Bug #64502
pacific/quincy/v18.2.0: client: ceph-fuse fails to unmount after upgrade to main
Added by Patrick Donnelly 3 months ago. Updated 4 days ago.
Description
Every ceph-fuse mount for quincy fails to unmount for reef->main:
2024-02-19T19:17:36.535 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi060.front.sepia.ceph.com... 2024-02-19T19:17:36.535 INFO:teuthology.orchestra.run:Running command with timeout 300 2024-02-19T19:17:36.535 DEBUG:teuthology.orchestra.run.smithi060:> sudo fusermount -u /home/ubuntu/cephtest/mnt.0 2024-02-19T19:17:36.562 INFO:teuthology.orchestra.run:waiting for 300
From: /teuthology/pdonnell-2024-02-19_18:28:45-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7566635/teuthology.log
2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 tick 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 collect_and_send_metrics 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 collect_and_send_global_metrics 2024-02-19T19:17:36.799+0000 7f9fa7fff640 1 -- 192.168.0.1:0/854663557 --> [v2:172.21.15.60:6826/3594652577,v1:172.21.15.60:6827/3594652577] -- client_metrics [client_metric_type: READ_LATENCY latency: 5.996942, avg_latency: 0.000330, sq_sum: 86627012816408144, count=17901][client_metric_type: WRITE_LATENCY latency: 23.710221, avg_latency: 0.000407, sq_sum: 1890169673992666112, count=56281][client_metric_type: METADATA_LATENCY latency: 238.430933, avg_latency: 0.005247, sq_sum: 13600282437617256448, count=45341][client_metric_type: CAP_INFO cap_hits: 831286 cap_misses: 14792 num_caps: 0][client_metric_type: DENTRY_LEASE dlease_hits: 67 dlease_misses: 154700 num_dentries: 0][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 1][client_metric_type: PINNED_ICAPS pinned_icaps: 1 total_inodes: 1][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 1][client_metric_type: READ_IO_SIZES total_ops: 22272 total_size: 3731108728][client_metric_type: WRITE_IO_SIZES total_ops: 56281 total_size: 4270138133] v1 -- 0x7f9fa000b9e0 con 0x5637e76f0e80 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 trim_cache size 1 max 16384 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 upkeep thread waiting interval 1.000000000s ... 2024-02-19T20:23:30.865+0000 7f9fc8e36480 2 client.14548 unmounting
From: /teuthology/pdonnell-2024-02-19_18:28:45-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7566635/remote/smithi060/log/ceph-client.0.log.gz
During teardown of the cluster the unmount eventually proceeds, but it's not clear what was blocking it. I think something was holding the RWRef, preventing unmount from proceeding.
Files
gdb.txt (13 KB) | backtrace of all threads in staggered upgrade scenario | Venky Shankar, 05/20/2024 11:29 AM
Updated by Patrick Donnelly 3 months ago
- Related to Bug #64503: client: log message when unmount call is received added
Updated by Patrick Donnelly 3 months ago
- Assignee deleted (Patrick Donnelly)
- Target version set to v19.1.0
- Source set to Development
- Backport set to squid,reef,quincy
Updated by Patrick Donnelly 3 months ago
- Related to Bug #64440: mds: reversed encoding of MDSMap max_xattr_size/bal_rank_mask v18.2.1 <-> main added
Updated by Patrick Donnelly 3 months ago
Unassigning myself to return to other high priority tasks.
This issue is only revealed by the fix for #64440, which turns on ceph-fuse mounts for fs:upgrade:mds_upgrade_sequence.
Updated by Patrick Donnelly 3 months ago
- Related to deleted (Bug #64440: mds: reversed encoding of MDSMap max_xattr_size/bal_rank_mask v18.2.1 <-> main)
Updated by Patrick Donnelly 3 months ago
- Related to Bug #64440: mds: reversed encoding of MDSMap max_xattr_size/bal_rank_mask v18.2.1 <-> main added
Updated by Patrick Donnelly 3 months ago
Also affects pacific and reef v18.2.0 (possibly v18.2.1 too):
See fs:upgrade:mds_upgrade_sequence tests failing with:
"reached maximum tries (51) after waiting for 300 seconds "
Updated by Patrick Donnelly 3 months ago
- Subject changed from client: quincy ceph-fuse fails to unmount after upgrade to main to pacific/quincy/v18.2.0: client: ceph-fuse fails to unmount after upgrade to main
Updated by Venky Shankar 3 months ago
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
- Target version changed from v19.1.0 to v20.0.0
Updated by Venky Shankar 3 months ago
I am working on this. The plan is to gdb the ceph-fuse process after fusermount. Will update by EOD today.
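A minimal sketch of that kind of session (the pgrep pattern and mountpoint are placeholders for the process on the test node, and debuginfo packages need to be installed for the breakpoint to resolve by name):

# find the ceph-fuse process backing the mount under test (placeholder pattern)
pid=$(pgrep -f 'ceph-fuse.*mnt.0' | head -n1)
# attach, break in the unmount path, then run "fusermount -u" from another shell
sudo gdb -p "$pid" -ex 'set pagination off' -ex 'break Client::_unmount' -ex 'continue'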
Updated by Venky Shankar 3 months ago
oh well, infra issues now :/
2024-02-26T07:01:58.420 INFO:teuthology.orchestra.run.smithi057.stderr:/bin/podman: stderr Error: initializing source docker://quay.io/ceph/ceph:v18.2.1: Requesting bearer token: invalid status code from registry 502 (Bad Gateway)
Updated by Venky Shankar 3 months ago
Trying my luck with this today - hopefully no infra issues show up.
Updated by Venky Shankar 3 months ago
Started running into
ceph: stderr Error: OCI runtime error: crun: bpf create ``: Invalid argument
again in
/a/vshankar-2024-02-28_06:39:32-fs:upgrade-main-testing-default-smithi/7575094
Checked with Ilya and looks like the run just picked up the kernel bits when it was getting redeployed. Will have to run the test again when the latest build trigger finishes.
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
Started running into
ceph: stderr Error: OCI runtime error: crun: bpf create ``: Invalid argument
again in
/a/vshankar-2024-02-28_06:39:32-fs:upgrade-main-testing-default-smithi/7575094
Checked with Ilya and looks like the run just picked up the kernel bits when it was getting redeployed. Will have to run the test again when the latest build trigger finishes.
The kclient related issue is fixed now. Tried a sample run yesterday - 32/32 jobs failed. Onto this today!
Updated by Venky Shankar 3 months ago
Finally, with the infra and kclient issues set aside, I was able to gdb the ceph-fuse process and add a breakpoint in Client::_unmount(). To my surprise, the function was never invoked when fusermount was run (the gdb setup had set follow-fork-mode child, and I did try this locally before doing it on the smithi machines). This is weird. I'll see what further I can come up with.
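For reference, a sketch of the fork-related gdb settings involved (these are plain gdb semantics, nothing Ceph-specific; which combination is appropriate depends on whether ceph-fuse forks before the breakpoint is reached):

(gdb) set follow-fork-mode child    # stay on the child after a fork/vfork
(gdb) set detach-on-fork off        # keep both processes attached ("on" releases the non-followed one)
(gdb) break Client::_unmount
(gdb) continue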
Updated by Venky Shankar 3 months ago
Another debug session with set detach-on-fork on, which is supposed to let gdb debug both parent and child processes. Had a breakpoint in Client::_unmount:
[New LWP 82130]
[New LWP 82141]
[New LWP 82143]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fada1e9c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install ceph-fuse-17.2.7-560.g931b9f17.el9.x86_64
(gdb) set detach-on-fork on
(gdb) b Client::_unmount
Breakpoint 1 at 0x563ac1a8fa00
(gdb) c
Continuing.
fusermount was invoked at timestamp:
2024-03-04T16:50:24.033 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi003.front.sepia.ceph.com... 2024-03-04T16:50:24.033 DEBUG:teuthology.orchestra.run.smithi003:> sudo fusermount -u /home/ubuntu/cephtest/mnt.0
However, the breakpoint was triggered only during cluster teardown. I noted the timestamp at that point.
Teardown:
2024-03-04T16:55:23.199 ERROR:tasks.cephfs.fuse_mount:process failed to terminate after unmount. This probably indicates a bug within ceph-fuse.
2024-03-04T16:55:23.199 ERROR:teuthology.run_tasks:Manager failed: ceph-fuse
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_b1dac5519c57c269ea98a22fb7729016a1d0e4d2/teuthology/run_tasks.py", line 154, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_vshankar_ceph_9822b9f8f96d88d1e7441f32961ead3c521c675e/qa/tasks/ceph_fuse.py", line 183, in task
    mount.umount_wait()
  File "/home/teuthworker/src/github.com_vshankar_ceph_9822b9f8f96d88d1e7441f32961ead3c521c675e/qa/tasks/cephfs/fuse_mount.py", line 403, in umount_wait
    run.wait([self.fuse_daemon], timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_b1dac5519c57c269ea98a22fb7729016a1d0e4d2/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_teuthology_b1dac5519c57c269ea98a22fb7729016a1d0e4d2/teuthology/contextutil.py", line 134, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (51) after waiting for 300 seconds
gdb breakpoint:
[Detaching after vfork from child process 128806]
[Detaching after vfork from child process 128807]
[Detaching after vfork from child process 129019]
[Detaching after vfork from child process 129043]
[Thread 0x7fad81600640 (LWP 82141) exited]
[Thread 0x7fad6d600640 (LWP 82143) exited]
[Thread 0x7fad63400640 (LWP 82130) exited]
[Thread 0x7fad79600640 (LWP 82083) exited]
[Thread 0x7fad7aa00640 (LWP 82082) exited]
[Thread 0x7fad6ea00640 (LWP 82081) exited]
[Thread 0x7fad6e000640 (LWP 82079) exited]
[Thread 0x7fad78c00640 (LWP 82078) exited]
[Thread 0x7fad63e00640 (LWP 82025) exited]
[Thread 0x7fad6cc00640 (LWP 82014) exited]
Thread 1 "ceph-fuse" hit Breakpoint 1, 0x0000563ac1a8fa00 in Client::_unmount(bool) ()
(gdb) n
Single stepping until exit from function _ZN6Client8_unmountEb, which has no line number information.
[Thread 0x7fad82a00640 (LWP 76392) exited]
[Thread 0x7fad8b400640 (LWP 76385) exited]
0x0000563ac1a2fbc3 in main ()
(gdb) bt
#0  0x0000563ac1a2fbc3 in main ()
(gdb) q
A debugging session is active.
  Inferior 1 [process 76351] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/bin/ceph-fuse, process 76351
[Inferior 1 (process 76351) detached]
[vshankar@smithi003 ~]$ date
Mon Mar 4 04:57:52 PM UTC 2024
Updated by Venky Shankar 3 months ago
Continuing on this today - fusermount(1) is basically invoking umount2(2). Will try to see what's going on.
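A quick way to confirm that from the client node (a sketch; the mountpoint is the one used by these QA runs):

# umount2() should show up in the trace if fusermount really issues it
sudo strace -f -e trace=umount2 fusermount -u /home/ubuntu/cephtest/mnt.0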
Updated by Venky Shankar 3 months ago
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Scrap that. The test branch had a change[*] that caused the MDS to assert. But I do have some findings to share. I introduced a custom qa change to skip the fusermount and add a long enough sleep so that the fuse mount could be debugged.
-        log.info('Unmounting ceph-fuse clients...')
-
-        for info in mounted_by_me.values():
-            # Conditional because an inner context might have umounted it
-            mount = info["mount"]
-            if mount.is_mounted():
-                mount.umount_wait()
-        for remote in remotes:
-            FuseMount.cleanup_stale_netnses_and_bridge(remote)
+        log.info('Not unmounting ceph-fuse clients (hack)...')
+        time.sleep(3600)
+
+        #for info in mounted_by_me.values():
+        #    # Conditional because an inner context might have umounted it
+        #    mount = info["mount"]
+        #    if mount.is_mounted():
+        #        mount.umount_wait()
+        #for remote in remotes:
+        #    FuseMount.cleanup_stale_netnses_and_bridge(remote)
Sample run here with main branch: https://pulpito.ceph.com/vshankar-2024-03-11_05:30:10-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7590762/ which failed, but more importantly I was able to ssh to the node which had the fuse mount.
[vshankar@smithi132 ~]$ findmnt ... ... └─/home/ubuntu/cephtest/mnt.0 ceph-fuse fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other [vshankar@smithi132 ~]$ sudo fusermount -u /home/ubuntu/cephtest/mnt.0 [vshankar@smithi132 ~]$ findmnt <ceph-fuse mount gone>
... and likewise with the other ceph-fuse mount on smithi187. Client logs also confirm the same.
Updated by Patrick Donnelly 3 months ago
Venky Shankar wrote:
Venky Shankar wrote:
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Scrap that. The test branch had a change[*] that caused the MDS to assert. But I do have some findings to share. I introduced a custom qa change to skip the fusermount and add a long enough sleep so that the fuse mount could be debugged.
[...]
Sample run here with main branch: https://pulpito.ceph.com/vshankar-2024-03-11_05:30:10-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7590762/ which failed, but more importantly I was able to ssh to the node which had the fuse mount.
[...]
... and likewise with the other ceph-fuse mount on smithi187. Client logs also confirm the same.
This is the worst kind of bug. I don't have a great suggestion beyond stating that the obvious yet highly improbable explanation is that the QA suite executing `fusermount -u ...` is clearly not actually running the command. It doesn't make sense but something odd is going on. Perhaps it's a user permission (fusermount uid != ceph-fuse uid) issue.
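A couple of quick checks for the permission theory (a sketch; the mountpoint and client id are the ones used by these QA runs):

# uid of the running ceph-fuse process vs. the uid recorded on the mount
ps -o pid,uid,user,args -C ceph-fuse
findmnt -o TARGET,FSTYPE,OPTIONS /home/ubuntu/cephtest/mnt.0   # look at user_id=/group_id=
# note: teuthology runs fusermount under sudo, so the usual "only the mounting
# user may unmount" restriction should not apply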
Updated by Venky Shankar 3 months ago
Patrick Donnelly wrote:
Venky Shankar wrote:
Venky Shankar wrote:
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Scrap that. The test branch had a change[*] that caused the MDS to assert. But I do have some findings to share. I introduced a custom qa change to skip the fusermount and add a long enough sleep so that the fuse mount could be debugged.
[...]
Sample run here with main branch: https://pulpito.ceph.com/vshankar-2024-03-11_05:30:10-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7590762/ which failed, but more importantly I was able to ssh to the node which had the fuse mount.
[...]
... and likewise with the other ceph-fuse mount on smithi187. Client logs also confirm the same.
This is the worst kind of bug. I don't have a great suggestion beyond stating that the obvious yet highly improbable explanation is that the QA suite executing `fusermount -u ...` is clearly not actually running the command. It doesn't make sense but something odd is going on. Perhaps it's a user permission (fusermount uid != ceph-fuse uid) issue.
I strace'd the fusermount call and the process exited normally (exitcode = 0) and the umount2(2) call returned 0 (success).
Updated by Milind Changire 2 months ago
Updated by Venky Shankar 2 months ago
This issue is likely related to the libfuse library in centos.9. To confirm, I'll run the tests with centos.8 and verify. Will report in a day or two.
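For reference, the distro can be pinned when scheduling the run (a sketch of a teuthology-suite invocation; the suite/branch values are the ones from this tracker, and the exact flags may differ between teuthology versions):

teuthology-suite -m smithi -c main -s fs:upgrade:mds_upgrade_sequence \
  --distro centos --distro-version 8.stream -p 100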
Updated by Venky Shankar 2 months ago
centos8.stream run w/ main branch: https://pulpito.ceph.com/vshankar-2024-04-01_04:53:22-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/
Updated by Venky Shankar 2 months ago
- Related to Bug #61265: qa: tasks.cephfs.fuse_mount:process failed to terminate after unmount added
Updated by Venky Shankar about 2 months ago
Venky Shankar wrote:
centos8.stream run w/ main branch: https://pulpito.ceph.com/vshankar-2024-04-01_04:53:22-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/
This run has cephadm upgrade issues.
Updated by Laura Flores about 2 months ago
/a/yuriw-2024-04-09_14:58:21-upgrade-wip-yuri4-testing-2024-04-08-1432-distro-default-smithi/7648857
Updated by Venky Shankar about 2 months ago · Edited
I have a custom quincy branch (patched with debug in ceph-fuse/fuse_ll). That should give us enough debug to see what's happening.
EDIT: https://shaman.ceph.com/builds/ceph/wip-64502-quincy-2/
Updated by Venky Shankar about 2 months ago
diff --git a/src/ceph_fuse.cc b/src/ceph_fuse.cc
index 3fa5346b463..061d9f576e4 100644
--- a/src/ceph_fuse.cc
+++ b/src/ceph_fuse.cc
@@ -331,14 +331,19 @@ int main(int argc, const char **argv, const char *envp[]) {
     cerr << "ceph-fuse[" << getpid() << "]: starting fuse" << std::endl;
     tester.init(cfuse, client);
     tester.create("tester");
+    cout << "ceph_fuse: starting fuse loop" << std::endl;
     r = cfuse->loop();
+    cout << "ceph_fuse: fuse loop returned - joining" << std::endl;
     tester.join(&tester_rp);
+    cout << "ceph_fuse: fuse loop returned - joined" << std::endl;
     tester_r = static_cast<int>(reinterpret_cast<uint64_t>(tester_rp));
     cerr << "ceph-fuse[" << getpid() << "]: fuse finished with error " << r
          << " and tester_r " << tester_r <<std::endl;
 
   out_client_unmount:
+    cout << "calling unmount" << std::endl;
     client->unmount();
+    cout << "calling finalize" << std::endl;
     cfuse->finalize();
   out_shutdown:
     icp.stop();
diff --git a/src/client/fuse_ll.cc b/src/client/fuse_ll.cc
index 109aa25094d..765fdd7ffc7 100644
--- a/src/client/fuse_ll.cc
+++ b/src/client/fuse_ll.cc
@@ -1612,21 +1612,31 @@ int CephFuse::Handle::loop()
   auto fuse_multithreaded = client->cct->_conf.get_val<bool>(
     "fuse_multithreaded");
   if (fuse_multithreaded) {
+    derr << "multithreaded fuse" << dendl;
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 1)
     {
       struct fuse_loop_config conf = {
        clone_fd: opts.clone_fd,
        max_idle_threads: opts.max_idle_threads
       };
-      return fuse_session_loop_mt(se, &conf);
+      int r = fuse_session_loop_mt(se, &conf);
+      derr << "fuse_session_loop_mt 1 -- done" << dendl;
+      return r;
     }
 #elif FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
-    return fuse_session_loop_mt(se, opts.clone_fd);
+    int r = fuse_session_loop_mt(se, opts.clone_fd);
+    derr << "fuse_session_loop_mt 2 -- done" << dendl;
+    return r;
 #else
-    return fuse_session_loop_mt(se);
+    int r = fuse_session_loop_mt(se);
+    derr << "fuse_session_loop_mt 3 -- done" << dendl;
+    return r;
 #endif
   } else {
-    return fuse_session_loop(se);
+    derr << "singlethreaded fuse" << dendl;
+    int r = fuse_session_loop(se);
+    derr << "fuse_session_loop_mt 4 -- done" << dendl;
+    return r;
   }
 }
And patched up the yaml to use the custom quincy build to upgrade to. Here is the run: https://pulpito.ceph.com/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889/
Interestingly
vshankar@teuthology:/a/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889$ egrep "ceph_fuse\:|fusermount|unmount|finalize|fuse_session_loop_mt" teuthology.log 2024-04-12T13:38:54.194 INFO:tasks.ceph_fuse:Running ceph_fuse task... 2024-04-12T13:38:54.195 INFO:tasks.ceph_fuse:config is {'client.0': {}, 'client.1': {}} 2024-04-12T13:38:54.195 INFO:tasks.ceph_fuse:client.0 config is {} 2024-04-12T13:38:54.195 INFO:tasks.ceph_fuse:client.1 config is {} 2024-04-12T13:38:54.522 INFO:tasks.ceph_fuse:Mounting ceph-fuse clients... 2024-04-12T13:39:01.752 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:ceph_fuse: starting fuse loop 2024-04-12T13:39:10.578 INFO:tasks.cephfs.fuse_mount.ceph-fuse.1.smithi110.stdout:ceph_fuse: starting fuse loop 2024-04-12T13:56:40.623 INFO:tasks.ceph_fuse:Unmounting ceph-fuse clients... 2024-04-12T13:56:45.440 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi060.front.sepia.ceph.com... 2024-04-12T13:56:45.440 DEBUG:teuthology.orchestra.run.smithi060:> sudo fusermount -u /home/ubuntu/cephtest/mnt.0 2024-04-12T14:01:44.607 ERROR:tasks.cephfs.fuse_mount:process failed to terminate after unmount. This probably indicates a bug within ceph-fuse. 2024-04-12T14:02:51.770 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:ceph_fuse: fuse loop returned - joining 2024-04-12T14:02:51.770 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:ceph_fuse: fuse loop returned - joined 2024-04-12T14:02:51.771 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stderr:2024-04-12T14:02:51.768+0000 7fd910abf300 -1 fuse_session_loop_mt 3 -- done 2024-04-12T14:02:51.771 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:calling unmount 2024-04-12T14:03:21.770 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:calling finalize vshankar@teuthology:/a/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889$
fusermount had no effect at all on the ceph-fuse daemon. One would expect fuse_session_loop_mt() to unblock and fuse_ll.cc::loop() to return. It's only when the cluster teardown was under way that the relevant debug logs are seen.
I'm going to run this again with the fuse_debug client option enabled - that adds a -d argument to the daemon and should give us debugging logs from libfuse (at least the config option help says so).
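For a manual reproduction outside teuthology, the same option can be injected on the command line (a sketch, assuming the generic --<option>=<value> config override that ceph daemons accept; the mountpoint is a placeholder):

# fuse_debug=true passes -d through to libfuse so every FUSE request/reply is printed
sudo ceph-fuse -f --id 0 --fuse_debug=true /mnt/cephfs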
Updated by Patrick Donnelly about 2 months ago
Venky Shankar wrote in #note-31:
[...]
And patched up the yaml to use the custom quincy build to upgrade to. Here is the run: https://pulpito.ceph.com/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889/
Interestingly
[...]
fusermount had no effect at all on the ceph-fuse daemon. One would expect fuse_session_loop_mt() to unblock and fuse_ll.cc::loop() to return. It's only when the cluster teardown was under way that the relevant debug logs are seen.
I'm going to run this again with the fuse_debug client option enabled - that adds a -d argument to the daemon and should give us debugging logs from libfuse (at least the config option help says so).
Another avenue to pursue is to gdb attach the ceph-fuse process and see what the fuse dispatcher threads are doing. It's possible there is a bug in the Client blocking one or more of those threads and thereby preventing the fuse loop function from returning; I suspect libfuse wants to join all of its dispatcher threads before it returns from its main loop. I looked at the client log but didn't see anything which would indicate why cluster teardown would prompt the libfuse loop to finally terminate. There are a lot of connection refused messages for reconnects to the mons. That's about it.
Updated by Venky Shankar about 2 months ago
OK. So this bug has upgrades written all over it - it seemed obvious given that this is an upgrade task, but we were trying to see what's going on with fuse. After some test iterations, it's now clear that the cluster upgrade is somehow causing this blowup. I patched up mds_upgrade_sequence to not upgrade, and the umount hang isn't seen. So, reconnects are involved and I think this has to do with more than just the MDS (read: monitors).
Updated by Venky Shankar about 2 months ago
Venky Shankar wrote in #note-33:
OK. So this bug has upgrades written all over it - it seemed obvious given that this is an upgrade task, but we were trying to see what's going on with fuse. After some test iterations, it's now clear that the cluster upgrade is somehow causing this blowup. I patched up mds_upgrade_sequence to not upgrade, and the umount hang isn't seen. So, reconnects are involved and I think this has to do with more than just the MDS (read: monitors).
Oh, one more thing - in a different test iteration I patched up the upgrade yaml to not run the workload-task (which runs fsstress) but still let the cluster upgrade proceed. So, a simple cluster upgrade without any workload is enough to trigger this umount hang.
Updated by Venky Shankar about 1 month ago
- Related to Bug #50260: pacific: qa: "rmdir: failed to remove '/home/ubuntu/cephtest': Directory not empty" added
Updated by Patrick Donnelly about 1 month ago
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
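One way to check for such a stuck request without gdb (a sketch, assuming the admin socket path these QA mounts are started with; substitute the actual $cluster-$name.$pid asok file) is to query the client's admin socket while the unmount is hanging:

# in-flight MDS requests and session state on the stuck client
sudo ceph daemon /var/run/ceph/ceph-client.0.*.asok mds_requests
sudo ceph daemon /var/run/ceph/ceph-client.0.*.asok mds_sessions
sudo ceph daemon /var/run/ceph/ceph-client.0.*.asok status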
Updated by Venky Shankar about 1 month ago
Patrick Donnelly wrote in #note-36:
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
That should be relatively easy to figure out with the reduced log noise from dropping the workload-task and just upgrading the MDS with cephadm orch upgrade start ... --daemon-types mds (it's just that I've been unsuccessful in getting a test run done in teuthology).
Updated by Venky Shankar about 1 month ago
Venky Shankar wrote in #note-37:
Patrick Donnelly wrote in #note-36:
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
That should be relatively easy to figure out with the reduced log noise from dropping the workload-task and just upgrading the MDS with cephadm orch upgrade start ... --daemon-types mds (it's just that I've been unsuccessful in getting a test run done in teuthology).
Scrap that. This doesn't work - cephadm enforces an upgrade order:
2024-04-30T10:08:54.784 INFO:teuthology.orchestra.run.smithi037.stderr:NOTE: Enforced upgrade order is: mgr -> mon -> crash -> osd -> mds -> rgw -> rbd-mirror -> cephfs-mirror -> iscsi -> nfs
Anyway, I'll read the code carefully to see what's going on.
Updated by Venky Shankar about 1 month ago
Patrick Donnelly wrote in #note-36:
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
The only messages the client is exchanging with the MDS around the time the MDS upgrade is initiated are the metrics messages:
2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 tick 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 collect_and_send_metrics 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 collect_and_send_global_metrics 2024-04-30T11:23:07.583+0000 7f5cc3400640 1 -- 192.168.0.1:0/653481002 --> [v2:172.21.15.37:6826/2024972561,v1:172.21.15.37:6827/2024972561] -- client_metrics [client_metric_type: READ_LATENCY latency: 0.000000, avg_latency: 0.000000, sq_sum: 0, count=0][client_metric_type: WRITE_LATENCY latency: 0.000000, avg_latency: 0.000000, sq_sum: 0, count=0][client_metric_type: METADATA_LATENCY latency: 0.004497, avg_latency: 0.001499, sq_sum: 1961013495640, count=3][client_metric_type: CAP_INFO cap_hits: 348 cap_misses: 3 num_caps: 0][client_metric_type: DENTRY_LEASE dlease_hits: 0 dlease_misses: 0 num_dentries: 0][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 1][client_metric_type: PINNED_ICAPS pinned_icaps: 1 total_inodes: 1][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 1][client_metric_type: READ_IO_SIZES total_ops: 0 total_size: 0][client_metric_type: WRITE_IO_SIZES total_ops: 0 total_size: 0] v1 -- 0x7f5cb4004500 con 0x7f5cbc01da30 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 trim_cache size 0 max 16384 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 upkeep thread waiting interval 1.000000000s
I won't be surprised if this bug is a foul-up of that. That's my debug point for now.
Updated by Venky Shankar about 1 month ago
It's not the metrics stuff but a bug when the MDS sends back a client_session(open) during client reconnect (post MDS upgrade). There is code that signals mount_cond, and that's not executed in this case. I'm testing out a possible fix.
Updated by Patrick Donnelly about 1 month ago
Venky Shankar wrote in #note-40:
It's not the metrics stuff but a bug when the MDS sends back a client_session(open) during client reconnect (post MDS upgrade). There is code that signals mount_cond, and that's not executed in this case. I'm testing out a possible fix.
Excellent, looking forward to figuring this out =)
Updated by Venky Shankar 17 days ago
Patrick Donnelly wrote in #note-41:
Venky Shankar wrote in #note-40:
It's not the metrics stuff but a bug when the MDS sends back a client_session(open) during client reconnect (post MDS upgrade). There is code that signals mount_cond, and that's not executed in this case. I'm testing out a possible fix.
Excellent, looking forward to figuring this out =)
Back on this soon. Teuthology acted up when I planned to test the fix.
Updated by Venky Shankar 13 days ago · Edited
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-octopus.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-pacific.yaml similarity index 82% rename from qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-octopus.yaml rename to qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-pacific.yaml index e7774423fa5..b74accc69f0 100644 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-octopus.yaml +++ b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-pacific.yaml @@ -1,9 +1,9 @@ meta: - desc: | - install ceph/octopus latest + install ceph/pacific latest tasks: - install: - branch: octopus + branch: pacific exclude_packages: - librados3 - ceph-mgr-dashboard @@ -13,7 +13,7 @@ tasks: - cephadm - ceph-volume extra_packages: ['librados2'] -- print: "**** done installing octopus" +- print: "**** done installing pacific" - ceph: log-ignorelist: - overall HEALTH_ @@ -35,5 +35,5 @@ tasks: ms bind msgr2: false - exec: osd.0: - - ceph osd set-require-min-compat-client octopus + - ceph osd set-require-min-compat-client pacific - print: "**** done ceph" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml index 976d6e2650b..55fd27c50d0 100644 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml +++ b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml @@ -1,8 +1,3 @@ tasks: - ceph-fuse: -- print: "**** done octopus client" -- workunit: - clients: - all: - - suites/fsstress.sh -- print: "**** done fsstress" +- print: "**** done pacific client" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml index 26c185946dd..00e82d87151 100644 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml +++ b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml @@ -22,7 +22,7 @@ tasks: - install.upgrade: # upgrade the single cluster node, which is running all the mon/mds/osd/mgr daemons mon.a: - branch: quincy + branch: reef - print: "**** done install.upgrade the host" - ceph.restart: daemons: [mon.*, mgr.*] @@ -42,7 +42,7 @@ tasks: mon.a: - ceph osd dump -f json-pretty - ceph versions - - ceph osd require-osd-release quincy + - ceph osd require-osd-release reef - for f in `ceph osd pool ls` ; do ceph osd pool set $f pg_autoscale_mode off ; done - ceph.healthy: - print: "**** done ceph.restart" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/.qa b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/.qa deleted file mode 120000 index a602a0353e7..00000000000 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/.qa +++ /dev/null @@ -1 +0,0 @@ -../.qa/ \ No newline at end of file diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/no.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/no.yaml deleted file mode 100644 index b495eb41b54..00000000000 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/no.yaml +++ /dev/null @@ -1,6 +0,0 @@ -tasks: -- workunit: - clients: - all: - - suites/fsstress.sh -- print: "**** done fsstress" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/quincy.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/quincy.yaml 
deleted file mode 100644 index 138d8f4e290..00000000000 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/quincy.yaml +++ /dev/null @@ -1,12 +0,0 @@ -overrides: - ceph: - log-ignorelist: - - missing required features -tasks: -- exec: - mon.a: - - ceph fs dump --format=json-pretty - - ceph fs required_client_features cephfs add metric_collect -- sleep: - duration: 5 -- fs.clients_evicted:
The umount unresponsiveness isn't seen and the test runs successfully. The difference between this and mds_upgrade_sequence is the way the upgrade is performed. old_client uses the install.upgrade task (since it basically tests upgrading from octopus, for which cephadm isn't supported) to update packages for mon/mgr/osd/mds (qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml) and then restarts the daemons, while mds_upgrade_sequence performs a staggered upgrade of the ceph daemons.
The cephfs client in the staggered upgrade scenario has multiple client_reconnect messages - one for each reconnect when an MDS is failed over (on upgrade):
2024-05-17T17:59:47.189+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.163:6828/2124582105,v1:172.21.15.163:6829/2124582105] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x5584ab952340 2024-05-17T18:00:10.755+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.188:6824/2400944679,v1:172.21.15.188:6825/2400944679] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x7f0e440117f0 2024-05-17T18:00:32.041+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.188:6826/2689927205,v1:172.21.15.188:6827/2689927205] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x7f0e440142e0 2024-05-17T18:00:54.079+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.163:6826/3987377094,v1:172.21.15.163:6827/3987377094] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x7f0e44010fe0
While in the other case (old_client) there is just a single reconnect (since the daemons are restarted after installing new packages):
2024-05-18T18:21:40.110+0000 7fe54cc00700 1 -- 172.21.15.64:0/902497327 --> v1:172.21.15.39:6819/529060055 -- client_reconnect(1 caps 1 realms ) v5 -- 0x7fe55000a920 con 0x55f38a96d130
I was able to gdb into the ceph-fuse process in the staggered upgrade scenario (normally the backtraces weren't helpful since the frames weren't descriptive, esp. when examining frames in the pthread library), but this time I captured the frames (attached log). The interesting part here is:
Thread 25 (Thread 0x7fb72e000700 (LWP 98680)):
#0  0x00007fb75592f0aa in splice () from /lib64/libc.so.6
#1  0x00007fb762215a05 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#2  0x00007fb7622143e3 in fuse_do_work () from /lib64/libfuse.so.2
#3  0x00007fb7578081ca in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb755839e73 in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7fb72ea00700 (LWP 98678)):
#0  0x00007fb75592f0aa in splice () from /lib64/libc.so.6
#1  0x00007fb762215a05 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#2  0x00007fb7622143e3 in fuse_do_work () from /lib64/libfuse.so.2
#3  0x00007fb7578081ca in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb755839e73 in clone () from /lib64/libc.so.6
and obviously the main fuse loop:
Thread 1 (Thread 0x7fb7629333c0 (LWP 98627)):
#0  0x00007fb757810da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fb757810e98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fb762214698 in fuse_session_loop_mt () from /lib64/libfuse.so.2
#3  0x0000557df369b2ae in main ()
It does look like a couple of fuse threads are blocked on request completion. Nothing really suspicious in the client log apart from one small bit:
In the old_client case, just before reconnect, there is a _ll_getattr request that does not find the required caps to satisfy the request, so the client tries to contact the MDS (_getattr mask pAsLsXsFs issued=0):
2024-05-18T18:21:39.098+0000 7fe536000700 8 client.4430 _ll_getattr 0x1.head 2024-05-18T18:21:39.098+0000 7fe536000700 10 client.4430 _getattr mask pAsLsXsFs issued=0 2024-05-18T18:21:39.098+0000 7fe536000700 20 client.4430 choose_target_mds starting with req->inode 0x1.head(faked_ino=0 nref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2024-05-18T18:18:51.843703+0000 mtime=2024-05-18T18:18:51.843703+0000 ctime=202 4-05-18T18:19:15.738763+0000 change_attr=1 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7fe538007f30) 2024-05-18T18:21:39.098+0000 7fe536000700 20 client.4430 choose_target_mds 0x1.head(faked_ino=0 nref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2024-05-18T18:18:51.843703+0000 mtime=2024-05-18T18:18:51.843703+0000 ctime=2024-05-18T18:19:15.738763+0 000 change_attr=1 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7fe538007f30) is_hash=0 hash=0 2024-05-18T18:21:39.098+0000 7fe536000700 10 client.4430 choose_target_mds from caps on inode 0x1.head(faked_ino=0 nref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2024-05-18T18:18:51.843703+0000 mtime=2024-05-18T18:18:51.843703+0000 ctime=2024-05-1 8T18:19:15.738763+0000 change_attr=1 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7fe538007f30) 2024-05-18T18:21:39.098+0000 7fe536000700 20 client.4430 mds is 0 2024-05-18T18:21:39.098+0000 7fe536000700 10 client.4430 target mds.0 not active, waiting for new mdsmap
When the MDS is back up:active, this (waiting for mdsmap) request is kicked. However, in the staggered upgrade scenario, _ll_getattr always finds the required caps:
2024-05-17T17:59:46.188+0000 7f0e37e00700 8 client.14540 _ll_getattr 0x1.head 2024-05-17T17:59:46.188+0000 7f0e37e00700 10 client.14540 _getattr mask pAsLsXsFs issued=1 2024-05-17T17:59:46.188+0000 7f0e37e00700 10 client.14540 fill_stat on 0x1 snap/devhead mode 041777 mtime 2024-05-17T17:50:26.136511+0000 ctime 2024-05-17T17:51:12.209018+0000 2024-05-17T17:59:46.188+0000 7f0e37e00700 3 client.14540 ll_getattr 0x1.head = 0 2024-05-17T17:59:46.189+0000 7f0e37e00700 8 client.14540 _ll_forget 0x1 1 2024-05-17T17:59:46.189+0000 7f0e40c00700 8 client.14540 _ll_getattr 0x1.head 2024-05-17T17:59:46.189+0000 7f0e40c00700 10 client.14540 _getattr mask pAsLsXsFs issued=1 2024-05-17T17:59:46.189+0000 7f0e40c00700 10 client.14540 fill_stat on 0x1 snap/devhead mode 041777 mtime 2024-05-17T17:50:26.136511+0000 ctime 2024-05-17T17:51:12.209018+0000 2024-05-17T17:59:46.189+0000 7f0e40c00700 3 client.14540 ll_getattr 0x1.head = 0 2024-05-17T17:59:46.189+0000 7f0e40c00700 8 client.14540 _ll_forget 0x1 1
This sounds irrelevant (since the client can always be missing some capabilities), but this (issued=1) is always true just before the client_reconnect message is initiated, and I've seen this in multiple runs.
I'll link two fresh runs (old_client and mds_upgrade_sequence) once they finish, for reference and just to be sure the above observation is somewhere relevant. We may be running into a bug where some requests do not get kicked and then the fuse worker threads are blocked on them.
Updated by Venky Shankar 13 days ago
https://tracker.ceph.com/issues/64502#note-43 was updated when I was still updating it. Please re-read the update in the tracker rather than in email.
Updated by Venky Shankar 13 days ago
/a/vshankar-2024-05-20_11:13:32-fs:upgrade:mds_upgrade_sequence-reef-release-testing-default-smithi/7715229 is currently available for 90m for gdb'ing ceph-fuse
ps -eaf | grep ceph-fuse ubuntu 73853 28031 0 11:42 ? 00:00:00 bash -c (cd /home/ubuntu/cephtest && exec sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket '/var/run/ceph/$cluster-$name.$pid.asok' /home/ubuntu/cephtest/mnt.0 --id 0) root 73896 73853 0 11:42 ? 00:00:00 sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 73900 73896 0 11:42 ? 00:00:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 73912 73900 0 11:42 ? 00:00:00 /usr/bin/python3 /bin/daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 73929 73912 0 11:42 ? 00:00:01 ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 vshankar 145434 142240 0 12:17 pts/0 00:00:00 grep --color=auto ceph-fuse
The two fuse threads that are probably waiting for a reply are blocked in splice():
-- 73979 --
[<0>] fuse_dev_do_read.isra.17.constprop.20+0x942/0x970 [fuse]
[<0>] fuse_dev_splice_read+0xb0/0x250 [fuse]
[<0>] splice_file_to_pipe+0x78/0x80
[<0>] do_splice+0x5b0/0x890
[<0>] __do_splice+0xca/0x180
[<0>] __x64_sys_splice+0x99/0x110
[<0>] do_syscall_64+0x7f/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x71/0x79
-- 73980 --
[<0>] fuse_dev_do_read.isra.17.constprop.20+0x942/0x970 [fuse]
[<0>] fuse_dev_splice_read+0xb0/0x250 [fuse]
[<0>] splice_file_to_pipe+0x78/0x80
[<0>] do_splice+0x5b0/0x890
[<0>] __do_splice+0xca/0x180
[<0>] __x64_sys_splice+0x99/0x110
[<0>] do_syscall_64+0x7f/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x71/0x79
An ideal approach would be to get hold of the CephFuse *cfuse from the main program thread so that any pending/blocked requests can be inspected, but that doesn't seem to be available(?). @Patrick Donnelly
Updated by Venky Shankar 13 days ago
/a/vshankar-2024-05-20_10:42:59-fs:upgrade:featureful_client:old_client-reef-release-testing-default-smithi/7715226 is the old_client test suite with the yaml that makes a client feature mandatory removed.
Updated by Leonid Usov 13 days ago
- Related to Bug #66126: Scrub error on inode ... with quiescer added
Updated by Leonid Usov 13 days ago
- Related to deleted (Bug #66126: Scrub error on inode ... with quiescer)
Updated by Venky Shankar 12 days ago
Is it uncommon that there are two entries in /proc/mounts related to network namespaces?
[vshankar@smithi174 ~]$ grep ceph /proc/mounts /dev/mapper/vg_nvme-lv_5 /var/lib/ceph xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 nsfs rw,seclabel 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 nsfs rw,seclabel 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.1 nsfs rw,seclabel 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.1 nsfs rw,seclabel 0 0 ceph-fuse /home/ubuntu/cephtest/mnt.1 fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
This is after a fusermount call for mnt.0 - there are two nsfs entries for the mnt.0 network namespace and the ceph-fuse entry for mnt.0 is gone; however, the process is still running:
ubuntu 42612 0.0 0.0 12732 2816 ? Ss 07:22 0:00 bash -c (cd /home/ubuntu/cephtest && exec sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket '/var/run/ceph/$cluster-$name.$pid.asok' /home/ubuntu/cephtest/mnt.0 --id 0) root 42655 0.0 0.0 123104 7936 ? S 07:22 0:00 sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 42659 0.0 0.0 123104 8192 ? S 07:22 0:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 42684 0.0 0.0 47188 11904 ? S 07:22 0:01 /usr/bin/python3 /bin/daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 42702 0.0 0.1 1218128 33708 ? Ssl 07:22 0:02 ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0
@Xiubo Li Could this be a problem?
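For poking at the leftover namespace state by hand (a sketch; these are the same objects the QA code normally removes via FuseMount.cleanup_stale_netnses_and_bridge):

# duplicate nsfs bind mounts and the named netns itself
grep nsfs /proc/mounts
sudo ip netns list
# a stale entry can be removed manually once nothing references it
sudo umount /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0
sudo ip netns delete ceph-ns--home-ubuntu-cephtest-mnt.0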
Updated by Venky Shankar 12 days ago
This bug has started to look like it has less to do with ceph-fuse -- all attempts to spot a bug in the client code haven't led to any concrete conclusion, esp. https://tracker.ceph.com/issues/64502#note-43 where the blocked threads are basically idle libfuse threads blocked reading from /dev/fuse.
I think this issue has to do with the way the upgrade test is set up, which uses cephadm.
Updated by Xiubo Li 12 days ago
Venky Shankar wrote in #note-49:
Is it uncommon that there are two entries in /proc/mounts related to network namespaces?
[...]
This is after a fusermount call for mnt.0 - there are two nsfs entries for the mnt.0 network namespace and the ceph-fuse entry for mnt.0 is gone; however, the process is still running:
[...]
@Xiubo Li Could this be a problem?
As I recall from when I was working on this, two identical entries for the same mount would be a problem and could cause subsequent mount failures.
In any case, something must be going wrong with cleaning up the nsfs entries.
Updated by Venky Shankar 12 days ago
Xiubo Li wrote in #note-51:
Venky Shankar wrote in #note-49:
Is it uncommon that there are two entries in /proc/mounts related to network namespaces?
[...]
This is after a fusermount call for mnt.0 - there are two nsfs entries for the mnt.0 network namespace and the ceph-fuse entry for mnt.0 is gone; however, the process is still running:
[...]
@Xiubo Li Could this be a problem?
As I recall from when I was working on this, two identical entries for the same mount would be a problem and could cause subsequent mount failures.
In any case, something must be going wrong with cleaning up the nsfs entries.
Is there anything logged (anywhere) that could hint at possible stale mounts being left behind?
Also, just to strengthen my point about this issue not being a user-space client bug - umount2(2) returns success as seen via strace:
165426 readlink("/home", 0x7ffecaf586d0, 1023) = -1 EINVAL (Invalid argument) 165426 readlink("/home/ubuntu", 0x7ffecaf586d0, 1023) = -1 EINVAL (Invalid argument) 165426 readlink("/home/ubuntu/cephtest", 0x7ffecaf586d0, 1023) = -1 EINVAL (Invalid argument) 165426 chdir("/") = 0 165426 getuid() = 0 165426 umask(033) = 022 165426 geteuid() = 0 165426 newfstatat(AT_FDCWD, "/etc/mtab", {st_mode=S_IFLNK|0777, st_size=19, ...}, AT_SYMLINK_NOFOLLOW) = 0 165426 getuid() = 0 165426 getuid() = 0 165426 chdir("/home/ubuntu/cephtest") = 0 165426 getcwd("/home/ubuntu/cephtest", 65536) = 22 165426 getuid() = 0 165426 umount2("mnt.0", UMOUNT_NOFOLLOW) = 0 165426 chdir("/") = 0 165426 newfstatat(AT_FDCWD, "/etc/mtab", {st_mode=S_IFLNK|0777, st_size=19, ...}, AT_SYMLINK_NOFOLLOW) = 0 165426 exit_group(0) = ? 165426 +++ exited with 0 +++
If there was an operation pending on the client, then umount2(2) would have blocked IMO.
Updated by Patrick Donnelly 11 days ago
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out? One other thing to note is that in the upgraded_client test case, we remount the client with the new packages. To eliminate that divergence, you should remount the old_client after the mds is upgraded to see if the issue is reproducible.
Updated by Venky Shankar 10 days ago · Edited
Patrick Donnelly wrote in #note-53:
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out?
This is featureful_client:old_client cluster YAML
cat qa/suites/fs/upgrade/featureful_client/old_client/clusters/1-mds-2-client-micro.yaml
roles:
- [mon.a, mon.b, mon.c, mgr.x, mgr.y, mds.a, mds.b, mds.c, osd.0, osd.1, osd.2, osd.3]
- [client.0]
- [client.1]
openstack:
- volumes: # attached to each instance
    count: 4
    size: 10 # GB
It's just one single node hosting everything. I believe this weirdness is somehow related to containerized daemons and network namespaces.
One other thing to note is that in the upgraded_client test case, we remount the client with the new packages. To eliminate that divergence, you should remount the old_client after the mds is upgraded to see if the issue is reproducible.
Yes, that's right for upgraded_client, but featureful_client:old_client does not restart clients and is therefore pretty much similar to mds_upgrade_sequence. I'll give it a try...
Updated by Patrick Donnelly 9 days ago
Venky Shankar wrote in #note-54:
Patrick Donnelly wrote in #note-53:
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out?
This is featureful_client:old_client cluster YAML
[...]
It's just one single node hosting everything.
Everything but the clients? The clients are on different nodes (smithis).
I believe this weirdness is somehow related to containerized daemons and network namespaces.
One other thing to note is that in the upgraded_client test case, we remount the client with the new packages. To eliminate that divergence, you should remount the old_client after the mds is upgraded to see if the issue is reproducible.
Yes, that's right for upgraded_client, but featureful_client:old_client does not restart clients and is therefore pretty much similar to mds_upgrade_sequence. I'll give it a try...
Updated by Venky Shankar 4 days ago
Patrick Donnelly wrote in #note-55:
Venky Shankar wrote in #note-54:
Patrick Donnelly wrote in #note-53:
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out?
This is featureful_client:old_client cluster YAML
[...]
It's just one single node hosting everything.
Everything but the clients? The clients are on different nodes (smithis).
You are correct - client.0 and client.1 are on separate lines in the yaml (I missed that this yaml does not use the host.<> specifier). Also, old_client is not a containerized deployment.
I'll run a test for mds_upgrade_sequence with the clients on an entirely separate node.