Bug #64502
pacific/quincy/v18.2.0: client: ceph-fuse fails to unmount after upgrade to main
Added by Patrick Donnelly 3 months ago. Updated 4 days ago.
Description
Every ceph-fuse mount for quincy fails to unmount for reef->main:
2024-02-19T19:17:36.535 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi060.front.sepia.ceph.com... 2024-02-19T19:17:36.535 INFO:teuthology.orchestra.run:Running command with timeout 300 2024-02-19T19:17:36.535 DEBUG:teuthology.orchestra.run.smithi060:> sudo fusermount -u /home/ubuntu/cephtest/mnt.0 2024-02-19T19:17:36.562 INFO:teuthology.orchestra.run:waiting for 300
From: /teuthology/pdonnell-2024-02-19_18:28:45-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7566635/teuthology.log
2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 tick 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 collect_and_send_metrics 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 collect_and_send_global_metrics 2024-02-19T19:17:36.799+0000 7f9fa7fff640 1 -- 192.168.0.1:0/854663557 --> [v2:172.21.15.60:6826/3594652577,v1:172.21.15.60:6827/3594652577] -- client_metrics [client_metric_type: READ_LATENCY latency: 5.996942, avg_latency: 0.000330, sq_sum: 86627012816408144, count=17901][client_metric_type: WRITE_LATENCY latency: 23.710221, avg_latency: 0.000407, sq_sum: 1890169673992666112, count=56281][client_metric_type: METADATA_LATENCY latency: 238.430933, avg_latency: 0.005247, sq_sum: 13600282437617256448, count=45341][client_metric_type: CAP_INFO cap_hits: 831286 cap_misses: 14792 num_caps: 0][client_metric_type: DENTRY_LEASE dlease_hits: 67 dlease_misses: 154700 num_dentries: 0][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 1][client_metric_type: PINNED_ICAPS pinned_icaps: 1 total_inodes: 1][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 1][client_metric_type: READ_IO_SIZES total_ops: 22272 total_size: 3731108728][client_metric_type: WRITE_IO_SIZES total_ops: 56281 total_size: 4270138133] v1 -- 0x7f9fa000b9e0 con 0x5637e76f0e80 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 trim_cache size 1 max 16384 2024-02-19T19:17:36.799+0000 7f9fa7fff640 20 client.14548 upkeep thread waiting interval 1.000000000s ... 2024-02-19T20:23:30.865+0000 7f9fc8e36480 2 client.14548 unmounting
From: /teuthology/pdonnell-2024-02-19_18:28:45-fs:upgrade:mds_upgrade_sequence-wip-batrick-testing-20240215.160715-distro-default-smithi/7566635/remote/smithi060/log/ceph-client.0.log.gz
During teardown of the cluster the unmount eventually proceeds, but it's not clear what was blocking it. I think something was holding the RWRef, preventing unmount from proceeding.
Files
gdb.txt (13 KB) | backtrace of all threads in staggered upgrade scenario | Venky Shankar, 05/20/2024 11:29 AM
Updated by Patrick Donnelly 3 months ago
- Related to Bug #64503: client: log message when unmount call is received added
Updated by Patrick Donnelly 3 months ago
- Assignee deleted (Patrick Donnelly)
- Target version set to v19.1.0
- Source set to Development
- Backport set to squid,reef,quincy
Updated by Patrick Donnelly 3 months ago
- Related to Bug #64440: mds: reversed encoding of MDSMap max_xattr_size/bal_rank_mask v18.2.1 <-> main added
Updated by Patrick Donnelly 3 months ago
Unassigning myself to return to other high priority tasks.
This issue is only revealed by the fix for #64440, which turns on ceph-fuse mounts for fs:upgrade:mds_upgrade_sequence.
Updated by Patrick Donnelly 3 months ago
- Related to deleted (Bug #64440: mds: reversed encoding of MDSMap max_xattr_size/bal_rank_mask v18.2.1 <-> main)
Updated by Patrick Donnelly 3 months ago
- Related to Bug #64440: mds: reversed encoding of MDSMap max_xattr_size/bal_rank_mask v18.2.1 <-> main added
Updated by Patrick Donnelly 3 months ago
Also affects pacific and reef v18.2.0 (possibly v18.2.1 too):
See fs:upgrade:mds_upgrade_sequence tests failing with:
"reached maximum tries (51) after waiting for 300 seconds "
Updated by Patrick Donnelly 3 months ago
- Subject changed from client: quincy ceph-fuse fails to unmount after upgrade to main to pacific/quincy/v18.2.0: client: ceph-fuse fails to unmount after upgrade to main
Updated by Venky Shankar 3 months ago
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
- Target version changed from v19.1.0 to v20.0.0
Updated by Venky Shankar 3 months ago
I am working on this. The plan is to gdb the ceph-fuse process after fusermount. Will update by EOD today.
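A minimal sketch of that kind of session (the pgrep pattern and mountpoint are placeholders for the process on the test node, and debuginfo packages need to be installed for the breakpoint to resolve by name):

# find the ceph-fuse process backing the mount under test (placeholder pattern)
pid=$(pgrep -f 'ceph-fuse.*mnt.0' | head -n1)
# attach, break in the unmount path, then run "fusermount -u" from another shell
sudo gdb -p "$pid" -ex 'set pagination off' -ex 'break Client::_unmount' -ex 'continue'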
Updated by Venky Shankar 3 months ago
oh well, infra issues now :/
2024-02-26T07:01:58.420 INFO:teuthology.orchestra.run.smithi057.stderr:/bin/podman: stderr Error: initializing source docker://quay.io/ceph/ceph:v18.2.1: Requesting bearer token: invalid status code from registry 502 (Bad Gateway)
Updated by Venky Shankar 3 months ago
Trying my luck with this today - hopefully no infra issues show up.
Updated by Venky Shankar 3 months ago
Started running into
ceph: stderr Error: OCI runtime error: crun: bpf create ``: Invalid argument
again in
/a/vshankar-2024-02-28_06:39:32-fs:upgrade-main-testing-default-smithi/7575094
Checked with Ilya and looks like the run just picked up the kernel bits when it was getting redeployed. Will have to run the test again when the latest build trigger finishes.
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
Started running into
ceph: stderr Error: OCI runtime error: crun: bpf create ``: Invalid argument
again in
/a/vshankar-2024-02-28_06:39:32-fs:upgrade-main-testing-default-smithi/7575094
Checked with Ilya and looks like the run just picked up the kernel bits when it was getting redeployed. Will have to run the test again when the latest build trigger finishes.
The kclient related issue is fixed now. Tried a sample run yesterday - 32/32 jobs failed. Onto this today!
Updated by Venky Shankar 3 months ago
Finally, with the infra and kclient issues set aside, I was able to gdb the ceph-fuse process and add a breakpoint in Client::_unmount(). To my surprise, the function was never invoked when fusermount was run (the gdb setup had set follow-fork-mode child, and I did try this locally before doing it on the smithi machines). This is weird. I'll see what further I can come up with.
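For reference, a sketch of the fork-related gdb settings involved (these are plain gdb semantics, nothing Ceph-specific; which combination is appropriate depends on whether ceph-fuse forks before the breakpoint is reached):

(gdb) set follow-fork-mode child    # stay on the child after a fork/vfork
(gdb) set detach-on-fork off        # keep both processes attached ("on" releases the non-followed one)
(gdb) break Client::_unmount
(gdb) continue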
Updated by Venky Shankar 3 months ago
Another debug session with set detach-on-fork on, which is supposed to let gdb debug both parent and child processes. Had a breakpoint in Client::_unmount:
[New LWP 82130]
[New LWP 82141]
[New LWP 82143]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fada1e9c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install ceph-fuse-17.2.7-560.g931b9f17.el9.x86_64
(gdb) set detach-on-fork on
(gdb) b Client::_unmount
Breakpoint 1 at 0x563ac1a8fa00
(gdb) c
Continuing.
fusermount was invoked at timestamp:
2024-03-04T16:50:24.033 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi003.front.sepia.ceph.com... 2024-03-04T16:50:24.033 DEBUG:teuthology.orchestra.run.smithi003:> sudo fusermount -u /home/ubuntu/cephtest/mnt.0
However, the breakpoint was triggered only during cluster teardown. I noted the timestamp at that point.
Teardown:
2024-03-04T16:55:23.199 ERROR:tasks.cephfs.fuse_mount:process failed to terminate after unmount. This probably indicates a bug within ceph-fuse.
2024-03-04T16:55:23.199 ERROR:teuthology.run_tasks:Manager failed: ceph-fuse
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_b1dac5519c57c269ea98a22fb7729016a1d0e4d2/teuthology/run_tasks.py", line 154, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_vshankar_ceph_9822b9f8f96d88d1e7441f32961ead3c521c675e/qa/tasks/ceph_fuse.py", line 183, in task
    mount.umount_wait()
  File "/home/teuthworker/src/github.com_vshankar_ceph_9822b9f8f96d88d1e7441f32961ead3c521c675e/qa/tasks/cephfs/fuse_mount.py", line 403, in umount_wait
    run.wait([self.fuse_daemon], timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_b1dac5519c57c269ea98a22fb7729016a1d0e4d2/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_teuthology_b1dac5519c57c269ea98a22fb7729016a1d0e4d2/teuthology/contextutil.py", line 134, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (51) after waiting for 300 seconds
gdb breakpoint:
[Detaching after vfork from child process 128806]
[Detaching after vfork from child process 128807]
[Detaching after vfork from child process 129019]
[Detaching after vfork from child process 129043]
[Thread 0x7fad81600640 (LWP 82141) exited]
[Thread 0x7fad6d600640 (LWP 82143) exited]
[Thread 0x7fad63400640 (LWP 82130) exited]
[Thread 0x7fad79600640 (LWP 82083) exited]
[Thread 0x7fad7aa00640 (LWP 82082) exited]
[Thread 0x7fad6ea00640 (LWP 82081) exited]
[Thread 0x7fad6e000640 (LWP 82079) exited]
[Thread 0x7fad78c00640 (LWP 82078) exited]
[Thread 0x7fad63e00640 (LWP 82025) exited]
[Thread 0x7fad6cc00640 (LWP 82014) exited]
Thread 1 "ceph-fuse" hit Breakpoint 1, 0x0000563ac1a8fa00 in Client::_unmount(bool) ()
(gdb) n
Single stepping until exit from function _ZN6Client8_unmountEb, which has no line number information.
[Thread 0x7fad82a00640 (LWP 76392) exited]
[Thread 0x7fad8b400640 (LWP 76385) exited]
0x0000563ac1a2fbc3 in main ()
(gdb) bt
#0  0x0000563ac1a2fbc3 in main ()
(gdb) q
A debugging session is active.
  Inferior 1 [process 76351] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/bin/ceph-fuse, process 76351
[Inferior 1 (process 76351) detached]
[vshankar@smithi003 ~]$ date
Mon Mar 4 04:57:52 PM UTC 2024
Updated by Venky Shankar 3 months ago
Continuing on this today - fusermount(1) is basically invoking umount2(2). Will try to see what's going on.
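A quick way to confirm that from the client node (a sketch; the mountpoint is the one used by these QA runs):

# umount2() should show up in the trace if fusermount really issues it
sudo strace -f -e trace=umount2 fusermount -u /home/ubuntu/cephtest/mnt.0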
Updated by Venky Shankar 3 months ago
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Scrap that. The test branch had a change[*] that caused the MDS to assert. But I do have some findings to share. I introduced a custom qa change to skip the fusermount and add a long enough sleep so that the fuse mount could be debugged.
-        log.info('Unmounting ceph-fuse clients...')
-
-        for info in mounted_by_me.values():
-            # Conditional because an inner context might have umounted it
-            mount = info["mount"]
-            if mount.is_mounted():
-                mount.umount_wait()
-        for remote in remotes:
-            FuseMount.cleanup_stale_netnses_and_bridge(remote)
+        log.info('Not unmounting ceph-fuse clients (hack)...')
+        time.sleep(3600)
+
+        #for info in mounted_by_me.values():
+        #    # Conditional because an inner context might have umounted it
+        #    mount = info["mount"]
+        #    if mount.is_mounted():
+        #        mount.umount_wait()
+        #for remote in remotes:
+        #    FuseMount.cleanup_stale_netnses_and_bridge(remote)
Sample run here with main branch: https://pulpito.ceph.com/vshankar-2024-03-11_05:30:10-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7590762/ which failed, but more importantly I was able to ssh to the node which had the fuse mount.
[vshankar@smithi132 ~]$ findmnt ... ... └─/home/ubuntu/cephtest/mnt.0 ceph-fuse fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other [vshankar@smithi132 ~]$ sudo fusermount -u /home/ubuntu/cephtest/mnt.0 [vshankar@smithi132 ~]$ findmnt <ceph-fuse mount gone>
... and likewise with the other ceph-fuse mount on smithi187. Client logs also confirm the same.
Updated by Patrick Donnelly 3 months ago
Venky Shankar wrote:
Venky Shankar wrote:
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Scrap that. The test branch had a change[*] that caused the MDS to assert. But I do have some findings to share. I introduced a custom qa change to skip the fusermount and add a long enough sleep so that the fuse mount could be debugged.
[...]
Sample run here with main branch: https://pulpito.ceph.com/vshankar-2024-03-11_05:30:10-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7590762/ which failed, but more importantly I was able to ssh to the node which had the fuse mount.
[...]
... and likewise with the other ceph-fuse mount on smithi187. Client logs also confirm the same.
This is the worst kind of bug. I don't have a great suggestion beyond stating that the obvious yet highly improbable explanation is that the QA suite executing `fusermount -u ...` is clearly not actually running the command. It doesn't make sense but something odd is going on. Perhaps it's a user permission (fusermount uid != ceph-fuse uid) issue.
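A couple of quick checks for the permission theory (a sketch; the mountpoint and client id are the ones used by these QA runs):

# uid of the running ceph-fuse process vs. the uid recorded on the mount
ps -o pid,uid,user,args -C ceph-fuse
findmnt -o TARGET,FSTYPE,OPTIONS /home/ubuntu/cephtest/mnt.0   # look at user_id=/group_id=
# note: teuthology runs fusermount under sudo, so the usual "only the mounting
# user may unmount" restriction should not apply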
Updated by Venky Shankar 3 months ago
Patrick Donnelly wrote:
Venky Shankar wrote:
Venky Shankar wrote:
Venky Shankar wrote:
This is not as bad as it looks. The ceph-fuse process seems to be exiting as usual - it's somewhere in the qa world where things are timing out.
I take that back. None of the MDSs are active after the upgrade to squid (down:dne).
Scrap that. The test branch had a change[*] that caused the MDS to assert. But I do have some findings to share. I introduced a custom qa change to skip the fusermount and add a long enough sleep so that the fuse mount could be debugged.
[...]
Sample run here with main branch: https://pulpito.ceph.com/vshankar-2024-03-11_05:30:10-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7590762/ which failed, but more importantly I was able to ssh to the node which had the fuse mount.
[...]
... and likewise with the other ceph-fuse mount on smithi187. Client logs also confirm the same.
This is the worst kind of bug. I don't have a great suggestion beyond stating that the obvious yet highly improbable explanation is that the QA suite executing `fusermount -u ...` is clearly not actually running the command. It doesn't make sense but something odd is going on. Perhaps it's a user permission (fusermount uid != ceph-fuse uid) issue.
I strace'd the fusermount call and the process exited normally (exitcode = 0) and the umount2(2) call returned 0 (success).
Updated by Milind Changire 2 months ago
Updated by Venky Shankar 2 months ago
This issue is likely related to the libfuse library in centos.9. To confirm, I'll run the tests with centos.8 and verify. Will report in a day or two.
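For reference, the distro can be pinned when scheduling the run (a sketch of a teuthology-suite invocation; the suite/branch values are the ones from this tracker, and the exact flags may differ between teuthology versions):

teuthology-suite -m smithi -c main -s fs:upgrade:mds_upgrade_sequence \
  --distro centos --distro-version 8.stream -p 100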
Updated by Venky Shankar 2 months ago
centos8.stream run w/ main branch: https://pulpito.ceph.com/vshankar-2024-04-01_04:53:22-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/
Updated by Venky Shankar 2 months ago
- Related to Bug #61265: qa: tasks.cephfs.fuse_mount:process failed to terminate after unmount added
Updated by Venky Shankar about 2 months ago
Venky Shankar wrote:
centos8.stream run w/ main branch: https://pulpito.ceph.com/vshankar-2024-04-01_04:53:22-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/
This run has cephadm upgrade issues.
Updated by Laura Flores about 2 months ago
/a/yuriw-2024-04-09_14:58:21-upgrade-wip-yuri4-testing-2024-04-08-1432-distro-default-smithi/7648857
Updated by Venky Shankar about 2 months ago · Edited
I have a custom quincy branch (patched with debug in ceph-fuse/fuse_ll). That should give us enough debug to see what's happening.
EDIT: https://shaman.ceph.com/builds/ceph/wip-64502-quincy-2/
Updated by Venky Shankar about 2 months ago
diff --git a/src/ceph_fuse.cc b/src/ceph_fuse.cc
index 3fa5346b463..061d9f576e4 100644
--- a/src/ceph_fuse.cc
+++ b/src/ceph_fuse.cc
@@ -331,14 +331,19 @@ int main(int argc, const char **argv, const char *envp[]) {
     cerr << "ceph-fuse[" << getpid() << "]: starting fuse" << std::endl;
     tester.init(cfuse, client);
     tester.create("tester");
+    cout << "ceph_fuse: starting fuse loop" << std::endl;
     r = cfuse->loop();
+    cout << "ceph_fuse: fuse loop returned - joining" << std::endl;
     tester.join(&tester_rp);
+    cout << "ceph_fuse: fuse loop returned - joined" << std::endl;
     tester_r = static_cast<int>(reinterpret_cast<uint64_t>(tester_rp));
     cerr << "ceph-fuse[" << getpid() << "]: fuse finished with error " << r
          << " and tester_r " << tester_r <<std::endl;
 
   out_client_unmount:
+    cout << "calling unmount" << std::endl;
     client->unmount();
+    cout << "calling finalize" << std::endl;
     cfuse->finalize();
   out_shutdown:
     icp.stop();
diff --git a/src/client/fuse_ll.cc b/src/client/fuse_ll.cc
index 109aa25094d..765fdd7ffc7 100644
--- a/src/client/fuse_ll.cc
+++ b/src/client/fuse_ll.cc
@@ -1612,21 +1612,31 @@ int CephFuse::Handle::loop()
   auto fuse_multithreaded = client->cct->_conf.get_val<bool>(
     "fuse_multithreaded");
   if (fuse_multithreaded) {
+    derr << "multithreaded fuse" << dendl;
 #if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 1)
     {
       struct fuse_loop_config conf = {
        clone_fd: opts.clone_fd,
        max_idle_threads: opts.max_idle_threads
       };
-      return fuse_session_loop_mt(se, &conf);
+      int r = fuse_session_loop_mt(se, &conf);
+      derr << "fuse_session_loop_mt 1 -- done" << dendl;
+      return r;
     }
 #elif FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
-    return fuse_session_loop_mt(se, opts.clone_fd);
+    int r = fuse_session_loop_mt(se, opts.clone_fd);
+    derr << "fuse_session_loop_mt 2 -- done" << dendl;
+    return r;
 #else
-    return fuse_session_loop_mt(se);
+    int r = fuse_session_loop_mt(se);
+    derr << "fuse_session_loop_mt 3 -- done" << dendl;
+    return r;
 #endif
   } else {
-    return fuse_session_loop(se);
+    derr << "singlethreaded fuse" << dendl;
+    int r = fuse_session_loop(se);
+    derr << "fuse_session_loop_mt 4 -- done" << dendl;
+    return r;
   }
 }
And patched up the yaml to use the custom quincy build to upgrade to. Here is the run: https://pulpito.ceph.com/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889/
Interestingly
vshankar@teuthology:/a/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889$ egrep "ceph_fuse\:|fusermount|unmount|finalize|fuse_session_loop_mt" teuthology.log 2024-04-12T13:38:54.194 INFO:tasks.ceph_fuse:Running ceph_fuse task... 2024-04-12T13:38:54.195 INFO:tasks.ceph_fuse:config is {'client.0': {}, 'client.1': {}} 2024-04-12T13:38:54.195 INFO:tasks.ceph_fuse:client.0 config is {} 2024-04-12T13:38:54.195 INFO:tasks.ceph_fuse:client.1 config is {} 2024-04-12T13:38:54.522 INFO:tasks.ceph_fuse:Mounting ceph-fuse clients... 2024-04-12T13:39:01.752 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:ceph_fuse: starting fuse loop 2024-04-12T13:39:10.578 INFO:tasks.cephfs.fuse_mount.ceph-fuse.1.smithi110.stdout:ceph_fuse: starting fuse loop 2024-04-12T13:56:40.623 INFO:tasks.ceph_fuse:Unmounting ceph-fuse clients... 2024-04-12T13:56:45.440 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi060.front.sepia.ceph.com... 2024-04-12T13:56:45.440 DEBUG:teuthology.orchestra.run.smithi060:> sudo fusermount -u /home/ubuntu/cephtest/mnt.0 2024-04-12T14:01:44.607 ERROR:tasks.cephfs.fuse_mount:process failed to terminate after unmount. This probably indicates a bug within ceph-fuse. 2024-04-12T14:02:51.770 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:ceph_fuse: fuse loop returned - joining 2024-04-12T14:02:51.770 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:ceph_fuse: fuse loop returned - joined 2024-04-12T14:02:51.771 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stderr:2024-04-12T14:02:51.768+0000 7fd910abf300 -1 fuse_session_loop_mt 3 -- done 2024-04-12T14:02:51.771 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:calling unmount 2024-04-12T14:03:21.770 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi060.stdout:calling finalize vshankar@teuthology:/a/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889$
fusermount had no effect at all on the ceph-fuse daemon. One would expect fuse_session_loop_mt() to unblock and fuse_ll.cc::loop() to return. It's only when the cluster teardown was under way that the relevant debug logs are seen.
I'm going to run this again with the fuse_debug client option enabled - that adds a -d argument to the daemon and should give us debugging logs from libfuse (at least the config option help says so).
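For a manual reproduction outside teuthology, the same option can be injected on the command line (a sketch, assuming the generic --<option>=<value> config override that ceph daemons accept; the mountpoint is a placeholder):

# fuse_debug=true passes -d through to libfuse so every FUSE request/reply is printed
sudo ceph-fuse -f --id 0 --fuse_debug=true /mnt/cephfs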
Updated by Patrick Donnelly about 2 months ago
Venky Shankar wrote in #note-31:
[...]
And patched up the yaml to use the custom quincy build to upgrade to. Here is the run: https://pulpito.ceph.com/vshankar-2024-04-12_13:03:29-fs:upgrade:mds_upgrade_sequence-main-testing-default-smithi/7653889/
Interestingly
[...]
fusermount had no effect at all on the ceph-fuse daemon. One would expect fuse_session_loop_mt() to unblock and fuse_ll.cc::loop() to return. It's only when the cluster teardown was under way that the relevant debug logs are seen.
I'm going to run this again with the fuse_debug client option enabled - that adds a -d argument to the daemon and should give us debugging logs from libfuse (at least the config option help says so).
Another avenue to pursue is to gdb attach the ceph-fuse process and see what the fuse dispatcher threads are doing. It's possible there is a bug in the Client blocking one or more of those threads and thereby preventing the fuse loop function from returning; I suspect libfuse wants to join all of its dispatcher threads before it returns from its main loop. I looked at the client log but didn't see anything which would indicate why cluster teardown would prompt the libfuse loop to finally terminate. There are a lot of connection refused messages for reconnects to the mons. That's about it.
Updated by Venky Shankar about 2 months ago
OK. So this bug has upgrades written all over it - it seemed obvious given that this is an upgrade task, but we were trying to see what's going on with fuse. After some test iterations, it's now clear that the cluster upgrade is somehow causing this blowup. I patched up mds_upgrade_sequence to not upgrade, and the umount hang isn't seen. So, reconnects are involved and I think this has to do with more than just the MDS (read: monitors).
Updated by Venky Shankar about 2 months ago
Venky Shankar wrote in #note-33:
OK. So this bug has upgrades written all over it - it seemed obvious given that this is an upgrade task, but we were trying to see what's going on with fuse. After some test iterations, it's now clear that the cluster upgrade is somehow causing this blowup. I patched up mds_upgrade_sequence to not upgrade, and the umount hang isn't seen. So, reconnects are involved and I think this has to do with more than just the MDS (read: monitors).
Oh, one more thing - in a different test iteration I patched up the upgrade yaml to not run the workload-task (which runs fsstress) but still let the cluster upgrade proceed. So, a simple cluster upgrade without any workload is enough to trigger this umount hang.
Updated by Venky Shankar about 1 month ago
- Related to Bug #50260: pacific: qa: "rmdir: failed to remove '/home/ubuntu/cephtest': Directory not empty" added
Updated by Patrick Donnelly about 1 month ago
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
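One way to check for such a stuck request without gdb (a sketch, assuming the admin socket path these QA mounts are started with; substitute the actual $cluster-$name.$pid asok file) is to query the client's admin socket while the unmount is hanging:

# in-flight MDS requests and session state on the stuck client
sudo ceph daemon /var/run/ceph/ceph-client.0.*.asok mds_requests
sudo ceph daemon /var/run/ceph/ceph-client.0.*.asok mds_sessions
sudo ceph daemon /var/run/ceph/ceph-client.0.*.asok status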
Updated by Venky Shankar about 1 month ago
Patrick Donnelly wrote in #note-36:
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
That should be relatively easy to figure out with the reduced log noise from dropping the workload-task and just upgrading the MDS with cephadm orch upgrade start ... --daemon-types mds (it's just that I've been unsuccessful in getting a test run done in teuthology).
Updated by Venky Shankar about 1 month ago
Venky Shankar wrote in #note-37:
Patrick Donnelly wrote in #note-36:
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
That should be relatively easy to figure out with the reduced log noise from dropping the workload-task and just upgrading the MDS with cephadm orch upgrade start ... --daemon-types mds (it's just that I've been unsuccessful in getting a test run done in teuthology).
Scrap that. This doesn't work - cephadm enforces an upgrade order:
2024-04-30T10:08:54.784 INFO:teuthology.orchestra.run.smithi037.stderr:NOTE: Enforced upgrade order is: mgr -> mon -> crash -> osd -> mds -> rgw -> rbd-mirror -> cephfs-mirror -> iscsi -> nfs
Anyway, I'll read the code carefully to see what's going on.
Updated by Venky Shankar about 1 month ago
Patrick Donnelly wrote in #note-36:
It sounds more and more to me like there is some kind of request the client continues to wait on that is blocking the fuse loop. That gets blocked by the MDS upgrade and no retry is performed. The cluster teardown must kick things in a way for the request to finally stop?
The only messages the client is exchanging with the MDS around the time the MDS upgrade is initiated are the metrics messages:
2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 tick 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 collect_and_send_metrics 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 collect_and_send_global_metrics 2024-04-30T11:23:07.583+0000 7f5cc3400640 1 -- 192.168.0.1:0/653481002 --> [v2:172.21.15.37:6826/2024972561,v1:172.21.15.37:6827/2024972561] -- client_metrics [client_metric_type: READ_LATENCY latency: 0.000000, avg_latency: 0.000000, sq_sum: 0, count=0][client_metric_type: WRITE_LATENCY latency: 0.000000, avg_latency: 0.000000, sq_sum: 0, count=0][client_metric_type: METADATA_LATENCY latency: 0.004497, avg_latency: 0.001499, sq_sum: 1961013495640, count=3][client_metric_type: CAP_INFO cap_hits: 348 cap_misses: 3 num_caps: 0][client_metric_type: DENTRY_LEASE dlease_hits: 0 dlease_misses: 0 num_dentries: 0][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 1][client_metric_type: PINNED_ICAPS pinned_icaps: 1 total_inodes: 1][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 1][client_metric_type: READ_IO_SIZES total_ops: 0 total_size: 0][client_metric_type: WRITE_IO_SIZES total_ops: 0 total_size: 0] v1 -- 0x7f5cb4004500 con 0x7f5cbc01da30 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 trim_cache size 0 max 16384 2024-04-30T11:23:07.583+0000 7f5cc3400640 20 client.14522 upkeep thread waiting interval 1.000000000s
I won't be surprised if this bug is a foul-up of that. That's my debug point for now.
Updated by Venky Shankar about 1 month ago
It's not the metrics stuff but a bug when the MDS sends back a client_session(open) during client reconnect (post MDS upgrade). There is code that signals mount_cond, and that's not executed in this case. I'm testing out a possible fix.
Updated by Patrick Donnelly about 1 month ago
Venky Shankar wrote in #note-40:
It's not the metrics stuff but a bug when the MDS sends back a client_session(open) during client reconnect (post MDS upgrade). There is code that signals mount_cond, and that's not executed in this case. I'm testing out a possible fix.
Excellent, looking forward to figuring this out =)
Updated by Venky Shankar 17 days ago
Patrick Donnelly wrote in #note-41:
Venky Shankar wrote in #note-40:
It's not the metrics stuff but a bug when the MDS sends back a client_session(open) during client reconnect (post MDS upgrade). There is code that signals mount_cond, and that's not executed in this case. I'm testing out a possible fix.
Excellent, looking forward to figuring this out =)
Back on this soon. Teuthology acted up when I planned to test the fix.
Updated by Venky Shankar 13 days ago · Edited
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-octopus.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-pacific.yaml similarity index 82% rename from qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-octopus.yaml rename to qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-pacific.yaml index e7774423fa5..b74accc69f0 100644 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-octopus.yaml +++ b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/0-pacific.yaml @@ -1,9 +1,9 @@ meta: - desc: | - install ceph/octopus latest + install ceph/pacific latest tasks: - install: - branch: octopus + branch: pacific exclude_packages: - librados3 - ceph-mgr-dashboard @@ -13,7 +13,7 @@ tasks: - cephadm - ceph-volume extra_packages: ['librados2'] -- print: "**** done installing octopus" +- print: "**** done installing pacific" - ceph: log-ignorelist: - overall HEALTH_ @@ -35,5 +35,5 @@ tasks: ms bind msgr2: false - exec: osd.0: - - ceph osd set-require-min-compat-client octopus + - ceph osd set-require-min-compat-client pacific - print: "**** done ceph" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml index 976d6e2650b..55fd27c50d0 100644 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml +++ b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/1-client.yaml @@ -1,8 +1,3 @@ tasks: - ceph-fuse: -- print: "**** done octopus client" -- workunit: - clients: - all: - - suites/fsstress.sh -- print: "**** done fsstress" +- print: "**** done pacific client" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml index 26c185946dd..00e82d87151 100644 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml +++ b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml @@ -22,7 +22,7 @@ tasks: - install.upgrade: # upgrade the single cluster node, which is running all the mon/mds/osd/mgr daemons mon.a: - branch: quincy + branch: reef - print: "**** done install.upgrade the host" - ceph.restart: daemons: [mon.*, mgr.*] @@ -42,7 +42,7 @@ tasks: mon.a: - ceph osd dump -f json-pretty - ceph versions - - ceph osd require-osd-release quincy + - ceph osd require-osd-release reef - for f in `ceph osd pool ls` ; do ceph osd pool set $f pg_autoscale_mode off ; done - ceph.healthy: - print: "**** done ceph.restart" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/.qa b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/.qa deleted file mode 120000 index a602a0353e7..00000000000 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/.qa +++ /dev/null @@ -1 +0,0 @@ -../.qa/ \ No newline at end of file diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/no.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/no.yaml deleted file mode 100644 index b495eb41b54..00000000000 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/no.yaml +++ /dev/null @@ -1,6 +0,0 @@ -tasks: -- workunit: - clients: - all: - - suites/fsstress.sh -- print: "**** done fsstress" diff --git a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/quincy.yaml b/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/quincy.yaml 
deleted file mode 100644 index 138d8f4e290..00000000000 --- a/qa/suites/fs/upgrade/featureful_client/old_client/tasks/3-compat_client/quincy.yaml +++ /dev/null @@ -1,12 +0,0 @@ -overrides: - ceph: - log-ignorelist: - - missing required features -tasks: -- exec: - mon.a: - - ceph fs dump --format=json-pretty - - ceph fs required_client_features cephfs add metric_collect -- sleep: - duration: 5 -- fs.clients_evicted:
The umount unresponsiveness isn't seen and the test runs successfully. The difference between this and mds_upgrade_sequence is the way the upgrade is performed. old_client uses the install.upgrade task (since it basically tests upgrading from octopus, for which cephadm isn't supported) to update packages for mon/mgr/osd/mds (qa/suites/fs/upgrade/featureful_client/old_client/tasks/2-upgrade.yaml) and then restarts the daemons, while mds_upgrade_sequence performs a staggered upgrade of the ceph daemons.
The cephfs client in the staggered upgrade scenario has multiple client_reconnect messages - one for each reconnect when an MDS is failed over (on upgrade):
2024-05-17T17:59:47.189+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.163:6828/2124582105,v1:172.21.15.163:6829/2124582105] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x5584ab952340 2024-05-17T18:00:10.755+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.188:6824/2400944679,v1:172.21.15.188:6825/2400944679] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x7f0e440117f0 2024-05-17T18:00:32.041+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.188:6826/2689927205,v1:172.21.15.188:6827/2689927205] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x7f0e440142e0 2024-05-17T18:00:54.079+0000 7f0e4ea00700 1 -- 192.168.0.1:0/2023218024 --> [v2:172.21.15.163:6826/3987377094,v1:172.21.15.163:6827/3987377094] -- client_reconnect(1 caps 1 realms ) v5 -- 0x7f0e50012750 con 0x7f0e44010fe0
While in the other case (old_client) there is just a single reconnect (since the daemons are restarted after installing new packages):
2024-05-18T18:21:40.110+0000 7fe54cc00700 1 -- 172.21.15.64:0/902497327 --> v1:172.21.15.39:6819/529060055 -- client_reconnect(1 caps 1 realms ) v5 -- 0x7fe55000a920 con 0x55f38a96d130
I was able to gdb into the ceph-fuse process in the staggered upgrade scenario (normally the backtraces weren't helpful since the frames weren't descriptive, esp. when examining frames in the pthread library), but this time I captured the frames (attached log). The interesting part here is:
Thread 25 (Thread 0x7fb72e000700 (LWP 98680)):
#0  0x00007fb75592f0aa in splice () from /lib64/libc.so.6
#1  0x00007fb762215a05 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#2  0x00007fb7622143e3 in fuse_do_work () from /lib64/libfuse.so.2
#3  0x00007fb7578081ca in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb755839e73 in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7fb72ea00700 (LWP 98678)):
#0  0x00007fb75592f0aa in splice () from /lib64/libc.so.6
#1  0x00007fb762215a05 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#2  0x00007fb7622143e3 in fuse_do_work () from /lib64/libfuse.so.2
#3  0x00007fb7578081ca in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb755839e73 in clone () from /lib64/libc.so.6
and obviously the main fuse loop:
Thread 1 (Thread 0x7fb7629333c0 (LWP 98627)):
#0  0x00007fb757810da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fb757810e98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fb762214698 in fuse_session_loop_mt () from /lib64/libfuse.so.2
#3  0x0000557df369b2ae in main ()
It does look like a couple of fuse threads are blocked on request completion. Nothing really suspicious in the client log apart from one small bit:
In the old_client case, just before reconnect, there is a _ll_getattr request that does not find the required caps to satisfy the request, so the client tries to contact the MDS (_getattr mask pAsLsXsFs issued=0):
2024-05-18T18:21:39.098+0000 7fe536000700 8 client.4430 _ll_getattr 0x1.head 2024-05-18T18:21:39.098+0000 7fe536000700 10 client.4430 _getattr mask pAsLsXsFs issued=0 2024-05-18T18:21:39.098+0000 7fe536000700 20 client.4430 choose_target_mds starting with req->inode 0x1.head(faked_ino=0 nref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2024-05-18T18:18:51.843703+0000 mtime=2024-05-18T18:18:51.843703+0000 ctime=202 4-05-18T18:19:15.738763+0000 change_attr=1 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7fe538007f30) 2024-05-18T18:21:39.098+0000 7fe536000700 20 client.4430 choose_target_mds 0x1.head(faked_ino=0 nref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2024-05-18T18:18:51.843703+0000 mtime=2024-05-18T18:18:51.843703+0000 ctime=2024-05-18T18:19:15.738763+0 000 change_attr=1 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7fe538007f30) is_hash=0 hash=0 2024-05-18T18:21:39.098+0000 7fe536000700 10 client.4430 choose_target_mds from caps on inode 0x1.head(faked_ino=0 nref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2024-05-18T18:18:51.843703+0000 mtime=2024-05-18T18:18:51.843703+0000 ctime=2024-05-1 8T18:19:15.738763+0000 change_attr=1 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7fe538007f30) 2024-05-18T18:21:39.098+0000 7fe536000700 20 client.4430 mds is 0 2024-05-18T18:21:39.098+0000 7fe536000700 10 client.4430 target mds.0 not active, waiting for new mdsmap
When the MDS is back up:active, this (waiting for mdsmap) request is kicked. However, in the staggered upgrade scenario, _ll_getattr always finds the required caps:
2024-05-17T17:59:46.188+0000 7f0e37e00700 8 client.14540 _ll_getattr 0x1.head 2024-05-17T17:59:46.188+0000 7f0e37e00700 10 client.14540 _getattr mask pAsLsXsFs issued=1 2024-05-17T17:59:46.188+0000 7f0e37e00700 10 client.14540 fill_stat on 0x1 snap/devhead mode 041777 mtime 2024-05-17T17:50:26.136511+0000 ctime 2024-05-17T17:51:12.209018+0000 2024-05-17T17:59:46.188+0000 7f0e37e00700 3 client.14540 ll_getattr 0x1.head = 0 2024-05-17T17:59:46.189+0000 7f0e37e00700 8 client.14540 _ll_forget 0x1 1 2024-05-17T17:59:46.189+0000 7f0e40c00700 8 client.14540 _ll_getattr 0x1.head 2024-05-17T17:59:46.189+0000 7f0e40c00700 10 client.14540 _getattr mask pAsLsXsFs issued=1 2024-05-17T17:59:46.189+0000 7f0e40c00700 10 client.14540 fill_stat on 0x1 snap/devhead mode 041777 mtime 2024-05-17T17:50:26.136511+0000 ctime 2024-05-17T17:51:12.209018+0000 2024-05-17T17:59:46.189+0000 7f0e40c00700 3 client.14540 ll_getattr 0x1.head = 0 2024-05-17T17:59:46.189+0000 7f0e40c00700 8 client.14540 _ll_forget 0x1 1
This sounds irrelevant (since the client can always be missing some capabilities), but this (issued=1) is always true just before the client_reconnect message is initiated, and I've seen this in multiple runs.
I'll link two fresh runs (old_client and mds_upgrade_sequence) once they finish, for reference and just to be sure the above observation is somewhere relevant. We may be running into a bug where some requests do not get kicked and then the fuse worker threads are blocked on them.
Updated by Venky Shankar 13 days ago
https://tracker.ceph.com/issues/64502#note-43 was updated when I was still updating it. Please re-read the update in the tracker rather than in email.
Updated by Venky Shankar 13 days ago
/a/vshankar-2024-05-20_11:13:32-fs:upgrade:mds_upgrade_sequence-reef-release-testing-default-smithi/7715229 is currently available for 90m for gdb'ing ceph-fuse
ps -eaf | grep ceph-fuse ubuntu 73853 28031 0 11:42 ? 00:00:00 bash -c (cd /home/ubuntu/cephtest && exec sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket '/var/run/ceph/$cluster-$name.$pid.asok' /home/ubuntu/cephtest/mnt.0 --id 0) root 73896 73853 0 11:42 ? 00:00:00 sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 73900 73896 0 11:42 ? 00:00:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 73912 73900 0 11:42 ? 00:00:00 /usr/bin/python3 /bin/daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 73929 73912 0 11:42 ? 00:00:01 ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 vshankar 145434 142240 0 12:17 pts/0 00:00:00 grep --color=auto ceph-fuse
The two fuse threads that are probably waiting for a reply are blocked in splice():
-- 73979 --
[<0>] fuse_dev_do_read.isra.17.constprop.20+0x942/0x970 [fuse]
[<0>] fuse_dev_splice_read+0xb0/0x250 [fuse]
[<0>] splice_file_to_pipe+0x78/0x80
[<0>] do_splice+0x5b0/0x890
[<0>] __do_splice+0xca/0x180
[<0>] __x64_sys_splice+0x99/0x110
[<0>] do_syscall_64+0x7f/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x71/0x79
-- 73980 --
[<0>] fuse_dev_do_read.isra.17.constprop.20+0x942/0x970 [fuse]
[<0>] fuse_dev_splice_read+0xb0/0x250 [fuse]
[<0>] splice_file_to_pipe+0x78/0x80
[<0>] do_splice+0x5b0/0x890
[<0>] __do_splice+0xca/0x180
[<0>] __x64_sys_splice+0x99/0x110
[<0>] do_syscall_64+0x7f/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x71/0x79
An ideal approach would be to get hold of the CephFuse *cfuse from the main program thread so that any pending/blocked requests can be inspected, but that doesn't seem to be available(?). @Patrick Donnelly
Updated by Venky Shankar 13 days ago
/a/vshankar-2024-05-20_10:42:59-fs:upgrade:featureful_client:old_client-reef-release-testing-default-smithi/7715226 is the old_client test suite with the yaml that makes a client feature mandatory removed.
Updated by Leonid Usov 13 days ago
- Related to Bug #66126: Scrub error on inode ... with quiescer added
Updated by Leonid Usov 13 days ago
- Related to deleted (Bug #66126: Scrub error on inode ... with quiescer)
Updated by Venky Shankar 12 days ago
Is it uncommon that there are two entries in /proc/mounts related to network namespaces?
[vshankar@smithi174 ~]$ grep ceph /proc/mounts /dev/mapper/vg_nvme-lv_5 /var/lib/ceph xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 nsfs rw,seclabel 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 nsfs rw,seclabel 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.1 nsfs rw,seclabel 0 0 nsfs /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.1 nsfs rw,seclabel 0 0 ceph-fuse /home/ubuntu/cephtest/mnt.1 fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
This is after a fusermount call for mnt.0 - there are two nsfs entries for the mnt.0 network namespace and the ceph-fuse entry for mnt.0 is gone; however, the process is still running:
ubuntu 42612 0.0 0.0 12732 2816 ? Ss 07:22 0:00 bash -c (cd /home/ubuntu/cephtest && exec sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket '/var/run/ceph/$cluster-$name.$pid.asok' /home/ubuntu/cephtest/mnt.0 --id 0) root 42655 0.0 0.0 123104 7936 ? S 07:22 0:00 sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 42659 0.0 0.0 123104 8192 ? S 07:22 0:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 42684 0.0 0.0 47188 11904 ? S 07:22 0:01 /usr/bin/python3 /bin/daemon-helper kill ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0 root 42702 0.0 0.1 1218128 33708 ? Ssl 07:22 0:02 ceph-fuse -f --admin-socket /var/run/ceph/$cluster-$name.$pid.asok /home/ubuntu/cephtest/mnt.0 --id 0
@Xiubo Li Could this be a problem?
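For poking at the leftover namespace state by hand (a sketch; these are the same objects the QA code normally removes via FuseMount.cleanup_stale_netnses_and_bridge):

# duplicate nsfs bind mounts and the named netns itself
grep nsfs /proc/mounts
sudo ip netns list
# a stale entry can be removed manually once nothing references it
sudo umount /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0
sudo ip netns delete ceph-ns--home-ubuntu-cephtest-mnt.0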
Updated by Venky Shankar 12 days ago
This bug has started to look like it has less to do with ceph-fuse -- all attempts to spot a bug in the client code haven't led to any concrete conclusion, esp. https://tracker.ceph.com/issues/64502#note-43 where the blocked threads are basically idle libfuse threads blocked reading from /dev/fuse.
I think this issue has to do with the way the upgrade test is set up, which uses cephadm.
Updated by Xiubo Li 12 days ago
Venky Shankar wrote in #note-49:
Is it uncommon that there are two entries in /proc/mounts related to network namespaces?
[...]
This is after a fusermount call for mnt.0 - there are two nsfs entries for the mnt.0 network namespace and the ceph-fuse entry for mnt.0 is gone; however, the process is still running:
[...]
@Xiubo Li Could this be a problem?
As I recall from when I was working on this, two identical entries for the same mount would be a problem and could cause subsequent mount failures.
In any case, something must be going wrong with cleaning up the nsfs entries.
Updated by Venky Shankar 12 days ago
Xiubo Li wrote in #note-51:
Venky Shankar wrote in #note-49:
Is it uncommon that there are two entries in /proc/mounts related to network namespaces?
[...]
This is after a fusermount call for mnt.0 - there are two nsfs entries for the mnt.0 network namespace and the ceph-fuse entry for mnt.0 is gone; however, the process is still running:
[...]
@Xiubo Li Could this be a problem?
As I recall from when I was working on this, two identical entries for the same mount would be a problem and could cause subsequent mount failures.
In any case, something must be going wrong with cleaning up the nsfs entries.
Is there anything logged (anywhere) that could hint at possible stale mounts being left behind?
Also, just to strengthen my point about this issue not being a user-space client bug - umount2(2) returns success as seen via strace:
165426 readlink("/home", 0x7ffecaf586d0, 1023) = -1 EINVAL (Invalid argument) 165426 readlink("/home/ubuntu", 0x7ffecaf586d0, 1023) = -1 EINVAL (Invalid argument) 165426 readlink("/home/ubuntu/cephtest", 0x7ffecaf586d0, 1023) = -1 EINVAL (Invalid argument) 165426 chdir("/") = 0 165426 getuid() = 0 165426 umask(033) = 022 165426 geteuid() = 0 165426 newfstatat(AT_FDCWD, "/etc/mtab", {st_mode=S_IFLNK|0777, st_size=19, ...}, AT_SYMLINK_NOFOLLOW) = 0 165426 getuid() = 0 165426 getuid() = 0 165426 chdir("/home/ubuntu/cephtest") = 0 165426 getcwd("/home/ubuntu/cephtest", 65536) = 22 165426 getuid() = 0 165426 umount2("mnt.0", UMOUNT_NOFOLLOW) = 0 165426 chdir("/") = 0 165426 newfstatat(AT_FDCWD, "/etc/mtab", {st_mode=S_IFLNK|0777, st_size=19, ...}, AT_SYMLINK_NOFOLLOW) = 0 165426 exit_group(0) = ? 165426 +++ exited with 0 +++
If there was an operation pending on the client, then umount2(2) would have blocked IMO.
Updated by Patrick Donnelly 11 days ago
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out? One other thing to note is that in the upgraded_client test case, we remount the client with the new packages. To eliminate that divergence, you should remount the old_client after the mds is upgraded to see if the issue is reproducible.
Updated by Venky Shankar 10 days ago · Edited
Patrick Donnelly wrote in #note-53:
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out?
This is featureful_client:old_client cluster YAML
cat qa/suites/fs/upgrade/featureful_client/old_client/clusters/1-mds-2-client-micro.yaml
roles:
- [mon.a, mon.b, mon.c, mgr.x, mgr.y, mds.a, mds.b, mds.c, osd.0, osd.1, osd.2, osd.3]
- [client.0]
- [client.1]
openstack:
- volumes: # attached to each instance
    count: 4
    size: 10 # GB
It's just one single node hosting everything. I believe this weirdness is somehow related to containerized daemons and network namespaces.
One other thing to note is that in the upgraded_client test case, we remount the client with the new packages. To eliminate that divergence, you should remount the old_client after the mds is upgraded to see if the issue is reproducible.
Yes, that's right for upgraded_client, but featureful_client:old_client does not restart clients and is therefore pretty much similar to mds_upgrade_sequence. I'll give it a try...
Updated by Patrick Donnelly 9 days ago
Venky Shankar wrote in #note-54:
Patrick Donnelly wrote in #note-53:
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out?
This is featureful_client:old_client cluster YAML
[...]
It's just one single node hosting everything.
Everything but the clients? The clients are on different nodes (smithis).
I believe this weirdness is somehow related to containerized daemons and network namespaces.
One other thing to note is that in the upgraded_client test case, we remount the client with the new packages. To eliminate that divergence, you should remount the old_client after the mds is upgraded to see if the issue is reproducible.
Yes, that's right for upgraded_client, but featureful_client:old_client does not restart clients and is therefore pretty much similar to mds_upgrade_sequence. I'll give it a try...
Updated by Venky Shankar 4 days ago
Patrick Donnelly wrote in #note-55:
Venky Shankar wrote in #note-54:
Patrick Donnelly wrote in #note-53:
Venky Shankar wrote in #note-43:
This is a really weird bug and is reproducible even with v16.2.0; however, slightly weirder still is that the umount unresponsiveness isn't seen with fs:upgrade:featureful_client:old_client. I'll explain: fs:upgrade:featureful_client:old_client is somewhat similar to fs:upgrade:mds_upgrade_sequence in the sense that the MDSs are upgraded leaving out the user-space ceph-fuse client (old_client additionally checks if a client is evicted by the MDS when a certain client feature is made mandatory). So, for testing, I patched up old_client (reef branch) to start from pacific (v16.2.15) and then start the upgrade (without any workload) to reef, excluding the bit that makes a certain client feature mandatory:
I think perhaps the major difference here is that the node running the "old client" is not upgraded at all (obviously), so the packages and binaries are untouched. Perhaps this is related to the binary being swapped out?
This is featureful_client:old_client cluster YAML
[...]
It's just one single node hosting everything.
Everything but the clients? The clients are on different nodes (smithis).
You are correct - client.0 and client.1 are on separate lines in the yaml (I missed that this yaml does not use the host.<> specifier). Also, old_client is not a containerized deployment.
I'll run a test for mds_upgrade_sequence with the clients on an entirely separate node.