Why does my Linux server hang on "ps aux"?

I have a server that runs scientific computing applications. It runs SLES 11 SP3.

Here are some of my observations:

  1. I can log in to the server over SSH.
  2. However, for some reason, running the `ps aux` command hangs.
  3. The `top` command, on the other hand, works fine. Its output looks like this:

    top - 21:02:49 up 403 days,  5:36,  5 users,  load average: 21.01, 20.31, 18.79
    Tasks: 271 total,   6 running, 241 sleeping,  24 stopped,   0 zombie
    Cpu(s):  0.0%us,  6.3%sy,  0.0%ni, 49.9%id, 43.8%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:    258428M total,   156884M used,   101543M free,        0M buffers
    Swap:     7999M total,     2588M used,     5411M free,   151700M cached
    
       PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND                                                                           
     39993 root      20   0  9064 1276  824 R      0  0.0   0:00.43 top                                                                                
         1 root      20   0 10548   68   36 S      0  0.0   5:24.51 init                                                                               
         2 root      20   0     0    0    0 S      0  0.0   0:06.20 kthreadd
         3 root      20   0     0    0    0 S      0  0.0   5:17.22 ksoftirqd/0
         6 root      RT   0     0    0    0 S      0  0.0  51:15.16 migration/0
         8 root      RT   0     0    0    0 S      0  0.0   4:39.14 migration/1
        10 root      20   0     0    0    0 S      0  0.0   1:30.39 ksoftirqd/1
        13 root      RT   0     0    0    0 S      0  0.0   1:34.02 migration/2
        15 root      20   0     0    0    0 S      0  0.0   0:16.03 ksoftirqd/2
        17 root      RT   0     0    0    0 S      0  0.0   1:19.64 migration/3
        19 root      20   0     0    0    0 S      0  0.0   0:13.99 ksoftirqd/3
        21 root      RT   0     0    0    0 S      0  0.0   1:44.44 migration/4
        23 root      20   0     0    0    0 S      0  0.0   0:18.40 ksoftirqd/4
        25 root      RT   0     0    0    0 S      0  0.0   1:42.13 migration/5
        26 root      20   0     0    0    0 S      0  0.0  14:56.25 kworker/5:0
        27 root      20   0     0    0    0 S      0  0.0   0:18.18 ksoftirqd/5
        29 root      RT   0     0    0    0 S      0  0.0   1:43.97 migration/6
        30 root      20   0     0    0    0 S      0  0.0  12:30.00 kworker/6:0
        31 root      20   0     0    0    0 S      0  0.0   0:15.76 ksoftirqd/6
        33 root      RT   0     0    0    0 S      0  0.0   1:41.60 migration/7
        35 root      20   0     0    0    0 S      0  0.0   0:12.94 ksoftirqd/7
        37 root      RT   0     0    0    0 S      0  0.0   5:13.03 migration/8
        39 root      20   0     0    0    0 S      0  0.0   1:05.10 ksoftirqd/8
        41 root      RT   0     0    0    0 R      0  0.0   3:35.18 migration/9
        43 root      20   0     0    0    0 S      0  0.0   0:45.77 ksoftirqd/9
        44 root      RT   0     0    0    0 R      0  0.0   2:21.35 watchdog/9
        45 root      RT   0     0    0    0 S      0  0.0   3:14.10 migration/10
        46 root      20   0     0    0    0 S      0  0.0  25:52.76 kworker/10:0
        47 root      20   0     0    0    0 S      0  0.0   0:29.33 ksoftirqd/10
        48 root      RT   0     0    0    0 S      0  0.0   2:11.92 watchdog/10
        49 root      RT   0     0    0    0 S      0  0.0   3:03.78 migration/11
        51 root      20   0     0    0    0 S      0  0.0   0:29.36 ksoftirqd/11
        52 root      RT   0     0    0    0 S      0  0.0   2:09.54 watchdog/11
        53 root      RT   0     0    0    0 S      0  0.0   3:13.56 migration/12
    
  4. Additional details about the server:

    # cat /etc/SuSE-release 
    SUSE Linux Enterprise Server 11 (x86_64)
    VERSION = 11
    PATCHLEVEL = 3
    
    # uname -a
    Linux n049 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux
    
  5. I tried writing to procfs as follows:

    echo 0 > /proc/sys/kernel/nmi_watchdog
    

    This also hangs with no response.

I would like to know what the possible causes of such a problem are.
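One way to narrow this down without `ps` is sketched below. It rests on two assumptions that are not confirmed in this post: that `ps aux` stalls while reading a per-process file such as `/proc/<pid>/cmdline`, and that `/proc/<pid>/stat`, which `top` seems able to read, still responds. The helper names `parse_stat` and `tasks_in_state` are hypothetical, and Python 3 is assumed. The sketch lists processes in uninterruptible sleep (state `D`), which count toward the load average and often point to blocked I/O.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat record into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' instead of on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def tasks_in_state(wanted="D", proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes whose state is `wanted`.

    Only <proc_root>/<pid>/stat is opened; cmdline is deliberately avoided
    (assumption: that is the file ps blocks on).
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "sys" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited, or the record was not parseable
        if state == wanted:
            found.append((pid, comm))
    return sorted(found)
```

Calling `tasks_in_state("D")` on the affected node would give a list of candidate processes to inspect further; an empty list would suggest the assumption about blocked I/O does not hold.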

Edit: Following @Mat's suggestion, I have also posted the output of the following commands.

iostat

Linux 3.0.76-0.11-default (n049)    03/01/2016  _x86_64_

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          79.98    0.00    0.35    0.25    0.00   19.41

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.46        38.15       845.71 1329385997 29467274973
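When `iostat` is run without an interval, its report covers the time since boot, so the 0.25% `%iowait` above is an average over the whole uptime and may not reflect the current state. The sketch below shows one way to measure iowait over a short window by sampling `/proc/stat` twice. The function names are hypothetical, Python 3 is assumed, and the 5-second default interval is an arbitrary choice.

```python
import time


def cpu_fields(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat as a list of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(v) for v in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Only these first eight are summed; later guest counters are skipped
    # because guest time is already included in user time.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart, and compare them."""
    with open(path) as fh:
        before = cpu_fields(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = cpu_fields(fh.read())
    return iowait_percent(before, after)
```

Running `iostat -x 5` (an interval in seconds) would give a comparable per-interval view, including per-device figures.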

nfsstat

Client rpc stats:
calls      retrans    authrefrsh
544779536   157        544901831

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 296447353 54% 17723933  3% 6435727   1% 44311228  8% 204       0% 
read         write        create       mkdir        symlink      mknod        
47603223  8% 122630268 22% 750048    0% 45661     0% 4         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
509509    0% 42896     0% 5249      0% 198       0% 0         0% 856848    0% 
fsstat       fsinfo       pathconf     commit       
15677     0% 22716     0% 11358     0% 7380246   1% 

panfs_stat /project/

opstats timestamp: 1456830706:536977000

PanFS Client Exported Opstats
    callback__breaks        806486
    callback__break_all     4
    callback__cancels       8389
    ioctl__getattr          0
    ioctl__setattr          0
    op__device_create       30
    op__dir_create          72765
    op__dir_delete          3389
    op__dir_fmlookup        2913280
    op__dir_lookup          4415755
    op__dir_fmreaddir       1457842
    op__file_create         1423833
    op__file_link_create    0
    op__file_delete         867145
    op__file_rename         59860
    op__file_silly_rename   4453
    op__getattr_total       36999003848
    op__ioctl               356581655
    op__llapi_sync          120384107
    op__read                552594596
    op__read__total_bytes   14363262382940
    op__sync                501053283
    op__sync__total_bytes   7978169577178
    op__symlink_create      0
    op__symlink_follow      3125353
    op__symlink_read        3550281
    op__setattr             65362728
    op__write               30887557434
    op__write__total_bytes  15818900301081
    op__write_retried       0
    op__writepage           411
PanFS Syscall Opstats
    close       suc 262891996 / unsuc   396 / started 262892393  longest 37:592322933 / 0:000578105  avg 0:000151250 / 0:000061983
    create      suc 1423829 / unsuc     5 / started 1423834  longest 185:525407516 / 0:007318163  avg 0:009095410 / 0:002275271
    fsync       suc   425 / unsuc     0 / started   425  longest 0:961370596 / 0:000000000  avg 0:070172226 / 0:000000000
    getattr     suc 491151434 / unsuc     1 / started 491151435  longest 48:897901802 / 0:039782489  avg 0:000001457 / 0:039782489
    getxattr    suc  6516 / unsuc  9640 / started 16156  longest 0:000270379 / 0:000034562  avg 0:000006670 / 0:000000955
    ioctl       suc     1 / unsuc 356582148 / started 356582150  longest 0:000008650 / 0:096541018  avg 0:000008650 / 0:000000963
    link        suc     0 / unsuc     0 / started     0  longest 0:000000000 / 0:000000000  avg 0:000000000 / 0:000000000
    llseek      suc 741005516 / unsuc 1021903 / started 742027419  longest 0:279553783 / 0:000021232  avg 0:000000451 / 0:000000282
    lock        suc  1452 / unsuc     0 / started  1452  longest 0:016063605 / 0:000000000  avg 0:001258309 / 0:000000000
    lookup      suc 4415772 / unsuc     0 / started 4415772  longest 139:467332687 / 0:000000000  avg 0:010307139 / 0:000000000
    mkdir       suc 72765 / unsuc     0 / started 72765  longest 1:101200148 / 0:000000000  avg 0:003296509 / 0:000000000
    mknod       suc    30 / unsuc     0 / started    30  longest 0:159658010 / 0:000000000  avg 0:021141579 / 0:000000000
    mmap        suc 10953869 / unsuc     0 / started 10953869  longest 0:011133309 / 0:000000000  avg 0:000000688 / 0:000000000
    open        suc 262892479 / unsuc    16 / started 262892495  longest 244:019141374 / 0:171940090  avg 0:000008309 / 0:079117046
    permission  suc 4479767553 / unsuc 1975542 / started 4481743095  longest 148:505477264 / 2:209368229  avg 0:000001927 / 0:000035222
    put_super   suc  5464 / unsuc     0 / started  5464  longest 0:001834006 / 0:000000000  avg 0:000040821 / 0:000000000
    read        suc 552597080 / unsuc     1 / started 552597081  longest 193:103296270 / 0:012257444  avg 0:000047965 / 0:012257444
    readdir     suc 1664834 / unsuc    10 / started 1664844  longest 3:179359832 / 1:464181024  avg 0:003504540 / 0:155413678
    rename      suc 55407 / unsuc     0 / started 55407  longest 20:462678110 / 0:000000000  avg 0:008082163 / 0:000000000
    rmdir       suc  3153 / unsuc   236 / started  3389  longest 0:433589477 / 0:000464708  avg 0:002029711 / 0:000313255
    setattr     suc 65373798 / unsuc     0 / started 65373798  longest 243:868835310 / 0:000000000  avg 0:004721321 / 0:000000000
    setxattr    suc     0 / unsuc     0 / started     0  longest 0:000000000 / 0:000000000  avg 0:000000000 / 0:000000000
    statfs      suc   214 / unsuc     0 / started   214  longest 0:012208633 / 0:000000000  avg 0:000528460 / 0:000000000
    symlink     suc     0 / unsuc     0 / started     0  longest 0:000000000 / 0:000000000  avg 0:000000000 / 0:000000000
    unlink      suc 870893 / unsuc     0 / started 870893  longest 14:820432065 / 0:000000000  avg 0:001506087 / 0:000000000
    vfs_admit   suc     0 / unsuc     0 / started     0  longest 0:000000000 / 0:000000000  avg 0:000000000 / 0:000000000
    write       suc 30888837091 / unsuc     0 / started 30888838130  longest 1041:545276662 / 0:000000000  avg 0:000010204 / 0:000000000
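The syscall table above is easier to scan once it is reduced to two numbers per operation. The sketch below is a hypothetical helper, not part of any PanFS tooling. It assumes that the `longest` values are formatted as `seconds:fraction-of-a-second`, and that `started - suc - unsuc` approximates calls that have not returned yet; neither assumption is confirmed by the output itself, and the counters may simply have been sampled at slightly different moments.

```python
import re

# Matches lines such as:
#   open  suc 262892479 / unsuc 16 / started 262892495  longest 244:019141374 / ...
LINE = re.compile(
    r"^\s*(?P<op>\w+)\s+suc\s+(?P<suc>\d+)\s*/\s*unsuc\s+(?P<unsuc>\d+)\s*/\s*"
    r"started\s+(?P<started>\d+)\s+longest\s+(?P<sec>\d+):(?P<frac>\d+)"
)


def parse_opstats(text):
    """Return one dict per syscall line: op, pending, longest_s."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # headers and the "Exported Opstats" counters
        suc = int(m.group("suc"))
        unsuc = int(m.group("unsuc"))
        started = int(m.group("started"))
        frac = m.group("frac")
        # Assumption: the part after ':' is a decimal fraction of a second.
        longest = int(m.group("sec")) + int(frac) / 10 ** len(frac)
        rows.append({
            "op": m.group("op"),
            "pending": started - suc - unsuc,
            "longest_s": longest,
        })
    return rows


def slowest(rows, n=5):
    """The n operations with the largest 'longest' successful latency."""
    return sorted(rows, key=lambda r: r["longest_s"], reverse=True)[:n]
```

Feeding the saved output to `parse_opstats` and then `slowest` ranks the operations by worst-case latency, and the `pending` field highlights operations whose started count exceeds the completed count.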

sar -n DEV

Average:           lo      0.03      0.03      0.00      0.00      0.00      0.00      0.00
Average:         eth0      1.34      0.53      0.16      0.10      0.00      0.00      0.01
Average:         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          ib0      1.05      5.14      0.15      0.81      0.00      0.00      0.00
