linux

1. vmlinuz

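vmlinuz is the bootable, compressed kernel. "vm" stands for "Virtual Memory": Linux supports virtual memory, unlike older operating systems such as DOS, which were limited to 640KB of memory. Linux can use disk space as virtual memory, hence the "vm". vmlinuz is the executable Linux kernel; it is located at /boot/vmlinuz, which is usually a symbolic link. vmlinux is the uncompressed kernel, and vmlinuz is its compressed form.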

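vmlinuz can be built in two ways. The first is to run "make zImage" when compiling the kernel and then install the result with "cp /usr/src/linux-2.4/arch/i386/boot/zImage /boot/vmlinuz". zImage is suitable for small kernels and exists for backward compatibility. The second is to run "make bzImage" and then install the result with "cp /usr/src/linux-2.4/arch/i386/boot/bzImage /boot/vmlinuz". bzImage is a compressed kernel image. Note that bzImage is not compressed with bzip2; the "bz" is easy to misread, but it stands for "big zImage".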

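Both zImage (vmlinuz) and bzImage (vmlinuz) are compressed with gzip. They are not plain compressed files: gzip decompression code is embedded at the start of each, so you cannot unpack vmlinuz directly with gunzip or gzip -dc. The kernel file contains a tiny gzip implementation that decompresses the kernel and boots it. The difference between the two is that the old zImage decompresses the kernel into low memory (the first 640K), while bzImage decompresses it into high memory (above 1M). A small kernel can use either zImage or bzImage, and the running system is the same either way. A large kernel must use bzImage; zImage will not work.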
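Because the gzip stream sits behind a decompression stub, one way to get at it is to search the image for the gzip magic bytes and try to decompress from each candidate offset. The sketch below illustrates that idea only; the function name is hypothetical, and it assumes a gzip-compressed image as described above (the kernel tree ships a scripts/extract-vmlinux helper for real use).

```python
import zlib
from typing import Optional

# gzip streams start with 0x1f 0x8b, followed by 0x08 for the deflate method.
GZIP_MAGIC = b"\x1f\x8b\x08"


def extract_gzip_payload(image: bytes) -> Optional[bytes]:
    """Return the first complete gzip stream embedded in `image`, or None.

    Hypothetical helper: it scans for the gzip magic and accepts the first
    offset from which a whole stream decompresses cleanly.
    """
    offset = image.find(GZIP_MAGIC)
    while offset != -1:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=31)
        try:
            data = decompressor.decompress(image[offset:])
            if decompressor.eof:  # the stream ended properly, not a false hit
                return data
        except zlib.error:
            pass  # the magic bytes appeared by coincidence; keep searching
        offset = image.find(GZIP_MAGIC, offset + 1)
    return None
```

On a real system this would be called as extract_gzip_payload(open("/boot/vmlinuz", "rb").read()); it returns None when the image does not contain a gzip stream.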

2. linux io/storage stack

[Figures: two diagrams of the Linux I/O and storage stack (Pasted-Image-20231225104657.png, Pasted-Image-20231225104838.png)]

3. program exit code

First, consider the following Java program:

/* coding:utf-8
 * Copyright (C) dirlt
 */

public class X{
  public static void main(String[] args) {
    System.exit(1);
  }
}

This Java program is then run from Python, which prints the value returned by os.system:

#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

import os
print(os.system('java X'))

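The value printed is 256, not 1. On Unix, os.system returns the status in the encoding described in the Python documentation for os.wait: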

a 16-bit number, whose low byte is the signal number that killed the process, and whose high byte is the exit status (if the signal number is zero); the high bit of the low byte is set if a core file was produced.

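However, for the Python program below, echo $? reports an exit status of 0 rather than 256. The exit status a process passes to the OS is truncated to its low 8 bits, and 256 & 0xFF is 0.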

#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

code=256
exit(code)
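Both observations follow from the same bit layout, which can be spelled out in a few lines. This is a minimal sketch of the encoding quoted above; the function names are my own, and in real code os.WEXITSTATUS (or os.waitstatus_to_exitcode on Python 3.9+) does the decoding.

```python
from typing import Optional, Tuple


def decode_wait_status(status: int) -> Tuple[Optional[int], int, bool]:
    """Split a 16-bit wait status into (exit_code, signal_number, core_dumped).

    exit_code is None when the process was killed by a signal.
    """
    signal_number = status & 0x7F        # low 7 bits: the signal that killed it
    core_dumped = bool(status & 0x80)    # high bit of the low byte
    exit_code = (status >> 8) & 0xFF if signal_number == 0 else None
    return exit_code, signal_number, core_dumped


def shell_visible_exit_code(code: int) -> int:
    """Only the low 8 bits of an exit code reach the parent process."""
    return code & 0xFF


print(decode_wait_status(256))       # the value os.system returned for 'java X'
print(shell_visible_exit_code(256))  # what echo $? shows after exit(256)
```

So os.system('java X') >> 8 recovers the 1 passed to System.exit, while exit(256) cannot be told apart from exit(0).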

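4. dp8 NIC problem

At the time, dp8's network traffic dropped from a very high level to a very low one. Checking /proc/net/netstat, the following counters on dp8 differed widely from those on other machines (by 1-2 orders of magnitude):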

  • TCPDirectCopyFromPrequeue
  • TCPHPHitsToUser
  • TCPDSACKUndo
  • TCPLossUndo
  • TCPLostRetransmit
  • TCPFastRetrans
  • TCPSlowStartRetrans
  • TCPSackShiftFallback

dmesg then gave the following clue:

dp@dp8:~$ dmesg | grep eth0
[ 15.635160] eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express f
[ 15.736389] bnx2: eth0: using MSIX
[ 15.738263] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 37.848755] bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex
[ 37.850623] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1933.934668] bnx2: eth0: using MSIX
[ 1933.936960] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 1956.130773] bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex
[ 1956.132625] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[4804526.542976] bnx2: eth0 NIC Copper Link is Down
[4804552.008858] bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex

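The log line [4804552.008858] bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex shows that the NIC on dp8 came up at only 100 Mbps.

Possible causes:

  • The cable or RJ45 connectors are poor quality or aged, or the connectors were crimped badly, causing poor contact or a short circuit. Re-crimp the connectors or replace the cable, preferably with reliable Cat 6 cable and Cat 6 connectors
  • Local Area Connection > right-click > Properties > Configure > Advanced > Speed & Duplex is set incorrectly; change it to auto-negotiation or 1000 Mbps full duplex (these are Windows steps; on Linux the negotiated speed and duplex can be inspected with ethtool eth0)
  • The switch, router, or other device the NIC is plugged into is faulty, or is itself a 100 Mbps device (a gigabit port connected to a 100 Mbps port falls back to 100 Mbps)
  • Electromagnetic interference can sometimes also drop the link to 100 Mbps, so avoid running network cable through the same conduit as power cable (kindly contributed by forum member tchack)

Our cables are all supplied by 世xx联, so their quality should be fine. Two cases should be ruled out first:

  • A cable problem (test: try a different cable)
  • The switch port dp8 is connected to is broken (test: move dp8's cable to another port on the switch)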
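The dmesg check above is easy to automate across machines. The sketch below parses the bnx2 "Link is Up" lines shown earlier and flags links that came up below an expected speed; the function names and the 1000 Mbps default are assumptions of mine, and the regex only covers the bnx2 message format seen in this log.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# [ 37.848755] bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex
LINK_UP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bnx2: (?P<dev>\w+) NIC Copper Link is Up, "
    r"(?P<speed>\d+) Mbps (?P<duplex>full|half) duplex"
)

LinkEvent = Tuple[float, str, int, str]  # (timestamp, device, Mbps, duplex)


def link_up_events(dmesg_text: str) -> List[LinkEvent]:
    """Extract every 'Link is Up' event from dmesg output."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINK_UP.search(line)
        if m:
            events.append(
                (float(m["ts"]), m["dev"], int(m["speed"]), m["duplex"])
            )
    return events


def below_expected(events: List[LinkEvent], expected_mbps: int = 1000) -> List[LinkEvent]:
    """Keep only the events whose negotiated speed is under expected_mbps."""
    return [event for event in events if event[2] < expected_mbps]
```

Feeding it the output of dmesg from dp8 would report all three link-up events, since each negotiated 100 Mbps.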

5. Changing resource limits

Limits can be changed temporarily with ulimit, or permanently by editing /etc/security/limits.conf:

hadoop - nofile 102400
hadoop - nproc 40960
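A process can also adjust its own limits at run time, which is what ulimit does for the shell. The sketch below uses Python's standard resource module to raise the soft open-file limit as far as the hard limit allows; the function name is hypothetical and the target of 102400 simply mirrors the limits.conf line above.

```python
import resource
from typing import Tuple


def raise_soft_nofile(target: int) -> Tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE towards `target`; return (soft, hard).

    An unprivileged process may only raise its soft limit up to the hard
    limit, so the target is clamped to the hard limit first.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard  # already high enough, nothing to do
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_soft_nofile(102400))
```

Like ulimit, this only affects the calling process and its children; the limits.conf entries are what make the change stick for new login sessions.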

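6. CPU temperature too high

I ran into this problem on an Ubuntu PC; the obvious symptom was that everything ran slower. The following entries then showed up in syslog: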

May  2 18:24:21 umeng-ubuntu-pc kernel: [ 1188.717609] CPU1: Core temperature/speed normal
May  2 18:24:21 umeng-ubuntu-pc kernel: [ 1188.717612] CPU0: Package temperature above threshold, cpu clock throttled (total events = 137902)
May  2 18:24:21 umeng-ubuntu-pc kernel: [ 1188.717615] CPU2: Package temperature above threshold, cpu clock throttled (total events = 137902)
May  2 18:24:21 umeng-ubuntu-pc kernel: [ 1188.717619] CPU1: Package temperature above threshold, cpu clock throttled (total events = 137902)
May  2 18:24:21 umeng-ubuntu-pc kernel: [ 1188.717622] CPU3: Package temperature above threshold, cpu clock throttled (total events = 137902)
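To see how often throttling happens, the "total events" counters can be pulled out of syslog. This is a small sketch that assumes the exact message format shown above; the function name is mine.

```python
import re
from typing import Dict

# Matches e.g. "CPU0: Package temperature above threshold,
# cpu clock throttled (total events = 137902)"
THROTTLE = re.compile(
    r"(?P<cpu>CPU\d+): (?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def throttle_events(syslog_text: str) -> Dict[str, int]:
    """Return the highest 'total events' value seen per CPU."""
    totals: Dict[str, int] = {}
    for line in syslog_text.splitlines():
        m = THROTTLE.search(line)
        if m:
            cpu = m["cpu"]
            totals[cpu] = max(totals.get(cpu, 0), int(m["events"]))
    return totals
```

Comparing the totals from two points in time gives the number of throttle events in between.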

7. sync hangup

8. upgrade glibc

linux - How to recover after deleting the symbolic link libc.so.6? - Stack Overflow : http://stackoverflow.com/questions/12249547/how-to-recover-after-deleting-the-symbolic-link-libc-so-6

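@2013-05-23 https://docs.google.com/a/umeng.com/document/d/12dzJ3OhVlrEax3yIdz0k08F8tM8DDQva1wdrD3K49PI/edit We suspected a problem with the glibc version; I tried to swap it on dp45 but ran into trouble.

My planned sequence of operations was:

  1. Copy dp20's glibc into my own directory as /home/dp/dirlt/libc-2.11.so
  2. Back up dp45's glibc: mv /lib64/libc-2.12.so /lib64/libc-2.12.bak.so (note that /lib64 also contains the symlink libc.so.6 -> libc-2.12.so, which is presumably the name programs actually look up)
  3. cp /home/dp/dirlt/libc-2.11.so /lib64/libc-2.12.so

But after step 2, cp was no longer usable, and neither were ls and other commands. The reason is simple: after step 2, libc.so.6 no longer pointed at an existing file, and basic commands such as cp and ls depend on this shared library.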

~ $ ldd /bin/cp
	linux-vdso.so.1 =>  (0x00007fff9717f000)
	libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f5efb804000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f5efb5fc000)
	libacl.so.1 => /lib/x86_64-linux-gnu/libacl.so.1 (0x00007f5efb3f3000)
	libattr.so.1 => /lib/x86_64-linux-gnu/libattr.so.1 (0x00007f5efb1ee000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5efae2f000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f5efac2a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f5efba2d000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5efaa0d000)
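The failure came from the window between steps 2 and 3, during which the name behind libc.so.6 did not exist. A common way to avoid such a window is to copy the new file next to the target and then rename it over the target, since rename(2) replaces the destination atomically. The sketch below shows that pattern on ordinary files; the function name is hypothetical, and it is an illustration of the technique, not a tested procedure for swapping a live system libc.

```python
import os
import shutil
import tempfile


def replace_atomically(new_file: str, target: str) -> None:
    """Replace `target` with a copy of `new_file` without a missing-file window.

    The copy is written to a temporary name in the same directory (so the
    rename stays on one filesystem), then renamed over the target.
    """
    target_dir = os.path.dirname(os.path.abspath(target))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".swap-")
    os.close(fd)
    try:
        shutil.copy2(new_file, tmp_path)  # copies content and permission bits
        os.replace(tmp_path, target)      # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach the backup would be taken with a copy (cp) rather than a move, so the original name keeps resolving throughout.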

@2013-08-03

A copy of the C library was found in an unexpected directory | Blog : http://blog.i-al.net/2013/03/a-copy-of-the-c-library-was-found-in-an-unexpected-directory/

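The link above describes a way to upgrade glibc:

  • sudo su - root # first switch to the root account
  • mv libc.so librt.so /root # move the glibc-related .so files into root's home directory; be careful not to move the symlinks
  • LD_PRELOAD=/root/libc.so:/root/librt.so bash # at this point bash can no longer find glibc and the other .so files on its own, so LD_PRELOAD is used to load them up front
  • apt-get install # use apt-get inside this bash to install and upgrade glibc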

9. Allowing sudo without a tty

Edit /etc/sudoers and comment out the following line:

Defaults requiretty

10. ssh proxy

http://serverfault.com/questions/37629/how-do-i-do-multihop-scp-transfers

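  • The destination machine is D, port 16021, user x
  • The jump host is T, port 18021, user y
  • The client needs key-based trust set up with both x@D and y@T
  • Method A
    • Open a connection from T to D and configure a forwarding port p, so that all traffic to T:p is forwarded to D:16021
    • On T, run ssh -L "*:5502:D:16021" x@D # the forwarding port is 5502
      • -o ServerAliveInterval=60 # the interval is in seconds. Every 60s the client exchanges a keepalive message with the server, so the connection is not dropped when no data flows for a long time.
    • From the client: ssh -p 5502 x@T or scp -P 5502 <file> x@T:<path-at-D>
  • Method B
    • scp accepts a ProxyCommand option, which can be combined with the nc command run on T
    • scp -o ProxyCommand="ssh -p 18021 y@T 'nc D 16021'" <file> x@D:<path-at-D>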
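When Method B is scripted, the nested quoting is the error-prone part. The sketch below builds the same scp argument list in Python so that no shell quoting is needed for the outer command; the function name is hypothetical and the defaults are just the D/T/x/y placeholders from the list above.

```python
import shlex
from typing import List


def scp_via_jump(
    src: str,
    dest_path: str,
    dest_user: str = "x",
    dest_host: str = "D",
    dest_port: int = 16021,
    jump_user: str = "y",
    jump_host: str = "T",
    jump_port: int = 18021,
) -> List[str]:
    """Build the argv for an scp that reaches dest_host through jump_host.

    The ProxyCommand logs in to the jump host and runs nc there, which
    relays the connection to the destination's sshd port.
    """
    relay = shlex.quote("nc {} {}".format(dest_host, dest_port))
    proxy = "ssh -p {} {}@{} {}".format(jump_port, jump_user, jump_host, relay)
    return [
        "scp",
        "-o", "ProxyCommand=" + proxy,
        src,
        "{}@{}:{}".format(dest_user, dest_host, dest_path),
    ]


if __name__ == "__main__":
    print(scp_via_jump("data.tar.gz", "/tmp/data.tar.gz"))
```

The resulting list can be passed straight to subprocess.run(cmd, check=True); passing a list rather than a string keeps the ProxyCommand value intact as a single argument.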

UPDATE @ 2016-08-26: This method also turns out to solve the problem of reaching a remote ipython notebook.

  • First start ipython notebook on the target machine dev: `jupyter notebook --no-browser --port=8888`
  • Then pick a local port to bind on your own machine, for example 10000: `ssh -L "*:10000:dev:8888" dev`

After that, the remote notebook is reachable locally at `http://localhost:10000`.

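11. Changing the maximum number of open file handles

First raise the system-wide limits. These can be set in /etc/sysctl.conf (as fs.file-max and fs.nr_open) and applied with sysctl -p.

  • /proc/sys/fs/file-max # upper limit on open file handles across all processes
  • /proc/sys/fs/nr_open # upper limit on open file handles for a single process
  • /proc/sys/fs/file-nr # number of file handles currently open system-wide (read-only)

Then raise the per-user (per-process) limit: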

  • /etc/security/limits.conf
  • ulimit
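Before changing anything it helps to know how close the system is to its limit. /proc/sys/fs/file-nr holds three numbers: allocated handles, allocated-but-unused handles, and the maximum (the same value as file-max). A minimal sketch for reading it follows; the class and function names are my own.

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated
    unused: int     # allocated handles that are not in use
    maximum: int    # system-wide limit, same value as fs.file-max


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated fields of file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileNr(allocated, unused, maximum)


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read and parse file-nr from procfs (Linux only)."""
    with open(path) as f:
        return parse_file_nr(f.read())
```

For example, usage = read_file_nr() followed by usage.allocated / usage.maximum gives the fraction of the system-wide limit in use.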

12. apt-get hang

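While using apt-get on Ubuntu, something may go wrong and we kill apt-get outright. apt-get is then left in an abnormal state and cannot be started again afterwards. Looking at the process list shows some suspicious processes: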

dp@dp1:~$ ps aux | grep "apt"
root      3587  0.0  0.0  36148 22800 ?        Ds   Oct08   0:00 /usr/bin/dpkg --status-fd 50 --unpack --auto-deconfigure /var/cache/apt/archives/sgml-data_2.0.4_all.deb
root      9579  0.0  0.0  35992 22744 ?        Ds   Oct19   0:00 /usr/bin/dpkg --status-fd 50 --unpack --auto-deconfigure /var/cache/apt/archives/iftop_0.17-16_amd64.deb
root     25957  0.0  0.0  36120 22796 ?        Ds   Nov05   0:00 /usr/bin/dpkg --status-fd 50 --unpack --auto-deconfigure /var/cache/apt/archives/iftop_0.17-16_amd64.deb /var/cache/apt/archives/iotop_0.4-1_all.deb
dp       30586  0.0  0.0   7628  1020 pts/2    S+   08:59   0:00 grep --color=auto apt

The parent of each of these processes is init, and they are all in uninterruptible sleep; even kill -9 cannot terminate them, so the only fix is to reboot the machine. See the answers on Stack Overflow: How to stop 'uninterruptible' process on Linux? - Stack Overflow http://stackoverflow.com/questions/767551/how-to-stop-uninterruptible-process-on-linux

  • Simple answer: you cannot. Longer answer: the uninterruptable sleep means the process will not be woken up by signals. It can be only woken up by what it's waiting for. When I get such situations eg. with CD-ROM, I usually reset the computer by using suspend-to-disk and resuming.
  • The D state basically means that the process is waiting for disk I/O, or other block I/O that can't be interrupted. Sometimes this means the kernel or device is feverishly trying to read a bad block (especially from an optical disk). Sometimes it means there's something else. The process cannot be killed until it gets out of the D state. Find out what it is waiting for and fix that. The easy way is to reboot. Sometimes removing the disk in question helps, but that can be rather dangerous: unfixable catastrophic hardware failure if you don't know what you're doing (read: smoke coming out).
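Spotting such stuck processes amounts to filtering on the STAT column of ps aux. The sketch below does this on captured ps output, assuming the standard 11-column ps aux layout shown above; the function name is hypothetical.

```python
from typing import List, Tuple


def uninterruptible(ps_aux_output: str) -> List[Tuple[int, str, str]]:
    """Return (pid, stat, command) for processes whose state starts with D.

    Columns of `ps aux`: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    stuck = []
    for line in ps_aux_output.splitlines():
        fields = line.split(None, 10)  # keep the command, spaces included
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something unparseable
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        if stat.startswith("D"):
            stuck.append((pid, stat, command))
    return stuck
```

It can be fed with subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout. A process that stays in the list across repeated samples is a candidate for the kind of hang described here, since healthy processes pass through the D state only briefly.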

13. syslog on cpu

13.1. Core power limit notification

May 12 12:29:12 dp57 kernel: CPU1: Core power limit notification (total events = 42322)
May 12 12:29:12 dp57 kernel: CPU17: Core power limit notification (total events = 42321)
May 12 12:29:12 dp57 kernel: CPU5: Core power limit notification (total events = 42328)
May 12 12:29:12 dp57 kernel: CPU21: Core power limit notification (total events = 42327)
May 12 12:29:12 dp57 kernel: CPU19: Core power limit notification (total events = 42327)
May 12 12:29:12 dp57 kernel: CPU3: Core power limit notification (total events = 42327)
May 12 12:29:12 dp57 kernel: CPU7: Core power limit notification (total events = 42323)
May 12 12:29:12 dp57 kernel: CPU23: Core power limit notification (total events = 42322)
May 12 12:29:12 dp57 kernel: CPU25: Core power limit notification (total events = 42226)
May 12 12:29:12 dp57 kernel: CPU9: Core power limit notification (total events = 42222)
May 12 12:29:12 dp57 kernel: CPU11: Core power limit notification (total events = 42222)
May 12 12:29:12 dp57 kernel: CPU27: Core power limit notification (total events = 42219)
May 12 12:29:12 dp57 kernel: CPU13: Core power limit notification (total events = 42321)
May 12 12:29:12 dp57 kernel: CPU29: Core power limit notification (total events = 42307)
May 12 12:29:12 dp57 kernel: CPU15: Core power limit notification (total events = 42556)
May 12 12:29:12 dp57 kernel: CPU31: Core power limit notification (total events = 42550)

13.2. Package power limit notification

May 12 12:29:12 dp57 kernel: CPU17: Package power limit notification (total events = 42377)
May 12 12:29:12 dp57 kernel: CPU5: Package power limit notification (total events = 42612)
May 12 12:29:12 dp57 kernel: CPU21: Package power limit notification (total events = 42615)
May 12 12:29:12 dp57 kernel: CPU19: Package power limit notification (total events = 42553)
May 12 12:29:12 dp57 kernel: CPU3: Package power limit notification (total events = 42543)
May 12 12:29:12 dp57 kernel: CPU7: Package power limit notification (total events = 42661)
May 12 12:29:12 dp57 kernel: CPU23: Package power limit notification (total events = 42667)
May 12 12:29:12 dp57 kernel: CPU25: Package power limit notification (total events = 42707)
May 12 12:29:12 dp57 kernel: CPU9: Package power limit notification (total events = 42706)
May 12 12:29:12 dp57 kernel: CPU11: Package power limit notification (total events = 42705)
May 12 12:29:12 dp57 kernel: CPU27: Package power limit notification (total events = 42731)
May 12 12:29:12 dp57 kernel: CPU13: Package power limit notification (total events = 42619)
May 12 12:29:12 dp57 kernel: CPU29: Package power limit notification (total events = 42627)
May 12 12:29:12 dp57 kernel: CPU15: Package power limit notification (total events = 42623)
May 12 12:29:12 dp57 kernel: CPU31: Package power limit notification (total events = 42644)
May 12 12:29:12 dp57 kernel: CPU1: Package power limit notification (total events = 42360)

13.3. below trip temperature. Throttling disabled

May 12 12:29:40 dp57 mcelog: Processor 17 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 5 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 21 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 19 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 3 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 7 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 23 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 25 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 9 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 11 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 27 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 13 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 29 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 15 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 17 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 31 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 5 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 21 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 19 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 3 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 7 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 23 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 25 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 9 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 11 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 27 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 13 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 29 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 15 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 31 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 1 below trip temperature. Throttling disabled
May 12 12:29:40 dp57 mcelog: Processor 1 below trip temperature. Throttling disabled

14. ssh access denied

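Usually, access denied means the public key is missing from ~/.ssh/authorized_keys, but there are other causes, such as directory permissions. Once the public key has been ruled out, how do you find the cause? If you still have a session open on the remote server, you can start a second sshd on another port of that server in debug mode and watch its error log. (The method comes from a forum post.)

Here is an experiment. First I loosen the permissions on the .ssh directory on tinycache: `chmod og+rwx .ssh`

Connecting to the tinycache server now fails with the following error: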

[ec2-user@rel0 ~]$ ssh tinycache
Permission denied (publickey).

Then I start sshd in debug mode on the tinycache server:

/usr/sbin/sshd -d -p 2222

Retrying the connection, this time against port 2222, makes the following errors appear in the debug sshd's output:

Authentication refused: bad ownership or modes for directory /home/ec2-user/.ssh
Authentication refused: bad ownership or modes for directory /home/ec2-user/.ssh
Authentication refused: bad ownership or modes for directory /home/ec2-user/.ssh
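The condition behind this message can be checked up front. As far as I understand sshd's StrictModes behaviour, it refuses key authentication when the home directory, ~/.ssh, or authorized_keys is owned by someone other than the user or root, or is writable by group or others. The sketch below approximates that check; it is my own approximation and not sshd's actual code, and the function name is hypothetical.

```python
import os
import stat
from typing import List


def ssh_mode_problems(path: str) -> List[str]:
    """List reasons sshd would likely reject `path` under StrictModes."""
    st = os.stat(path)
    problems = []
    # Ownership: must belong to the user (here: the caller) or to root.
    if st.st_uid not in (0, os.getuid()):
        problems.append("owned by uid {}".format(st.st_uid))
    # Modes: must not be writable by group or others.
    if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append(
            "group/world writable (mode {:o})".format(stat.S_IMODE(st.st_mode))
        )
    return problems


if __name__ == "__main__":
    for name in ("~", "~/.ssh", "~/.ssh/authorized_keys"):
        path = os.path.expanduser(name)
        if os.path.exists(path):
            print(path, ssh_mode_problems(path) or "ok")
```

In the experiment above, chmod og+rwx .ssh makes the directory group- and world-writable, which is exactly what the second check reports; chmod 700 ~/.ssh clears it.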