What Are the Common Linux Monitoring Metrics

Published: 2021-11-30 09:26:56 | Source: Yisu Cloud | Views: 175 | Author: iii | Category: Big Data

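1. Basic collection items for Linux operations

In operations work, problems themselves are not what we fear. What we fear is a problem that leaves no trace to investigate, so that we are working in the dark. That is why it matters to rely on a capable monitoring system and collect as many metrics as possible. But which metrics are actually meaningful? In the spirit of learning from practice, the experience engineers have accumulated through years of hands-on work is the most valuable guide.

From that long-term practice, we have summarized the metrics that are frequently consulted during system operations. They fall into the following categories:

CPU
Load
Memory
Disk
IO
Network
Kernel parameters
ss statistics output
Port collection
Liveness of core service processes
Resource consumption of key business processes
NTP offset collection
DNS resolution collection

The detailed metrics in each category are listed below. All of them are supported directly by the agent component of open-falcon. falcon-agent collects these metrics at a fixed interval (currently 60 seconds) and reports them to the server.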

2. CPU-related collection items

Calculation method: the values are derived from /proc/stat. The statistics output of the sar command is a useful reference for understanding them.

cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
cpu.busy: the complement of cpu.idle; its value equals 100 minus cpu.idle.
cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
cpu.cnt: number of CPU cores.
cpu.switches: number of CPU context switches; counter type.
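
As a rough illustration of how percentages like these can be derived from /proc/stat, the sketch below diffs two snapshots of the aggregate cpu line. It is not falcon-agent's actual code: the function names are hypothetical, and it assumes the total is the sum of the first eight counters (user through steal), because the kernel already counts guest time inside user.

```python
# Hypothetical helper names; a sketch, not the open-falcon implementation.
CPU_FIELDS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            # Very old kernels expose fewer columns; pad with zeros.
            values += [0] * (len(CPU_FIELDS) - len(values))
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before_text, after_text):
    """Compute cpu.* percentages from two /proc/stat snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = {k: after[k] - before[k] for k in CPU_FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    pct = {"cpu." + k: 100.0 * d / total for k, d in deltas.items()}
    pct["cpu.busy"] = 100.0 - pct["cpu.idle"]
    return pct
```

In practice the two snapshots would be the contents of /proc/stat read one collection interval apart.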

3. Disk-related collection items

Calculation method: first read /proc/mounts to get all mount points, then use syscall.Statfs_t to get block and inode usage. Each metric carries a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point, such as /home, and $fstype is the file system, such as ext4.

df.bytes.free: free disk space, int64
df.bytes.free.percent: free space as a percentage of the total, float64, e.g. 32.1
df.bytes.total: total disk size, int64
df.bytes.used: used disk space, int64
df.bytes.used.percent: used space as a percentage of the total, float64
df.inodes.total: total number of inodes, int64
df.inodes.free: number of free inodes, int64
df.inodes.free.percent: percentage of free inodes, float64
df.inodes.used: number of used inodes, int64
df.inodes.used.percent: percentage of used inodes, float64
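
The same kind of numbers can be sketched in Python with os.statvfs, which exposes the block and inode counters of a mounted file system. The helper names below are hypothetical, and two choices are assumptions rather than confirmed agent behavior: "free" uses the blocks available to unprivileged users (f_bavail), and percentages are taken against the total size, as the descriptions above state.

```python
import os


def df_metrics(frsize, blocks, bfree, bavail, files, ffree):
    """Compute df.* style metrics from raw statvfs counters (hypothetical helper)."""
    total = blocks * frsize
    used = (blocks - bfree) * frsize
    free = bavail * frsize  # assumption: space available to non-root users
    inodes_used = files - ffree
    return {
        "df.bytes.total": total,
        "df.bytes.used": used,
        "df.bytes.free": free,
        "df.bytes.used.percent": 100.0 * used / total if total else 0.0,
        "df.bytes.free.percent": 100.0 * free / total if total else 0.0,
        "df.inodes.total": files,
        "df.inodes.used": inodes_used,
        "df.inodes.free": ffree,
        "df.inodes.used.percent": 100.0 * inodes_used / files if files else 0.0,
        "df.inodes.free.percent": 100.0 * ffree / files if files else 0.0,
    }


def collect(mount):
    """Read the counters for one mount point, e.g. collect('/home')."""
    st = os.statvfs(mount)
    return df_metrics(st.f_frsize, st.f_blocks, st.f_bfree,
                      st.f_bavail, st.f_files, st.f_ffree)
```

Because root-reserved blocks count as neither used nor free here, the two byte percentages may not add up to exactly 100.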

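4. megacli tool output

RAID information is read with the megacli tool. Each metric carries a set of tags identifying the PD or VD it belongs to. The PD format is PD=Enclosure_ID:SLOT_ID; for example, PD=32:0 denotes the first disk, and VD=0 denotes the first logical disk.

sys.disk.lsiraid.pd.Media_Error_Count: this metric and the three below are currently collected for reference only and do not necessarily mean the disk is damaged (only that the probability of damage is higher)
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state: a nonzero value means this physical disk has a problem
sys.disk.lsiraid.vd.cache_policy: a nonzero value means the cache policy of this logical disk does not match the configured setting
sys.disk.lsiraid.vd.state: a nonzero value means this logical disk has a problem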

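5. SMART tool output

Disk SMART information is read with the smartctl tool. All of these metrics are currently collected for reference only and do not necessarily mean the disk is damaged (only that the probability is higher). Each metric carries a set of tags identifying the device, for example device=/dev/sda.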

sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius

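6. Partition read/write monitoring

Tests whether every mounted partition is readable and writable. Each metric carries a set of tags identifying the mount point, for example mount=/home

sys.disk.rw: a nonzero value indicates a read/write problem on this partition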

7. IO-related collection items

Calculation method: /proc/diskstats is collected once per second and the differences are computed; all of these are counter type. Each metric carries a set of tags of the form device=$device identifying the specific device, such as sda1 or sdb. See the iostat help documentation for the meaning of each metric.

disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
disk.io.msec_read: Total number of ms spent by all reads.
disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
disk.io.msec_write: Total number of ms spent by all writes.
disk.io.read_merged: Adjacent read requests merged in a single req.
disk.io.read_requests: Total number of reads completed successfully.
disk.io.read_sectors: Total number of sectors read successfully.
disk.io.write_merged: Adjacent write requests merged in a single req.
disk.io.write_requests: Total number of writes completed successfully.
disk.io.write_sectors: Total number of sectors written successfully.
disk.io.read_bytes: a number in bytes
disk.io.write_bytes: a number in bytes
disk.io.avgrq_sz: this and the values below are the ones shown by iostat -x 1
disk.io.avgqu-sz
disk.io.await
disk.io.svctm
disk.io.util: a percentage; for example, 56.43 means 56.43%
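
To make the "collect and diff" idea concrete, here is a minimal sketch that parses two /proc/diskstats snapshots and derives a few of the values above. The function names are hypothetical; the formulas for await and util follow the usual iostat definitions and are assumptions about, not a copy of, the agent's code. /proc/diskstats always counts sectors in 512-byte units.

```python
# Column names for the 11 classic per-device counters in /proc/diskstats.
DISKSTAT_FIELDS = (
    "read_requests", "read_merged", "read_sectors", "msec_read",
    "write_requests", "write_merged", "write_sectors", "msec_write",
    "ios_in_progress", "msec_total", "msec_weighted_total",
)
SECTOR_BYTES = 512


def parse_diskstats(text):
    """Map device name -> dict of counters from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(DISKSTAT_FIELDS):
            continue
        values = [int(v) for v in parts[3:3 + len(DISKSTAT_FIELDS)]]
        stats[parts[2]] = dict(zip(DISKSTAT_FIELDS, values))
    return stats


def io_metrics(before, after, interval_sec):
    """Derive per-interval metrics for one device from two counter dicts."""
    d = {k: after[k] - before[k] for k in DISKSTAT_FIELDS}
    ios = d["read_requests"] + d["write_requests"]
    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        # Average time per completed request, in ms.
        "disk.io.await": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        # Share of the interval during which the device had I/O in flight.
        "disk.io.util": 100.0 * d["msec_total"] / (interval_sec * 1000.0),
    }
```

Note that ios_in_progress is an instantaneous value rather than a cumulative counter, so its delta is not used in the derived metrics.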

8. Machine load collection items

Calculation method: read /proc/loadavg; all of these are raw-value type:

load.1min
load.5min
load.15min
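
Reading these three values is straightforward, since the first three fields of /proc/loadavg are the 1, 5, and 15 minute averages. A small sketch, with a hypothetical function name:

```python
def parse_loadavg(text):
    """Return load.* metrics from the contents of /proc/loadavg."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    return {"load.1min": one, "load.5min": five, "load.15min": fifteen}
```

The remaining fields of the file (runnable/total scheduling entities and the last PID) are ignored here.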

9. Memory-related collection items

Calculation method: read the contents of /proc/meminfo, where mem.memfree is free+buffers+cached and mem.memused=mem.memtotal-mem.memfree. See the output and help documentation of the free command for the meaning of each metric.

mem.memtotal: total memory size
mem.memused: amount of memory in use
mem.memused.percent: percentage of memory in use
mem.memfree
mem.memfree.percent
mem.swaptotal: total swap size
mem.swapused: amount of swap in use
mem.swapused.percent: percentage of swap in use
mem.swapfree
mem.swapfree.percent
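
The formulas stated above (memfree = free+buffers+cached, memused = memtotal-memfree) can be sketched as follows. The function names are hypothetical, and reporting values in bytes rather than kB is an assumption made for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict; kB values are converted to bytes."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


def mem_metrics(info):
    """Compute mem.* metrics using the formulas given in the text."""
    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]
    used = total - free
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = swap_total - swap_free
    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": used,
        "mem.memused.percent": 100.0 * used / total if total else 0.0,
        "mem.memfree.percent": 100.0 * free / total if total else 0.0,
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        # Guard against hosts with no swap configured.
        "mem.swapused.percent": 100.0 * swap_used / swap_total if swap_total else 0.0,
        "mem.swapfree.percent": 100.0 * swap_free / swap_total if swap_total else 0.0,
    }
```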

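10. Network-related collection items

Calculation method: read the contents of /proc/net/dev. Each metric carries a set of tags of the form iface=$iface identifying the specific interface, such as eth0. Metrics containing in describe inbound traffic, out describes outbound traffic, and total is the sum in+out. The supported metrics are: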

net.if.in.bytes
net.if.in.compressed
net.if.in.dropped
net.if.in.errors
net.if.in.fifo.errs
net.if.in.frame.errs
net.if.in.multicast
net.if.in.packets
net.if.out.bytes
net.if.out.carrier.errs
net.if.out.collisions
net.if.out.compressed
net.if.out.dropped
net.if.out.errors
net.if.out.fifo.errs
net.if.out.packets
net.if.total.bytes
net.if.total.dropped
net.if.total.errors
net.if.total.packets
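
As an illustration, /proc/net/dev lists eight receive counters followed by eight transmit counters per interface, which map naturally onto the metric names above. The sketch below uses a hypothetical function name and returns the raw cumulative counters; how the agent turns them into rates is not shown.

```python
# Column order of /proc/net/dev, mapped to the metric suffixes used above.
RX_FIELDS = ("bytes", "packets", "errors", "dropped",
             "fifo.errs", "frame.errs", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errors", "dropped",
             "fifo.errs", "collisions", "carrier.errs", "compressed")
TOTAL_FIELDS = ("bytes", "packets", "errors", "dropped")


def parse_net_dev(text):
    """Map interface name -> dict of net.if.* counters from /proc/net/dev text."""
    result = {}
    for line in text.splitlines():
        iface, sep, rest = line.partition(":")
        values = rest.split()
        # Header lines contain no colon; skip them and any malformed rows.
        if not sep or len(values) < 16:
            continue
        if not all(v.isdigit() for v in values[:16]):
            continue
        nums = [int(v) for v in values[:16]]
        rx = dict(zip(RX_FIELDS, nums[:8]))
        tx = dict(zip(TX_FIELDS, nums[8:]))
        metrics = {"net.if.in." + k: v for k, v in rx.items()}
        metrics.update({"net.if.out." + k: v for k, v in tx.items()})
        for k in TOTAL_FIELDS:
            metrics["net.if.total." + k] = rx[k] + tx[k]
        result[iface.strip()] = metrics
    return result
```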

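11. Port collection items

Calculation method: ss -ln is used to determine whether the specified port is in the listen state. This is a raw-value type: the value is either 1, meaning the port is listening, or 0, meaning it is not. Each metric carries a set of tags of the form port=$port, where $port is the specific port.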

net.port.listen
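
A check like this might be sketched by scanning the ss -ln output for rows in the LISTEN state and extracting the port from the local address column. The function names are hypothetical, and the parsing assumes the common column layout (state, Recv-Q, Send-Q, local address); it is not the agent's actual parser.

```python
import subprocess


def listening_ports(ss_output):
    """Return the set of numeric ports that appear in LISTEN state."""
    ports = set()
    for line in ss_output.splitlines():
        parts = line.split()
        if "LISTEN" not in parts:
            continue
        i = parts.index("LISTEN")
        if len(parts) <= i + 3:
            continue
        # Local address is the third column after the state, e.g. 0.0.0.0:22.
        port = parts[i + 3].rsplit(":", 1)[-1]
        if port.isdigit():  # skips unix socket paths
            ports.add(int(port))
    return ports


def port_listen(port, ss_output=None):
    """Return 1 if the port is listening, else 0; runs `ss -ln` if no text is given."""
    if ss_output is None:
        ss_output = subprocess.run(
            ["ss", "-ln"], capture_output=True, text=True, check=True
        ).stdout
    return 1 if port in listening_ports(ss_output) else 0
```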

12. Machine kernel configuration

kernel.maxfiles: read from /proc/sys/fs/file-max
kernel.files.allocated: the first field of /proc/sys/fs/file-nr
kernel.files.left: equals kernel.maxfiles-kernel.files.allocated
kernel.maxproc: read from /proc/sys/kernel/pid_max
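
These four values come straight from three small files, so a sketch is short. The function names are hypothetical; the paths are the ones listed above.

```python
def kernel_metrics(file_max_text, file_nr_text, pid_max_text):
    """Compute kernel.* metrics from the raw contents of the three proc files."""
    maxfiles = int(file_max_text.split()[0])
    allocated = int(file_nr_text.split()[0])  # first field of file-nr
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
        "kernel.maxproc": int(pid_max_text.split()[0]),
    }


def read_text(path):
    """Read a small proc file as text."""
    with open(path) as f:
        return f.read()


def collect():
    """Read the live values from /proc on a Linux host."""
    return kernel_metrics(
        read_text("/proc/sys/fs/file-max"),
        read_text("/proc/sys/fs/file-nr"),
        read_text("/proc/sys/kernel/pid_max"),
    )
```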

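13. NTP collection items

ntpq -pn is used to obtain the offset of the local clock relative to the NTP server.

sys.ntp.offset: the local time offset in ms; a value that is too large, or exactly 0, indicates an anomaly and should trigger an alert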
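
One plausible way to extract the offset is to pick the peer that ntpq marks with a leading asterisk (the currently selected sync source) and read its offset column, which ntpq reports in milliseconds. This is a sketch under that assumption; the function name is hypothetical and the agent may choose the peer differently.

```python
def selected_peer_offset(ntpq_output):
    """Return the offset (ms) of the peer marked '*' in `ntpq -pn` output, or None."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue
        parts = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(parts) >= 10:
            return float(parts[-2])
    return None
```

Returning None when no peer is selected keeps "no sync source" distinguishable from a measured offset.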

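14. Process monitoring

proc.num: the number of instances of a given process. There are two ways to identify the process. One is by process name, for example name=sshd. The other is by cmdline: Java application processes may all be named java, so the name alone cannot tell them apart. In that case, configure cmdline, for example cmdline=./falcon_agent-c./cfg.ini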
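
The two matching modes can be sketched by walking /proc and comparing each process's name and command line. The function names are hypothetical, and two details are assumptions: names are compared for exact equality, and cmdline is matched as a substring of the space-joined arguments.

```python
import os


def list_processes(proc_root="/proc"):
    """Return (name, cmdline) tuples for every process visible under /proc."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated; join them with spaces.
                raw = f.read().replace(b"\0", b" ")
            cmdline = raw.decode(errors="replace").strip()
        except OSError:
            continue  # the process exited while we were reading it
        procs.append((name, cmdline))
    return procs


def proc_num(procs, name=None, cmdline=None):
    """Count processes matching an exact name and/or a cmdline substring."""
    count = 0
    for proc_name, proc_cmdline in procs:
        if name is not None and proc_name != name:
            continue
        if cmdline is not None and cmdline not in proc_cmdline:
            continue
        count += 1
    return count
```

For example, proc_num(list_processes(), name="sshd") counts running sshd processes on the local host.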

15. Process resource monitoring

process.cpu.all: sys+user CPU used by the process and its children, in jiffies
process.cpu.sys: sys CPU used by the process and its children, in jiffies
process.cpu.user: user CPU used by the process and its children, in jiffies
process.swap: swap used by the process and its children, in pages
process.fd: number of file descriptors used by the process
process.mem: memory occupied by the process, in bytes
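
For the CPU figures, /proc/<pid>/stat exposes utime, stime, cutime, and cstime (fields 14 to 17), all in clock ticks. The sketch below adds the child counters to the process's own, which is one reading of "the process and its children"; that interpretation and the function names are assumptions. Note that cutime and cstime only cover children that have already been waited for.

```python
import os


def process_cpu(stat_text):
    """Compute process.cpu.* values from the contents of /proc/<pid>/stat."""
    # The comm field may contain spaces or parentheses, so split after the last ')'.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14..17 are rest[11:15].
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }


def process_fd(pid):
    """Count open file descriptors by listing /proc/<pid>/fd."""
    return len(os.listdir("/proc/{}/fd".format(pid)))
```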

16. ss command output

ss.orphaned
ss.closed
ss.timewait
ss.slabinfo.timewait
ss.synrecv
ss.estab
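
The metric names above line up with the TCP summary line printed by ss -s, so one plausible source for them is a parser like the sketch below. The function name is hypothetical, and mapping the second number of the "timewait a/b" pair to ss.slabinfo.timewait is an assumption based on that output format. Newer iproute2 releases may omit some of these fields.

```python
import re


def parse_ss_summary(text):
    """Extract ss.* metrics from the 'TCP:' line of `ss -s` output."""
    for line in text.splitlines():
        if not line.startswith("TCP:"):
            continue
        match = re.search(r"\((.*)\)", line)
        if not match:
            continue
        metrics = {}
        pairs = re.findall(r"(\w+) (\d+)(?:/(\d+))?", match.group(1))
        for key, first, second in pairs:
            metrics["ss." + key] = int(first)
            if second:
                # e.g. "timewait 1/0": the second number is the slab count.
                metrics["ss.slabinfo." + key] = int(second)
        return metrics
    return {}
```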
