Deleting Queried Data with a Python Script

Published: 2020-05-28 17:03:58  Source: 億速云  Views: 269  Author: 鴿子  Category: System Operations

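Background

The business system stores all of its reports and statistics in Elasticsearch (ES). For historical reasons, the system recomputes these statistics in full every day, so over time the storage pressure on ES has become severe. In addition, because index usage was never planned properly, some indices have even run into the maximum document count limit. The challenge for the operations team is to solve this problem at minimal cost. The rest of this article uses the intranet development and test environments as an example and solves the problem with a Python script.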

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
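
As a rough illustration of the monitoring hint above, the following Python 3 sketch parses plain-text "_cat/shards" output and flags shards whose document count is approaching the Lucene limit. It assumes the default column order (index, shard, prirep, state, docs, ...). The function name, the 80% threshold, and the sample text are invented for illustration and do not come from a real cluster.

```python
# Lucene's per-shard document limit quoted above (Integer.MAX_VALUE - 128).
LUCENE_MAX_DOCS = 2147483519


def shards_near_limit(cat_shards_text, threshold=0.8):
    """Return (index, shard, docs) for shards at or above threshold * limit.

    Assumes the default _cat/shards column order:
    index shard prirep state docs store ip node
    """
    flagged = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Unassigned shards have no docs column; skip lines we cannot parse.
        if len(fields) < 5 or not fields[4].isdigit():
            continue
        docs = int(fields[4])
        if docs >= threshold * LUCENE_MAX_DOCS:
            flagged.append((fields[0], fields[1], docs))
    return flagged


# Hypothetical sample output, not taken from a real cluster.
sample = """\
fjhb-surveyor-v2 0 p STARTED 2000000000 310gb 192.168.1.19 node-1
fjhb-surveyor-v2 1 p STARTED 1500 1.2mb 192.168.1.19 node-1
fjhb-surveyor-v2 0 r UNASSIGNED
"""
print(shards_near_limit(sample))  # [('fjhb-surveyor-v2', '0', 2000000000)]
```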

Approach

ES natively supports deleting the documents matched by a query through the "_delete_by_query" API. First, we retrieve information about every index on the ES server from the "_cat/indices" endpoint. The columns of its output are as follows:

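Column 1: the current health status of the index
Column 3: the index name
Column 4: the index UUID, which is also the name of its storage directory on the server
Columns 5 and 6: the number of primary shards and the number of replicas
Column 7: the number of documents currently in the index
The last two columns: the storage used by the index. The second-to-last column (store.size) covers primaries and replicas, and the last column (pri.store.size) covers primaries only, so when all replicas are allocated store.size is roughly pri.store.size multiplied by (1 + number of replicas).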
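
To make the column layout concrete, here is a minimal Python sketch that turns plain-text "_cat/indices" output into dictionaries keyed by column name. It assumes the default ten columns reported for open indices. The function name and the sample line (including its UUID and sizes) are made up for illustration.

```python
def parse_cat_indices(text):
    """Parse plain-text _cat/indices output into a list of dicts.

    Assumes the default ten columns:
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    """
    columns = ["health", "status", "index", "uuid", "pri", "rep",
               "docs.count", "docs.deleted", "store.size", "pri.store.size"]
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip lines that do not carry all ten fields.
        if len(fields) != len(columns):
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


# Hypothetical sample line; the uuid and sizes are invented.
sample = "yellow open fjhb-surveyor-v2 Xy12AbCdEfGhIjKlMnOpQr 5 1 123456 0 2.4gb 2.4gb\n"
row = parse_cat_indices(sample)[0]
print(row["index"], row["docs.count"], row["pri.store.size"])
```

All values stay as strings here, which matches how the script in this article treats the size column.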

Next, we assemble a delete command with curl and send it to the ES server for execution. The createTime field holds the time at which each document was generated, in milliseconds.

curl -X POST "http://192.168.1.19:9400/fjhb-surveyor-v2/_delete_by_query?pretty" -H 'Content-Type: application/json' -d '
{
    "query": {
        "range": {
            "createTime": {
                "lt": 1580400000000,
                "format": "epoch_millis"
            }
        }
    }
}'
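
The "lt" value in the request above, 1580400000000, is midnight on 2020-01-31 in UTC+8 expressed as epoch milliseconds. The Python 3 sketch below shows one way such a cutoff could be derived with an explicit time zone. The function name, its parameters, and the reference date 2020-03-01 are assumptions chosen for illustration.

```python
import datetime


def cutoff_millis(days, today=None, tz=None):
    """Epoch milliseconds for midnight of the day `days` days away from `today`.

    `days` is negative for dates in the past. `tz` is a datetime.tzinfo;
    when it is omitted, the machine's local time zone is used.
    """
    if today is None:
        today = datetime.date.today()
    target = today + datetime.timedelta(days=days)
    midnight = datetime.datetime(target.year, target.month, target.day, tzinfo=tz)
    # timestamp() treats a naive datetime as local time and honors tzinfo otherwise.
    return int(midnight.timestamp()) * 1000


# 30 days before 2020-03-01 is 2020-01-31; evaluated in UTC+8.
utc8 = datetime.timezone(datetime.timedelta(hours=8))
print(cutoff_millis(-30, today=datetime.date(2020, 3, 1), tz=utc8))  # 1580400000000
```

Passing the time zone explicitly keeps the cutoff stable when the script runs on machines configured with different local time zones.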

Implementation

#!/usr/bin/python
# -*- coding: UTF-8 -*-

# Import the required modules
import requests
import time
import datetime
import os

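# Fetch the index list from ES and return a dict that maps each index name
# to its storage size (the last column of the _cat/indices output)
def getData(env):
    header = {"Content-Type": "application/x-www-form-urlencoded",
              "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}
    data = {}
    # Keep a copy of the raw response in result.txt, then parse it line by line
    with open('result.txt','w+') as f:
        req = requests.get(url=env+'/_cat/indices',headers=header).text
        f.write(req)
        f.seek(0)
        for line in f.readlines():
            fields = line.split()
            # Skip blank or incomplete lines
            if len(fields) < 3:
                continue
            # fields[2] is the index name, fields[-1] is the size
            data[fields[2]] = fields[-1]
    return data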

# Convert a day offset from today (local midnight) to a Unix timestamp in milliseconds, returned as an int
def unixTime(day):
    today = datetime.date.today()
    target_day = today + datetime.timedelta(day)
    unixtime = int(time.mktime(target_day.timetuple())) * 1000
    return unixtime

# Delete old data from ES by invoking the system curl command. Arguments: the
# environment (base URL) and the cutoff timestamp in milliseconds; documents
# whose createTime is earlier than the cutoff are deleted. Because there are
# many indices, only those larger than 1 GB are processed.
def delData(env,day):
    header = 'Content-Type: application/json'
    for key, value in getData(env).items():
        if 'gb' in value:
            size = float(value.split('gb')[0])
            if size > 1:
                # Build the URL from the env argument so each call targets its own environment
                url = env + '/' + key + '/_delete_by_query?pretty'
                command = ("curl -X POST \"%s\" -H '%s' "
                           "-d '{\"query\":{ \"range\": {\"createTime\": {\"lt\": %s,\"format\": \"epoch_millis\"}}}}'" % (
                           url, header, day))
                print(command)
                os.system(command)

if __name__ == '__main__':
    dev = 'http://192.168.1.19:9400'
    test1 = 'http://192.168.1.19:9200'
    test2 = 'http://192.168.1.19:9600'
    day = unixTime(-30)
    delData(dev,day)
    delData(test1,day)
    delData(test2,day)
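
The script above shells out to curl and only recognizes sizes reported in gb, so an index reported in tb would be skipped. As a possible alternative, the Python 3 sketch below converts any "_cat" size string to gigabytes and sends the same delete request over HTTP from Python. It uses the standard library urllib.request to stay free of third-party dependencies. All function names and the timeout value are assumptions for illustration, not part of the original script.

```python
import json
import urllib.request

# Suffixes ordered so that the bare "b" is tested last.
_UNITS = [("pb", 1024.0 ** 2), ("tb", 1024.0), ("gb", 1.0),
          ("mb", 1.0 / 1024), ("kb", 1.0 / 1024 ** 2), ("b", 1.0 / 1024 ** 3)]


def size_in_gb(size):
    """Convert a _cat size string such as '2.4gb' or '1.1tb' to gigabytes."""
    for suffix, factor in _UNITS:
        if size.endswith(suffix):
            return float(size[:-len(suffix)]) * factor
    raise ValueError("unrecognized size: %r" % size)


def build_delete_request(env, index, cutoff_ms):
    """Build (url, body) for deleting documents with createTime < cutoff_ms."""
    url = "%s/%s/_delete_by_query" % (env, index)
    body = {"query": {"range": {"createTime": {"lt": cutoff_ms,
                                               "format": "epoch_millis"}}}}
    return url, body


def delete_old_docs(env, index, cutoff_ms, timeout=3600):
    """POST the delete request and return the decoded JSON response."""
    url, body = build_delete_request(env, index, cutoff_ms)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With these helpers, the size check in delData could become `size_in_gb(value) > 1`, and the os.system call could be replaced by `delete_old_docs(env, key, day)`.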

Verifying the Results

Before deletion (screenshot not preserved)
After deletion (screenshot not preserved)

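Notes

1. The script is currently scheduled through the operating system's crontab and runs once a day.
2. The first deletion takes a long time because of the large volume of data. After that, each run only removes one day's worth of data, so deletion is reasonably fast.
3. The script does not handle exceptional cases such as server errors, and it sends no alert notifications. These need to be added for real-world use.
4. The script does not write any logs. This needs to be added for real-world use.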
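
As one possible starting point for notes 3 and 4, the Python 3 sketch below wraps the per-index deletion in logging and error handling. The logger name, the function names, and the stand-in delete function are hypothetical; any callable that performs the deletion and raises on failure could be passed in.

```python
import logging

logger = logging.getLogger("es_cleanup")  # hypothetical logger name


def run_cleanup(delete_func, env, indices, cutoff_ms):
    """Call delete_func(env, index, cutoff_ms) for each index.

    Successes and failures are logged. Returns the list of index names
    whose deletion raised an exception.
    """
    failed = []
    for index in indices:
        try:
            result = delete_func(env, index, cutoff_ms)
            logger.info("deleted from %s/%s: %s", env, index, result)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("delete failed for %s/%s", env, index)
            failed.append(index)
    return failed


# Stand-in delete function for demonstration; it makes no network calls.
def fake_delete(env, index, cutoff_ms):
    if index == "broken-index":
        raise RuntimeError("simulated server error")
    return {"deleted": 0}


logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
print(run_cleanup(fake_delete, "http://192.168.1.19:9400",
                  ["fjhb-surveyor-v2", "broken-index"], 1580400000000))
```

The returned list of failed indices could then be handed to whatever alerting channel is in use.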
