How to Use the IK Analyzer in Elasticsearch 5.x

Published: 2021-10-19 16:09:06  Source: 億速云 (Yisu Cloud)  Author: 柒染  Category: Big Data

This article shows how to use the IK analyzer in Elasticsearch 5.x: installing the plugin, comparing the ik_smart and ik_max_word analyzers, and adding a custom dictionary.

The IK analyzer releases are available at https://github.com/medcl/elasticsearch-analysis-ik/releases . The plugin version must match the ES version.

Our ES is version 5.6.16, so we download IK 5.6.16:

https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.16/elasticsearch-analysis-ik-5.6.16.zip
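
Because the plugin version has to match the ES version, the download link can be derived from the version string. The following is a minimal Python sketch. The URL pattern is inferred only from the single 5.6.16 link above, and the function name ik_download_url is made up for illustration, so check the releases page before relying on it for other versions.

```python
def ik_download_url(es_version: str) -> str:
    """Build the IK release URL for a given Elasticsearch version.

    Assumption: other releases follow the same naming pattern as the
    5.6.16 link quoted in this article.
    """
    base = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"
    return f"{base}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


print(ik_download_url("5.6.16"))
```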

After unzipping, place the extracted package in the plugins directory of the ES node and rename its folder to ik.


Restart ES and test how IK tokenizes text.

(1) Without specifying an analyzer (ES falls back to its default standard analyzer)

GET _analyze?pretty
{
  "text":"安徽省長江流域"
}

Response:

{
  "tokens": [
    {
      "token": "安",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "徽",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "省",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "長",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "江",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "流",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "域",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    }
  ]
}

As the response shows, "安徽省長江流域" is split into one token per character.
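
The Console requests in this article can also be sent from a script. Below is a minimal Python sketch using only the standard library. The node address http://localhost:9200 and the helper names build_analyze_body and analyze are assumptions for illustration, not part of the original setup.

```python
import json
import urllib.request


def build_analyze_body(text, analyzer=None):
    """Return the JSON body for the _analyze API.

    Leaving analyzer as None omits the field, so ES uses its default.
    """
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    return body


def analyze(text, analyzer=None, base_url="http://localhost:9200"):
    """POST the body to _analyze and return the list of token strings.

    base_url is an assumption; point it at your own node.
    """
    data = json.dumps(build_analyze_body(text, analyzer)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [t["token"] for t in result["tokens"]]


# Example calls (they need a running node, so they are left commented out):
# analyze("安徽省長江流域")               # default analyzer
# analyze("安徽省長江流域", "ik_smart")    # IK, once the plugin is installed
```

Keeping the body builder separate from the network call makes it easy to inspect exactly what is sent before pointing the script at a real cluster.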

(2) With the IK analyzer ik_smart

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text":"安徽省長江流域"
}

Result:

{
  "tokens": [
    {
      "token": "安徽省",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "長江流域",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

(3) With the IK analyzer ik_max_word

GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text":"安徽省長江流域"
}

Result:

{
  "tokens": [
    {
      "token": "安徽省",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "安徽",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "省長",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "長江流域",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "長江",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "江流",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "流域",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    }
  ]
}
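
The start_offset and end_offset fields are character positions in the input text, so each token should equal the corresponding slice of the input. The Python sketch below checks this against the (token, start_offset, end_offset) triples copied from the two IK responses above, and shows that ik_max_word emits overlapping tokens while ik_smart does not. The helper names are hypothetical.

```python
text = "安徽省長江流域"

# (token, start_offset, end_offset) copied from the responses above
ik_smart = [("安徽省", 0, 3), ("長江流域", 3, 7)]
ik_max_word = [
    ("安徽省", 0, 3), ("安徽", 0, 2), ("省長", 2, 4), ("長江流域", 3, 7),
    ("長江", 3, 5), ("江流", 4, 6), ("流域", 5, 7),
]


def offsets_match(text, tokens):
    """True if every token equals text[start_offset:end_offset].

    Python slices by code point; for these characters that coincides
    with the offsets ES reports.
    """
    return all(text[start:end] == token for token, start, end in tokens)


def has_overlap(tokens):
    """True if any two tokens cover a common character position."""
    spans = sorted((start, end) for _, start, end in tokens)
    # After sorting by start, any overlap shows up between neighbours.
    return any(nxt[0] < cur[1] for cur, nxt in zip(spans, spans[1:]))


print(offsets_match(text, ik_smart), has_overlap(ik_smart))        # True False
print(offsets_match(text, ik_max_word), has_overlap(ik_max_word))  # True True
```

This is why ik_max_word suits indexing for recall (more candidate terms per input), while ik_smart produces a coarser, non-overlapping segmentation.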

Why can the IK analyzer recognize Chinese words? Because it ships with built-in dictionaries in its config directory.

What if we need it to recognize a new term, for example the TV series title "權利的游戲" (Game of Thrones)?

Custom dictionary

Create a tv directory under the IK plugin's config directory and add a new file tv.dic to it (note: the file must be encoded as UTF-8 without a BOM).
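
One way to be sure the file carries no BOM is to generate it from a script. The Python sketch below writes one word per line with the plain "utf-8" codec, which does not emit a BOM, and then verifies the result. The helper names write_dic and has_bom are hypothetical, and the demo writes to a temporary directory rather than to a real plugin folder.

```python
import codecs
import tempfile
from pathlib import Path


def write_dic(path, words):
    """Write one word per line as UTF-8; the "utf-8" codec emits no BOM."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def has_bom(path):
    """True if the file starts with the UTF-8 byte order mark."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)


# Demo in a temporary directory. On a real node the target would be the
# tv/tv.dic file under the IK plugin's config directory described above.
with tempfile.TemporaryDirectory() as tmp:
    dic = Path(tmp) / "tv" / "tv.dic"
    dic.parent.mkdir(parents=True)
    write_dic(dic, ["權利的游戲"])
    print(has_bom(dic))  # False: no BOM was written
```

The same has_bom check can be run against a dictionary file saved from a text editor, since some editors add a BOM by default.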

Then register the new dictionary in the IKAnalyzer.cfg.xml file.
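
The exact configuration used here is not reproduced in the text. In IK's configuration file, custom dictionaries are normally listed in the ext_dict entry, with paths relative to the plugin's config directory, so for this setup the value would presumably be tv/tv.dic. The Python sketch below shows that edit on a trimmed sample of the file. SAMPLE_CFG is an illustration rather than the exact file shipped with 5.6.16, and set_ext_dict is a hypothetical helper.

```python
import re

# Trimmed illustration of IKAnalyzer.cfg.xml; the file shipped with your
# plugin version may contain more entries and comments.
SAMPLE_CFG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict"></entry>
    <entry key="ext_stopwords"></entry>
</properties>
"""

_EXT_DICT = re.compile(r'(<entry key="ext_dict">)(.*?)(</entry>)', re.DOTALL)


def set_ext_dict(cfg_xml, dict_path):
    """Return cfg_xml with the ext_dict entry set to dict_path.

    Only the entry's text is replaced, so the XML declaration and
    DOCTYPE are left untouched.
    """
    if not _EXT_DICT.search(cfg_xml):
        raise ValueError('no <entry key="ext_dict"> element found')
    return _EXT_DICT.sub(
        lambda m: m.group(1) + dict_path + m.group(3), cfg_xml, count=1
    )


print(set_ext_dict(SAMPLE_CFG, "tv/tv.dic"))
```

Editing the entry text in place, instead of re-serializing the document with an XML library, keeps the DOCTYPE line that the Java properties format expects.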

Restart ES and Kibana, then try it out:

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text":"權利的游戲"
}

Tokenization result:

{
  "tokens": [
    {
      "token": "權利的游戲",
      "start_offset": 0,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

