What Is a Term Vector in Java and How Do You Use It

Published: 2021-12-21 10:43:56  Source: Yisu Cloud (亿速云)  Author: iii  Category: Development

This article explains what a term vector is in Elasticsearch, how to read the term vector output, and how to query it.

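What is a term vector?

Whenever a document is indexed, Elasticsearch stores it in the forward index and the inverted index. If a field in the index mapping sets the term_vector parameter, Elasticsearch also computes and stores statistics about the terms produced by analyzing that field: which fields the document has, the df and ttf of each term in each field, and the position and offset of each term. These statistics are collectively called the term vector. The five term_vector values discussed here are:

no: do not store term vector information (the default)
yes: store only the field's terms, without position or offset information
with_positions: store terms and positions
with_offsets: store terms and offsets
with_positions_offsets: store the full term vector, including terms, positions, and offsets

Term vector information can be generated in two ways: index-time and query-time. Index-time generates it when the document is indexed; query-time computes it on the fly while the request is served. The former trades space for time, the latter trades time for space.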


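What is a term vector used for?

A term vector is essentially a data inspection tool (think of it as a debugger). It records the details of the terms produced by analyzing a document's fields: how many terms the value was split into, where each term sits in the field, and each term's df and ttf. It is mostly used to troubleshoot suspected data problems. When sorting or search results differ from what you expect, you can inspect the term vectors by hand to help find the root cause.

Reading term vector output

Here is a complete term vector response, showing what information it carries: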

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 3
      },
      "terms": {
        "elasticsearch": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 24
            }
          ]
        },
        "hello": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "java": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}

Term vector information is collected per field. A complete response has three main parts:

field statistics
term statistics
term information
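As a minimal sketch of how these three parts can be read programmatically, the Python snippet below parses a copy of the sample response above and lists the tokens in the order they appear in the field. The helper name tokens_in_order is hypothetical and only illustrates the response layout; it is not part of any Elasticsearch client.

```python
import json

# Copy of the "term_vectors" part of the sample response shown above.
SAMPLE = json.loads("""
{
  "term_vectors": {
    "text": {
      "field_statistics": {"sum_doc_freq": 3, "doc_count": 1, "sum_ttf": 3},
      "terms": {
        "elasticsearch": {
          "doc_freq": 1, "ttf": 1, "term_freq": 1,
          "tokens": [{"position": 2, "start_offset": 11, "end_offset": 24}]
        },
        "hello": {
          "doc_freq": 1, "ttf": 1, "term_freq": 1,
          "tokens": [{"position": 0, "start_offset": 0, "end_offset": 5}]
        },
        "java": {
          "doc_freq": 1, "ttf": 1, "term_freq": 1,
          "tokens": [{"position": 1, "start_offset": 6, "end_offset": 10}]
        }
      }
    }
  }
}
""")


def tokens_in_order(response, field):
    """Hypothetical helper: (position, term, start, end) tuples sorted by position."""
    terms = response["term_vectors"][field]["terms"]
    rows = []
    for term, info in terms.items():
        # "tokens" is the term information part; it may hold several entries.
        for tok in info.get("tokens", []):
            rows.append((tok["position"], term, tok["start_offset"], tok["end_offset"]))
    return sorted(rows)


# Field statistics live next to "terms"; term statistics live inside each term.
print(SAMPLE["term_vectors"]["text"]["field_statistics"])
for row in tokens_in_order(SAMPLE, "text"):
    print(row)
```

Sorting by position recovers the original word order of the field, which is often the first thing to check when an analyzer splits text in an unexpected way.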

field statistics

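Statistics over all the terms of this field across every document in the index and type. Note the scope: not a single document, but all documents under the given index/type.

sum_doc_freq (sum of document frequency): the sum of the df values of all terms in this field.
doc_count (document count): how many documents contain this field; some documents may not have it.
sum_ttf (sum of total term frequency): the sum of the ttf values of all terms in this field.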
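The three counters can be illustrated with a small Python sketch over an in-memory list of documents. The analyze function is an assumed stand-in for a real analyzer (it only lowercases and splits on non-word characters), and the toy corpus is made up for illustration; Elasticsearch computes these values from its index, not this way.

```python
import re
from collections import Counter


def analyze(text):
    """Assumed stand-in for an analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def field_statistics(docs, field):
    """Compute sum_doc_freq, doc_count and sum_ttf for one field over all docs."""
    doc_count = 0
    doc_freq = Counter()         # term -> number of documents containing it
    total_term_freq = Counter()  # term -> occurrences across all documents
    for doc in docs:
        if field not in doc:
            continue             # documents without the field are not counted
        doc_count += 1
        terms = analyze(doc[field])
        total_term_freq.update(terms)
        doc_freq.update(set(terms))  # each document counts once per term
    return {
        "sum_doc_freq": sum(doc_freq.values()),
        "doc_count": doc_count,
        "sum_ttf": sum(total_term_freq.values()),
    }


# Made-up corpus: the third document has no "text" field.
corpus = [
    {"text": "hello hello java"},
    {"text": "hello world"},
    {"title": "no text field here"},
]
print(field_statistics(corpus, "text"))
```

In this corpus "hello" appears in two documents and three times overall, which is why sum_doc_freq (4) and sum_ttf (5) differ.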

term statistics

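hello is one of the terms produced by analyzing the text field of the current document. doc_freq and ttf are returned only when the request sets term_statistics=true.

doc_freq (document frequency): how many documents contain this term.
ttf (total term frequency): how many times this term occurs across all documents.
term_freq (term frequency in the field): how many times this term occurs in the current document.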
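To make the difference between the three numbers concrete, here is a hedged Python sketch that computes them for one document of a made-up corpus. The analyze function is again an assumed simplification of a real analyzer, and term_statistics is a hypothetical helper name.

```python
import re
from collections import Counter


def analyze(text):
    """Assumed stand-in for an analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def term_statistics(texts, current):
    """Per-term doc_freq, ttf and term_freq for the document at index `current`."""
    analyzed = [analyze(t) for t in texts]
    doc_freq = Counter()
    ttf = Counter()
    for terms in analyzed:
        ttf.update(terms)            # every occurrence counts
        doc_freq.update(set(terms))  # every document counts once
    term_freq = Counter(analyzed[current])
    return {
        term: {"doc_freq": doc_freq[term], "ttf": ttf[term], "term_freq": freq}
        for term, freq in term_freq.items()
    }


# Made-up field values of two documents; we inspect the first one.
texts = ["hello hello java", "hello world"]
for term, stats in sorted(term_statistics(texts, 0).items()):
    print(term, stats)
```

For the first document, "hello" has term_freq 2 (twice in this document), ttf 3 (three times overall), and doc_freq 2 (present in both documents).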

term information

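This is the content of tokens in the example; tokens is an array.

position: the ordinal position of the term within the field. If the same term occurs several times, tokens contains one entry per occurrence.
start_offset: the character offset in the field at which the term starts.
end_offset: the character offset in the field at which the term ends.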
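The position and offset values in the sample response can be reproduced with a short Python sketch. It assumes the field value was "hello java elasticsearch" (inferred from the offsets in the sample, not stated in the article) and uses a simple word regex as a rough approximation of the standard analyzer.

```python
import re


def tokens_with_offsets(text):
    """Map each lowercased term to its list of position/offset entries."""
    result = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        result.setdefault(match.group().lower(), []).append({
            "position": position,         # ordinal of the token in the field
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
        })
    return result


for term, tokens in tokens_with_offsets("hello java elasticsearch").items():
    print(term, tokens)
```

A repeated term gets several entries under the same key, which mirrors how tokens holds one record per occurrence.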

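Term vector usage example

Create the index music with a type named children. The content field is configured for index-time term vectors (term_vector is set in the mapping); the fullname field has no term_vector setting, so its term vectors are computed at query time.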

PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "store": true,
          "analyzer": "standard"
        },
        "fullname": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

Add 3 sample documents

PUT /music/children/1
{
  "fullname" : "Jean Ritchie",
  "content" : "Love Somebody"
}

PUT /music/children/2
{
  "fullname" : "John Smith",
  "content" : "wake me, shark me ..."
}

PUT /music/children/3
{
  "fullname" : "Peter Raffi",
  "content" : "brush your teeth"
}

Inspect the term vector of the document with id 1

GET /music/children/1/_termvectors
{
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

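The response has the same structure as the term vector example above, with the terms of the content field in place of the sample terms. Note that querying with any of these 3 document ids returns the same field_statistics section.

Common term vector usage

Besides the standard query in the previous section, several parameters extend what a term vector request can do.

The doc parameter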

GET /music/children/_termvectors
{
  "doc" : {
    "fullname" : "Peter Raffi",
    "content" : "brush your teeth"
  },
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

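This request runs term vector analysis on the document supplied in doc rather than on a stored one. The content of doc can be anything you like, which makes it very handy.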
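The two request shapes seen so far (a stored document addressed by id, and an ad hoc document passed in doc) differ only in the path and in the presence of doc. The Python sketch below builds the path and JSON body for either case; build_termvectors_request is a hypothetical helper, and sending the request with an HTTP client is left out.

```python
import json


def build_termvectors_request(index, doc_type, doc_id=None, doc=None, fields=None):
    """Hypothetical helper: return (path, body) for a _termvectors request."""
    if (doc_id is None) == (doc is None):
        raise ValueError("pass exactly one of doc_id or doc")
    body = {
        "offsets": True,
        "positions": True,
        "term_statistics": True,
        "field_statistics": True,
    }
    if fields:
        body["fields"] = list(fields)
    if doc is not None:
        # Ad hoc document: no id in the path, content travels in the body.
        body["doc"] = doc
        path = f"/{index}/{doc_type}/_termvectors"
    else:
        # Stored document: addressed by id.
        path = f"/{index}/{doc_type}/{doc_id}/_termvectors"
    return path, body


stored_path, stored_body = build_termvectors_request(
    "music", "children", doc_id="1", fields=["content"])
adhoc_path, adhoc_body = build_termvectors_request(
    "music", "children",
    doc={"fullname": "Peter Raffi", "content": "brush your teeth"},
    fields=["content"])

print("GET", stored_path)
print(json.dumps(stored_body, indent=2))
print("GET", adhoc_path)
print(json.dumps(adhoc_body, indent=2))
```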

The per_field_analyzer parameter
Lets you choose which analyzer is applied to a field for the inspection.

GET /music/children/_termvectors
{
  "doc" : {
    "fullname" : "Jimmie Davis",
    "content" : "you are my sunshine"
  },
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "content": "standard"
  }
}

The filter parameter
Filters the term vector results.

GET /music/children/_termvectors
{
  "doc" : {
    "fullname" : "Jimmie Davis",
    "content" : "you are my sunshine"
  },
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
  }
}

The filter selects which terms appear in the result based on their statistics. This is quite useful: when inspecting data you can, for example, drop terms whose frequency is too low.
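The effect of the three filter settings used above can be sketched in Python as follows. The thresholds mirror min_term_freq, min_doc_freq and max_num_terms; the ranking formula (term frequency times a logarithmic inverse document frequency) is an assumption chosen for illustration and is not claimed to be the exact score Elasticsearch uses. The statistics in the example are made up.

```python
import math


def filter_terms(term_stats, doc_count, max_num_terms, min_term_freq=1, min_doc_freq=1):
    """Drop terms below the thresholds, then keep the top-scoring ones."""
    scored = []
    for term, stats in term_stats.items():
        if stats["term_freq"] < min_term_freq:
            continue
        if stats["doc_freq"] < min_doc_freq:
            continue
        # Assumed tf-idf style score: rarer terms in the corpus rank higher.
        score = stats["term_freq"] * math.log(1 + doc_count / stats["doc_freq"])
        scored.append((score, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_num_terms]]


# Made-up statistics for the terms of "you are my sunshine" in a 100-document index.
stats = {
    "you": {"term_freq": 1, "doc_freq": 50},
    "are": {"term_freq": 1, "doc_freq": 80},
    "my": {"term_freq": 1, "doc_freq": 40},
    "sunshine": {"term_freq": 1, "doc_freq": 2},
}
print(filter_terms(stats, doc_count=100, max_num_terms=3))
print(filter_terms(stats, doc_count=100, max_num_terms=3, min_doc_freq=3))
```

Raising min_doc_freq removes the rare term "sunshine" before ranking, so a more common term takes its place in the top three.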

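The docs parameter
Lets you inspect several documents in a single request through the _mtermvectors endpoint. How often you use it is a matter of personal habit.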

GET _mtermvectors
{
   "docs": [
      {
         "_index": "music",
         "_type": "children",
         "_id": "2",
         "term_statistics": true
      },
      {
         "_index": "music",
         "_type": "children",
         "_id": "1",
         "fields": [
            "content"
         ]
      }
   ]
}

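Recommendations for using term vectors

There are two ways to obtain term vector information: configure it when the index is created, as in the example above, or have it generated directly at query time.

index-time: configured in the mapping. The term and field statistics are generated as documents are indexed. With term_vector set to with_positions_offsets, the field's index takes roughly twice the space it would without term vectors.
query-time: no term vector information has been stored beforehand. When you request the term vector, the statistics are computed on the fly and returned to you.

Which one to choose depends on how you expect to use term vectors. Query-time is the more common choice: the tool mainly helps locate problems, so computing on demand is enough.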
