The Index Plan

Published: 2020-07-05 03:45:10 | Source: Internet | Views: 298 | Author: 努力的C | Category: Development Technology

In order to index the CSV, we want to take two fields from each row, title and description, and turn them into suitable terms. For straightforward textual search we don’t need document values.

Because we’re dealing with free text, and because we know the whole dataset is in English, we can use stemming so that, for instance, searches for “sundial” and “sundials” match the same documents. This way people don’t need to worry too much about exactly which words to use in their query.
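To make the idea concrete, here is a toy sketch in plain Python. It does not use Xapian's stemmer: `toy_stem` is a hypothetical stand-in that only strips a trailing "s", whereas a real English stemmer handles far more cases. The point it illustrates is that the same normalisation is applied to both the indexed text and the query word.

```python
def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real English stemmer."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_index(docs):
    """Map each stemmed word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def search(index, query_word):
    """Stem the query word the same way before looking it up."""
    return index.get(toy_stem(query_word), set())


docs = {1: "brass sundial", 2: "garden sundials"}
index = build_index(docs)
# Both spellings reduce to the same stem, so both find documents 1 and 2.
both = search(index, "sundial") == search(index, "sundials")
```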

Finally, we want a way of separating the two fields. In Xapian this is done using term prefixes: short strings put at the beginning of terms to indicate which field each term indexes. Alongside the prefixed terms we also generate unprefixed terms, so that you can search within a specific field or for text in any field.

There are some conventional prefixes, which is helpful if you ever need to interoperate with omega (a web-based search engine) or other compatible systems. Following these conventions, we’ll use ‘S’ as the prefix for title (it stands for ‘subject’), and ‘XD’ for description. A full list of conventional prefixes is given at the top of the omega documentation on term prefixes.
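As a rough illustration of what prefixed and unprefixed terms look like, the following plain-Python sketch builds both kinds for one record. It is not Xapian code: `tokenize` and `make_terms` are hypothetical helpers, and the tokenisation is much simpler than what a real term generator does. Only the ‘S’ and ‘XD’ prefixes come from the text above.

```python
import re

# Field name -> term prefix, following the conventions described above.
PREFIXES = {"title": "S", "description": "XD"}


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_terms(fields):
    """Return the set of prefixed and unprefixed terms for one record."""
    terms = set()
    for name, text in fields.items():
        prefix = PREFIXES[name]
        for word in tokenize(text):
            terms.add(prefix + word)  # matches only within this field
            terms.add(word)           # matches text in any field
    return terms


terms = make_terms({"title": "Sundial", "description": "A brass sundial"})
```

With these terms, a lookup for `Ssundial` is restricted to titles, `XDbrass` to descriptions, and the bare `sundial` matches either field.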

When you’re indexing multiple fields like this, the term positions used for each field when indexed unprefixed need to be kept apart. Say you have a title of “The Saints”, and description “Don’t like rabbits? Keep reading.” If you index those fields without a gap, the phrase search “Saints don’t like rabbits” will match, where it really shouldn’t. Usually a gap of 100 between each field is enough.
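The effect of the gap can be shown with a small plain-Python model of positional indexing. This is only a sketch of the idea, not how Xapian stores positions: `index_positions` and `phrase_matches` are hypothetical helpers, and a phrase is treated as matching only when its words sit at consecutive positions.

```python
import re


def tokenize(text):
    """Lower-case, drop apostrophes, and split into simple words."""
    return re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))


def index_positions(fields, gap):
    """Map each word to its positions, skipping `gap` positions between fields."""
    postings = {}
    pos = 0
    for i, text in enumerate(fields):
        if i > 0:
            pos += gap
        for word in tokenize(text):
            pos += 1
            postings.setdefault(word, []).append(pos)
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase's words occur at consecutive positions."""
    words = tokenize(phrase)
    if not words:
        return False
    return any(
        all(start + offset in postings.get(word, [])
            for offset, word in enumerate(words))
        for start in postings.get(words[0], [])
    )


fields = ["The Saints", "Don't like rabbits? Keep reading."]
phrase = "Saints don't like rabbits"

no_gap = phrase_matches(index_positions(fields, gap=0), phrase)
with_gap = phrase_matches(index_positions(fields, gap=100), phrase)
```

Without a gap, "saints" and "dont" land on adjacent positions and the cross-field phrase matches; with a gap of 100 it no longer does, while a phrase inside a single field such as "like rabbits" still matches.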

To write to a database, we use the WritableDatabase class, which allows us to create, update or overwrite a database.

To create terms, we use Xapian’s TermGenerator, a built-in class that makes it easier to turn free text into terms. It splits text into words, applies stemming, and adds term prefixes as needed. It can also take care of term positions, including the gap between different fields.
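Putting the plan together, an indexer along these lines might look like the sketch below, which uses Xapian's Python bindings. Several things here are assumptions rather than part of the text above: the CSV column names `TITLE` and `DESCRIPTION`, the helper names `read_rows` and `index_csv`, and the choice to store each row as JSON document data. The `xapian` import is deferred into the function so that the CSV helper can be used even where the bindings are not installed.

```python
import csv
import json


def read_rows(csv_path):
    """Yield each CSV row as a dict keyed by the header line."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def index_csv(csv_path, db_path):
    """Index the title and description of every row into a Xapian database."""
    # Deferred import: requires the Xapian Python bindings at call time.
    import xapian

    # Create the database, or open it if it already exists.
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)

    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    for row in read_rows(csv_path):
        # Column names are assumed; adjust them to the actual CSV header.
        title = row.get("TITLE", "")
        description = row.get("DESCRIPTION", "")

        doc = xapian.Document()
        termgenerator.set_document(doc)

        # Prefixed terms, for searching within a single field.
        termgenerator.index_text(title, 1, "S")
        termgenerator.index_text(description, 1, "XD")

        # Unprefixed terms, for searching across both fields.
        termgenerator.index_text(title)
        # Leave a gap in term positions (100 by default) between the fields.
        termgenerator.increase_termpos()
        termgenerator.index_text(description)

        # Keep the original row so it can be shown in search results.
        doc.set_data(json.dumps(row))
        db.add_document(doc)

    db.commit()
```

A call such as `index_csv("data.csv", "db")` would then build the database; both paths are placeholders.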

