What Is in Solr's schema.xml Configuration File

Published: 2021-12-20 10:53:55 | Source: Yisu Cloud (億速云) | Author: 小新 | Category: Internet Technology

This article walks through the contents of Solr's schema.xml configuration file, section by section, as a reference for readers setting up their own schema.

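I. Field configuration (schema)
   schema.xml lives in the solr/collection1/conf/ directory. Much like a table definition in a database,
   it declares the data types of everything added to the index. It consists mainly of types, fields, and a few other default settings.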

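   1. Start with the types node, which holds the fieldType child nodes. Each takes parameters such as name, class, and positionIncrementGap.
       name: the name of this fieldType.
       class: the Java class that defines how the type behaves. The solr. prefix is shorthand that Solr resolves against its own packages; solr.StrField, for instance, is org.apache.solr.schema.StrField, while tokenizer and filter factories are resolved from the analysis packages.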
       <schema name="example" version="1.2">
           <types>
               <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
               <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true" />
               <fieldType name="binary" class="solr.BinaryField" />
               <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
               <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
               <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
               <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
               ...
           </types>
           ...
       </schema>

       Where needed, a fieldType also declares the analyzer to use for this type at index time and at query time, covering both tokenization and filtering, as follows:
       <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
           <analyzer>
               <tokenizer class="solr.WhitespaceTokenizerFactory" />
           </analyzer>
       </fieldType>
       <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
           <analyzer type="index">
               <tokenizer class="solr.WhitespaceTokenizerFactory" />
               <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
               <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
               <filter class="solr.LowerCaseFilterFactory" />
               <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
           </analyzer>
           <analyzer type="query">
               <tokenizer class="solr.WhitespaceTokenizerFactory" />
               <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
               <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
               <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
               <filter class="solr.LowerCaseFilterFactory" />
               <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
           </analyzer>
       </fieldType>
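
An analyzer runs its tokenizer first and then each filter in the order declared. The Python sketch below imitates only part of the index-time chain (whitespace tokenizer, stop filter, lowercase filter) to show that ordering. It is not Solr's implementation: the stopword set is a hypothetical stand-in for stopwords.txt, and the word-delimiter, synonym, and stemming filters are left out.

```python
# Hypothetical stopword list; Solr would read it from stopwords.txt.
STOPWORDS = {"is", "this", "a", "the"}


def whitespace_tokenize(text):
    """Split on runs of whitespace, as a whitespace tokenizer would."""
    return text.split()


def stop_filter(tokens, stopwords=STOPWORDS, ignore_case=True):
    """Drop tokens that appear in the stopword list."""
    def is_stop(token):
        return (token.lower() if ignore_case else token) in stopwords
    return [token for token in tokens if not is_stop(token)]


def lowercase_filter(tokens):
    """Lowercase every remaining token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the tokenizer, then each filter in declaration order."""
    tokens = whitespace_tokenize(text)
    for token_filter in (stop_filter, lowercase_filter):
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("This is a Solr Schema"))
```

Because the stop filter runs before the lowercase filter here, it needs its own case handling (ignore_case), which mirrors ignoreCase="true" in the configuration above.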

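   2. Next, the fields node defines the actual fields (much like columns in a database table). Each field takes the following attributes:
       name: the field name
       type: one of the fieldType names defined earlier
       indexed: whether the field is indexed
       stored: whether the field's value is stored (set this to false whenever the value does not need to be returned)
       multiValued: whether the field can hold multiple values (set this to true for any field that might, to avoid errors at index time)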
       <fields>
           <field name="id" type="integer" indexed="true" stored="true" required="true" />
           <field name="name" type="text" indexed="true" stored="true" />
           <field name="summary" type="text" indexed="true" stored="true" />
           <field name="author" type="string" indexed="true" stored="true" />
           <field name="date" type="date" indexed="false" stored="true" />
           <field name="content" type="text" indexed="true" stored="false" />
           <field name="keywords" type="keyword_text" indexed="true" stored="false" multiValued="true" />
           
           <field name="all" type="text" indexed="true" stored="false" multiValued="true" />
       </fields>
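
To make the roles of indexed and stored concrete, the following Python sketch parses a trimmed copy of the fields block above with the standard library and groups the field names by what they can be used for. The group labels are this sketch's own, not Solr terminology, and treating a missing attribute as "true" is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <fields> block above.
FIELDS_XML = """
<fields>
    <field name="id" type="integer" indexed="true" stored="true" required="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="date" type="date" indexed="false" stored="true" />
    <field name="content" type="text" indexed="true" stored="false" />
    <field name="all" type="text" indexed="true" stored="false" multiValued="true" />
</fields>
"""


def classify_fields(xml_text):
    """Group field names by whether they are searchable, returnable, or both."""
    roles = {"search_and_return": [], "search_only": [], "return_only": []}
    for field in ET.fromstring(xml_text).iter("field"):
        # Assumption of this sketch: a missing attribute counts as "true".
        indexed = field.get("indexed", "true") == "true"
        stored = field.get("stored", "true") == "true"
        if indexed and stored:
            roles["search_and_return"].append(field.get("name"))
        elif indexed:
            roles["search_only"].append(field.get("name"))
        elif stored:
            roles["return_only"].append(field.get("name"))
    return roles


if __name__ == "__main__":
    for role, names in classify_fields(FIELDS_XML).items():
        print(role, names)
```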

   3. It is a good idea to define a catch-all copy field and copy every full-text field into it, so that a single field can be searched:
       The copy settings look like this:
       <copyField source="name" dest="all" />
       <copyField source="summary" dest="all" />
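
The effect of these two rules can be pictured with a small Python sketch. It models a document as a dict of field name to list of values and is only an illustration of the copying step, not of how Solr stores or analyzes the result; the sample document is made up.

```python
# copyField rules from the snippet above, as (source, dest) pairs.
COPY_RULES = [("name", "all"), ("summary", "all")]


def apply_copy_fields(doc, rules):
    """Return a copy of doc with each source field's values appended to dest."""
    result = {key: list(values) for key, values in doc.items()}
    for source, dest in rules:
        if source in doc:
            # Several sources feed the same dest, which is why "all" is
            # declared multiValued in the fields block.
            result.setdefault(dest, []).extend(doc[source])
    return result


if __name__ == "__main__":
    doc = {"name": ["Solr in brief"], "summary": ["How schema.xml is laid out"]}
    print(apply_copy_fields(doc, COPY_RULES)["all"])
```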


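   4. Dynamic fields cover fields that have no fixed name; they are declared with dynamicField.
       For example, if name is *_i and type is int, then any field whose name ends in _i, such as name_i or school_i, is treated as matching this definition.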
       <dynamicField name="*_i" type="int" indexed="true" stored="true" />
       <dynamicField name="*_s" type="string" indexed="true" stored="true" />
       <dynamicField name="*_l" type="long"   indexed="true" stored="true" />
       <dynamicField name="*_t" type="text" indexed="true" stored="true" />
       <dynamicField name="*_b" type="boolean" indexed="true" stored="true" />
       <dynamicField name="*_f" type="float" indexed="true" stored="true" />
       <dynamicField name="*_d" type="double" indexed="true" stored="true" />
       <dynamicField name="*_dt" type="date" indexed="true" stored="true" />


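Notes from the comments in schema.xml:
   1. To improve performance, consider the following measures:
       Set stored to false on every field that is only searched and never returned in results (especially large fields).
       Set indexed to false on every field that is only returned in results and never searched.
       Remove all unnecessary copyField declarations.
       To minimize index size and keep searches efficient, set indexed to false on all general text fields, use copyField to copy them into one catch-all text field, and search against that field.
       For maximum indexing performance, use a Java client that streams updates to Solr.
       Run the JVM in server mode, and use a higher log level so that less is logged.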
   2. <schema name="example" version="1.2">
       name: a name identifying this schema
       version: the schema version, 1.2 in this example

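   3. fieldType
       <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
       name: just an identifier.
       class and the other attributes determine how the fieldType actually behaves. (A class name starting with solr is resolved against Solr's own packages; solr.StrField, for instance, is org.apache.solr.schema.StrField.)
       Optional attributes:
       sortMissingLast and sortMissingFirst apply to types that sort internally as strings (including string, boolean, sint, slong, sfloat, sdouble, pdate).
       sortMissingLast="true": documents without the field sort after documents with it, regardless of the requested sort order.
       sortMissingFirst="true": the reverse; documents without the field sort before documents with it.
       Both default to false.
       StrField values are not analyzed; they are indexed and stored verbatim.
       StrField and TextField both accept an optional compressThreshold attribute, which sets the minimum size (in chars) a value must reach before it is compressed.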
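
The two sortMissing flags are easiest to see on a tiny example. This Python sketch sorts dict-based documents the way the description above reads. It is a model of the documented behavior, not Solr code, and the branch for both flags being false follows the default described in the example schema's comments (missing values first when ascending, last when descending).

```python
def sort_docs(docs, field, descending=False,
              sort_missing_last=False, sort_missing_first=False):
    """Sort docs by field, placing docs that lack the field as configured."""
    present = [doc for doc in docs if field in doc]
    missing = [doc for doc in docs if field not in doc]
    present.sort(key=lambda doc: doc[field], reverse=descending)
    if sort_missing_last:
        return present + missing   # missing docs last in either direction
    if sort_missing_first:
        return missing + present   # missing docs first in either direction
    # Both flags false: missing sorts first ascending, last descending.
    return present + missing if descending else missing + present


if __name__ == "__main__":
    docs = [{"id": 1, "price": 5}, {"id": 2}, {"id": 3, "price": 2}]
    ordered = sort_docs(docs, "price", descending=True, sort_missing_last=True)
    print([doc["id"] for doc in ordered])
```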
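       <fieldType name="text" class="solr.TextField" positionIncrementGap="100" >
       solr.TextField lets users customize indexing and querying through analyzers; an analyzer consists of one tokenizer and any number of filters.
       positionIncrementGap: optional; sets the position gap inserted between multiple values of this type within one document, to prevent false phrase matches across values.
       name: the field type name
       class: the Java class name
       indexed: defaults to true. Marks the data as searchable and sortable. If the data is not indexed, stored should be true.
       stored: defaults to true. Marks the field as suitable for inclusion in search results. If the data is not stored, indexed should be true.
       sortMissingLast: documents without the field sort after documents that have it
       sortMissingFirst: documents without the field sort before documents that have it
       omitNorms: set to true when field length should not affect scoring and no index-time boost is applied. Ordinary text fields usually leave this false.
       termVectors: set to true if the field is used for MoreLikeThis or highlighting.
       compressed: the field is stored compressed. This may slow indexing and searching but saves storage space. Only StrField and TextField can be compressed, and it is usually worthwhile for values longer than 200 characters.
       multiValued: set to true when the field can have more than one value.
       positionIncrementGap: used together with multiValued; sets the number of virtual positions inserted between successive values.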
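
Why the gap matters for phrase matching can be shown with a simplified position model. In this Python sketch, each token of a multiValued field gets an integer position and a two-word phrase matches only when the positions are adjacent. The position arithmetic is a simplification of what Lucene does, and the sample values are made up.

```python
def assign_positions(values, position_increment_gap=0):
    """Return (token, position) pairs for all values of one multiValued field."""
    positions = []
    next_pos = 0
    for index, value in enumerate(values):
        if index > 0:
            next_pos += position_increment_gap  # virtual gap between values
        for token in value.lower().split():
            positions.append((token, next_pos))
            next_pos += 1
    return positions


def phrase_matches(positions, first, second):
    """True if `second` occurs exactly one position after `first`."""
    by_token = {}
    for token, pos in positions:
        by_token.setdefault(token, set()).add(pos)
    return any(pos + 1 in by_token.get(second, set())
               for pos in by_token.get(first, set()))


if __name__ == "__main__":
    values = ["John Smith", "Mary Jones"]
    for gap in (0, 100):
        matched = phrase_matches(assign_positions(values, gap), "smith", "mary")
        print("gap", gap, "phrase 'smith mary' matches:", matched)
```

With a gap of 0, the last token of one value sits right next to the first token of the next, so the phrase "smith mary" matches even though the two words come from different values. A gap of 100 pushes the values apart and the false match disappears.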
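       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       Splits on whitespace only, so tokens match exactly.
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
       Splits and matches on hyphens, letter/digit boundaries, and non-alphanumeric characters, so that "wifi" or "wi fi" can both match "Wi-Fi".
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
       Synonym expansion.
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
       Removes stopwords; enablePositionIncrements="true" leaves a position gap where a stopword was removed, so phrases do not match across it.
       stopword: a word ignored during analysis (both at index time and at query time), such as the common words "is" and "this". The list is maintained in conf/stopwords.txt.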
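
The "Wi-Fi" claim can be checked with a rough Python imitation of the word-delimiter step. This is a sketch under simplifying assumptions, not the real filter: it splits on non-alphanumeric characters, lowercase-to-uppercase changes, and letter/digit boundaries, optionally appends the concatenation of the parts (as catenateWords="1" does at index time), and then lowercases. The function names and the order of the emitted tokens are the sketch's own.

```python
import re


def word_delimiter(token, catenate_words=False):
    """Split one token into parts, optionally adding the joined form."""
    # Mark lowercase-to-uppercase changes (splitOnCaseChange).
    marked = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    # Mark letter/digit boundaries.
    marked = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", marked)
    # Everything that is not a letter or digit acts as a delimiter.
    parts = re.findall(r"[A-Za-z0-9]+", marked)
    result = list(parts)
    if catenate_words and len(parts) > 1:
        result.append("".join(parts))
    return result


def index_terms(text):
    """Index-time sketch: whitespace split, delimit with catenation, lowercase."""
    return [part.lower() for token in text.split()
            for part in word_delimiter(token, catenate_words=True)]


def query_terms(text):
    """Query-time sketch: same chain, but without catenation."""
    return [part.lower() for token in text.split()
            for part in word_delimiter(token, catenate_words=False)]


if __name__ == "__main__":
    indexed = set(index_terms("Wi-Fi"))
    for query in ("wifi", "wi fi"):
        print(query, "matches:", set(query_terms(query)) <= indexed)
```

Catenation is switched on only at index time in the configuration shown earlier, which is what lets the single query token "wifi" find a document that contained "Wi-Fi".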

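   4. fields
       <field name="id" type="string" indexed="true" stored="true" required="true" />
       name: just an identifier.
       type: one of the types defined earlier.
       indexed: whether the field is indexed (required for searching and sorting on it)
       stored: whether the value is stored
       compressed: [false] whether to store the value gzip-compressed (only TextField and StrField can be compressed)
       multiValued: whether the field can contain multiple values
       omitNorms: whether to omit norms, which saves memory but disables length normalization and index-time boosting for the field. Only full-text fields and fields that need an index-time boost need norms.
       termVectors: [false] when true, term vectors are stored. A field used for similarity with MoreLikeThis should store term vectors.
       termPositions: stores position information with the term vector; this increases storage cost.
       termOffsets: stores offset information with the term vector; this increases storage cost.
       default: a value to use when a document is added without a value for this field.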

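       <field name="text" type="text" indexed="true" stored="false" multiValued="true" />
       A catch-all field that holds every searchable text field, populated through copyField.
       <copyField source="cat" dest="text" />
       <copyField source="name" dest="text" />
       <copyField source="manu" dest="text" />
       <copyField source="features" dest="text" />
       <copyField source="includes" dest="text" />
       When a document is indexed, the data in each source field (cat, for example) is copied into the text field.
       Uses:
       Searching the data of several fields together in one field, which speeds up search.
       Copying one field's data into another so the same data can be indexed in two different ways.
       <dynamicField name="*_i" type="int" indexed="true" stored="true" />
       If a field name matches no explicitly declared field, Solr tries to match it against the dynamic field patterns.
       "*" may appear only at the start or the end of a pattern.
       Longer patterns are matched first.
       If two patterns of equal length both match, the one defined first wins.
       <dynamicField name="*" type="ignored" multiValued="true" />
       If nothing above matches, this catch-all pattern applies; it needs a matching type definition (named ignored here), which handles the value as a string. (This normally does not happen.)
       Without it, a field name that matches nothing raises an error.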
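
The matching rules listed above can be written down as a short resolver. The Python sketch below follows those rules as stated (explicit fields first, then dynamic patterns, longer patterns first, declaration order as the tie-breaker). It is an illustration, not Solr's code; the function name and the *_en_s pattern in the demo are hypothetical.

```python
def resolve_field_type(name, fields, dynamic_fields):
    """Return the type for a field name, or None when nothing matches.

    fields: dict mapping explicit field names to types.
    dynamic_fields: list of (pattern, type) pairs in declaration order.
    """
    if name in fields:
        return fields[name]
    # Longer patterns are tried first; sorted() is stable, so patterns of
    # equal length keep their declaration order.
    by_length = sorted(dynamic_fields, key=lambda item: -len(item[0]))
    for pattern, field_type in by_length:
        # "*" may only lead or trail a pattern.
        if pattern.startswith("*") and name.endswith(pattern[1:]):
            return field_type
        if pattern.endswith("*") and name.startswith(pattern[:-1]):
            return field_type
    return None


if __name__ == "__main__":
    fields = {"id": "string"}
    dynamic = [("*_i", "int"), ("*_s", "string"),
               ("*_en_s", "text"),   # hypothetical longer pattern
               ("*", "ignored")]
    for name in ("id", "school_i", "title_en_s", "anything_else"):
        print(name, "->", resolve_field_type(name, fields, dynamic))
```

Returning None stands in for the error Solr raises when no catch-all pattern is declared.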

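   5. Other tags
       <uniqueKey>id</uniqueKey>
       The unique identifier for a document. Every document must supply this field (unless the field is marked required="false"); otherwise Solr reports an error at index time.
       <defaultSearchField>text</defaultSearchField>
       The field searched by default when a query does not name a specific field.
       <solrQueryParser defaultOperator="OR" />
       Sets the logic applied between query terms; it can be AND or OR.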
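
The difference between the two operators comes down to "any term" versus "every term". A minimal Python sketch, assuming a query of bare terms with no explicit operators and a document reduced to a set of terms (both simplifications for illustration):

```python
def matches(doc_terms, query_terms, default_operator="OR"):
    """Check whether a document's terms satisfy a bare multi-term query."""
    hits = [term in doc_terms for term in query_terms]
    if default_operator == "AND":
        return all(hits)   # every term must be present
    if default_operator == "OR":
        return any(hits)   # one matching term is enough
    raise ValueError("default_operator must be 'AND' or 'OR'")


if __name__ == "__main__":
    doc = {"solr", "schema", "fields"}
    query = ["solr", "analyzer"]
    print("OR :", matches(doc, query, "OR"))
    print("AND:", matches(doc, query, "AND"))
```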
