This article uses the inverted index of ElasticSearch (ES for short) to implement tag-based interest recommendation.
Operating system: Ubuntu 20.04
Docker version 19.03.8
ElasticSearch 7.X
A curl-style HTTP client; Insomnia is recommended
An ES GUI tool; appbaseio/dejavu is recommended
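The article index has a tags field that stores the tags associated with each article.
Each user has their own interest tags.
Interest recommendation matches the user's interest tags against the article tags. One interest tag hits N articles and several interest tags together hit M articles; these hit sets overlap, so some articles are hit more than once. The articles hit most often are the ones closest to the user's interests. The principle is simple; ES is used to make the query and the ranking fast.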
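The counting idea can be sketched in a few lines of Python before involving ES at all. This is only an illustration: the function name rank_by_tag_overlap and the article ids a1 to a4 are hypothetical, and the tags mirror the test data used in this article.

```python
from collections import Counter


def rank_by_tag_overlap(user_tags, articles):
    """Return (article_id, hit_count) pairs, most hits first."""
    wanted = set(user_tags)
    hits = Counter()
    for article_id, tags in articles.items():
        overlap = len(wanted & set(tags))
        if overlap:  # articles with no matching tag are not recommended
            hits[article_id] = overlap
    # most_common() sorts by hit count, highest first
    return hits.most_common()


# Hypothetical ids; tags copied from the sample documents
articles = {
    "a1": ["布料", "抹布", "褲子", "衣服", "生活"],
    "a2": ["啤酒", "米酒", "飲料", "餐飲", "生活"],
    "a3": ["火鍋", "自助餐", "外賣", "燒烤", "餐飲"],
    "a4": ["太陽", "月亮", "大海", "星星", "自然"],
}

ranking = rank_by_tag_overlap(["生活", "衣服", "火鍋"], articles)
print(ranking)
```

A linear scan like this touches every article on every request; the inverted index in ES avoids that by looking up only the articles that contain each tag.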
Install a single-node ES in Docker for testing:
docker run -d --name elasticsearch \
  -v /home/cherokee/docker-data/es-data:/usr/share/elasticsearch/data \
  -e http.cors.enabled=true \
  -e http.cors.allow-origin="*" \
  -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization \
  -e http.cors.allow-credentials=true \
  -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  successage/es-ik
This starts the ES service locally, reachable at http://localhost:9200
Create an index named rcmd, then define its mapping:
curl --request PUT \
  --url http://localhost:9200/rcmd

curl --request PUT \
  --url http://localhost:9200/rcmd/_mapping \
  --header 'content-type: application/json' \
  --data '{
    "properties": {
      "tags": { "type": "keyword", "store": true },
      "update_time": { "type": "date", "store": true }
    }
  }'
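The mapping has two fields:
tags: the article's interest tags. The keyword type means the values are matched exactly rather than analyzed for full-text search; the tags are stored as an array.
update_time: the update time. It adds an extra ranking criterion to the recommendation, since real projects usually rank by a combination of time and match degree.
Insert some data: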
curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ], "update_time": "2020-06-01T00:02:11.030" }'
Insert another document with the same tags but a different time; it becomes useful in a later example:
curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ], "update_time": "2020-07-01T00:02:11.030" }'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "啤酒", "米酒", "飲料", "餐飲", "生活" ], "update_time": "2020-06-02T00:02:11.030" }'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "火鍋", "自助餐", "外賣", "燒烤", "餐飲" ], "update_time": "2020-06-03T00:02:11.030" }'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "太陽", "月亮", "大海", "星星", "自然" ], "update_time": "2020-06-01T00:02:11.030" }'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "人類", "動物", "植物", "地球", "自然" ], "update_time": "2020-06-01T00:02:11.030" }'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{ "tags": [ "男人", "女人", "小孩", "老人", "人類" ], "update_time": "2020-06-02T00:02:11.030" }'
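The index now holds the seven documents inserted above. A first query matches articles against the three tags “生活” (life), “衣服” (clothes) and “火鍋” (hot pot):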
curl --request POST \
  --url http://localhost:9200/rcmd/_search \
  --header 'content-type: application/json' \
  --data '{
    "query": {
      "bool": {
        "should": [
          { "constant_score": { "boost": 1, "filter": { "match": { "tags": "生活" } } } },
          { "constant_score": { "boost": 1, "filter": { "match": { "tags": "衣服" } } } },
          { "constant_score": { "boost": 1, "filter": { "match": { "tags": "火鍋" } } } }
        ]
      }
    }
  }'
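The should clause returns any article that matches at least one of the three tags “生活”, “衣服” and “火鍋”. Because each clause is a constant_score query, every matched tag adds its boost to the score, so the more of these tags an article covers, the higher it scores. An article covering all three tags would score highest. The query results: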
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 4, "relation": "eq" },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "rcmd", "_type": "_doc", "_id": "brQO63MBTdXKc2eArv9A", "_score": 2.0,
        "_source": {
          "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ],
          "update_time": "2020-06-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "b7QP63MBTdXKc2eAPf_Y", "_score": 2.0,
        "_source": {
          "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ],
          "update_time": "2020-07-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "cLQQ63MBTdXKc2eA6_8v", "_score": 1.0,
        "_source": {
          "tags": [ "啤酒", "米酒", "飲料", "餐飲", "生活" ],
          "update_time": "2020-06-02T00:02:11.030"
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "cbQS63MBTdXKc2eAcP-N", "_score": 1.0,
        "_source": {
          "tags": [ "火鍋", "自助餐", "外賣", "燒烤", "餐飲" ],
          "update_time": "2020-06-03T00:02:11.030"
        }
      }
    ]
  }
}
Two articles cover two of the tags, “生活” and “衣服”, so they score 2 and rank first. This ordering basically meets the needs of interest matching.
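In an application the request body would normally be generated from the user's tag list rather than written by hand. A minimal sketch, assuming the body is later sent to the rcmd/_search endpoint by whatever HTTP client the project uses; the helper name build_tag_query is hypothetical.

```python
import json


def build_tag_query(tags):
    """Build a bool/should body with one constant_score clause per tag."""
    should = [
        {
            "constant_score": {
                "boost": 1,  # equal weight for every tag
                "filter": {"match": {"tags": tag}},
            }
        }
        for tag in tags
    ]
    return {"query": {"bool": {"should": should}}}


body = build_tag_query(["生活", "衣服", "火鍋"])
# ensure_ascii=False keeps the Chinese tags readable in the printed JSON
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON has the same structure as the --data payload of the curl command above.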
In real projects a user's interest tags usually carry different weights. Suppose the user's interest tags are ["火鍋","衣服","生活"], with earlier tags weighted higher. The query then has to assign a weight to each keyword. All boost values in the query above are 1; now adjust the weights as required and query again.
curl --request POST \
  --url http://localhost:9200/rcmd/_search \
  --header 'content-type: application/json' \
  --data '{
    "query": {
      "bool": {
        "should": [
          { "constant_score": { "boost": 1, "filter": { "match": { "tags": "生活" } } } },
          { "constant_score": { "boost": 4, "filter": { "match": { "tags": "衣服" } } } },
          { "constant_score": { "boost": 6, "filter": { "match": { "tags": "火鍋" } } } }
        ]
      }
    }
  }'
The three tags “火鍋”, “衣服” and “生活” get the weights 6, 4 and 1 respectively. The query results:
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 4, "relation": "eq" },
    "max_score": 6.0,
    "hits": [
      {
        "_index": "rcmd", "_type": "_doc", "_id": "cbQS63MBTdXKc2eAcP-N", "_score": 6.0,
        "_source": {
          "tags": [ "火鍋", "自助餐", "外賣", "燒烤", "餐飲" ],
          "update_time": "2020-06-03T00:02:11.030"
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "brQO63MBTdXKc2eArv9A", "_score": 5.0,
        "_source": {
          "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ],
          "update_time": "2020-06-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "b7QP63MBTdXKc2eAPf_Y", "_score": 5.0,
        "_source": {
          "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ],
          "update_time": "2020-07-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "cLQQ63MBTdXKc2eA6_8v", "_score": 1.0,
        "_source": {
          "tags": [ "啤酒", "米酒", "飲料", "餐飲", "生活" ],
          "update_time": "2020-06-02T00:02:11.030"
        }
      }
    ]
  }
}
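The article containing “火鍋” now ranks first. The articles containing “衣服” and “生活” still match two tags, but their combined weight (4 + 1 = 5) is lower than 6, so they drop to second and third place.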
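The scores in this response can be checked by hand: a bool query adds up the scores of its matching should clauses, and each constant_score clause contributes exactly its boost. The sketch below reproduces that arithmetic offline; the function name weighted_score and the short document labels are made up for illustration.

```python
def weighted_score(article_tags, boosts):
    """Sum the boost of every user tag that the article carries."""
    return sum(boost for tag, boost in boosts.items() if tag in article_tags)


# Weights taken from the query above
boosts = {"火鍋": 6, "衣服": 4, "生活": 1}

# Hypothetical labels for three of the sample documents
docs = {
    "hotpot": ["火鍋", "自助餐", "外賣", "燒烤", "餐飲"],
    "clothes": ["布料", "抹布", "褲子", "衣服", "生活"],
    "beer": ["啤酒", "米酒", "飲料", "餐飲", "生活"],
}

scores = {name: weighted_score(tags, boosts) for name, tags in docs.items()}
print(scores)
```

The three values, 6, 5 and 1, agree with the _score fields returned by ES.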
curl --request POST \
  --url http://localhost:9200/rcmd/_search \
  --header 'content-type: application/json' \
  --data '{
    "query": {
      "function_score": {
        "query": {
          "bool": {
            "must": [
              { "range": { "update_time": { "gte": "2020-06-01", "lte": "2020-08-01" } } },
              {
                "bool": {
                  "should": [
                    { "term": { "tags": { "value": "火鍋", "boost": 2 } } },
                    { "term": { "tags": { "value": "衣服", "boost": 1 } } },
                    { "term": { "tags": { "value": "生活", "boost": 1 } } }
                  ]
                }
              }
            ]
          }
        },
        "functions": [
          { "gauss": { "update_time": { "scale": "3d", "origin": "2020-07-02T00:01:00.000" } } }
        ]
      }
    },
    "_source": { "includes": [ "tags", "update_time" ] },
    "from": 0,
    "size": 10
  }'
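The above is a relatively complete query. It first restricts update_time to a date range, then matches the tags; the tag clauses are combined with OR and carry different weights. Next, the gauss decay function scores each document by its update_time: the closer update_time is to origin, the higher the decay factor. "scale": "3d" means that a document 3 days away from origin has its factor decayed to 0.5, the default value of decay. By default function_score multiplies this factor with the tag-match score, so recency and match degree together determine the ranking. Note: for a date field, origin in gauss can be omitted, in which case it defaults to the current time. The final query results: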
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 4, "relation": "eq" },
    "max_score": 3.6649413,
    "hits": [
      {
        "_index": "rcmd", "_type": "_doc", "_id": "b7QP63MBTdXKc2eAPf_Y", "_score": 3.6649413,
        "_source": {
          "update_time": "2020-07-01T00:02:11.030",
          "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ]
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "cbQS63MBTdXKc2eAcP-N", "_score": 4.4511746E-28,
        "_source": {
          "update_time": "2020-06-03T00:02:11.030",
          "tags": [ "火鍋", "自助餐", "外賣", "燒烤", "餐飲" ]
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "cLQQ63MBTdXKc2eA6_8v", "_score": 1.764942E-30,
        "_source": {
          "update_time": "2020-06-02T00:02:11.030",
          "tags": [ "啤酒", "米酒", "飲料", "餐飲", "生活" ]
        }
      },
      {
        "_index": "rcmd", "_type": "_doc", "_id": "brQO63MBTdXKc2eArv9A", "_score": 2.8566082E-32,
        "_source": {
          "update_time": "2020-06-01T00:02:11.030",
          "tags": [ "布料", "抹布", "褲子", "衣服", "生活" ]
        }
      }
    ]
  }
}
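The two articles that match both “衣服” and “生活” end up first and last because of update_time: one was published on July 1, one day before origin, while the other dates from June 1, a month before origin, so its score decays to almost nothing. The two articles in the middle each match one tag, “火鍋” and “生活” respectively. The “火鍋” article ranks ahead for two reasons: its keyword carries the higher weight, and it is one day newer. With a scale of 3 days, a one-day gap at a distance of about a month from origin still changes the decay factor by roughly two orders of magnitude, so the time difference contributes more here than the boost. With this, tag-based article matching that combines a time factor with the match score is in place.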
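The time factor applied to each document can be reproduced offline. The sketch below assumes the Gaussian formula given in the Elasticsearch function_score documentation, with decay at its default of 0.5 and offset at 0; the helper name gauss_decay and the dictionary labels are hypothetical.

```python
import math
from datetime import datetime


def gauss_decay(distance_days, scale_days, decay=0.5, offset_days=0.0):
    """Gaussian decay factor: 1.0 at the origin, `decay` at `scale` days away."""
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, distance_days - offset_days)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


# origin and update_time values copied from the query and the sample data
origin = datetime(2020, 7, 2, 0, 1, 0)
update_times = {
    "clothes_july": datetime(2020, 7, 1, 0, 2, 11, 30000),
    "hotpot": datetime(2020, 6, 3, 0, 2, 11, 30000),
    "beer": datetime(2020, 6, 2, 0, 2, 11, 30000),
    "clothes_june": datetime(2020, 6, 1, 0, 2, 11, 30000),
}

factors = {}
for name, updated in update_times.items():
    days = abs((origin - updated).total_seconds()) / 86400.0
    factors[name] = gauss_decay(days, scale_days=3.0)
    print(f"{name}: {days:.2f} days from origin, factor {factors[name]:.3e}")
```

Only the July 1 document keeps a factor close to 1 (about 0.93); the June documents are about a month from origin and their factors are vanishingly small, which is why their final scores are on the order of 1e-28 or lower. A larger scale would let the tag weights play a bigger part among older articles.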
The examples above have not been tested on very large data sets, so no concrete performance figures are available yet.