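Optimizing Multi-Field Search
Multi-field search falls into three scenarios:
- Best Fields: the score is taken from the single highest-scoring field
Use this when fields compete with each other yet are related. For example, with blog fields such as title and body, the score comes from the best-matching field.
- Most Fields: match several fields and return the sum of the per-field scores
A common technique for English content is to use the English analyzer on the main field, so that words are stemmed and synonyms can be added to match more documents, and to index the same text in a sub-field with the Standard analyzer for more precise matching. The other fields act as signals that raise the relevance of matching documents: the more fields match, the better.
- Cross Fields: match across fields, where the query terms are spread over several fields
For some entities, such as a person's name, an address, or book information, the information has to be identified across several fields, and each field is only one part of the whole. The goal is to find as many of the query words as possible in any of the listed fields.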
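Best Fields Search
A dis_max query returns every document that matches any of its queries, and uses the score of the best-matching query as the final score.
Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/8.14/query-dsl-dis-max-query.html
Example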
DELETE /blogs

PUT /blogs/_doc/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}

# Search for a brown fox
POST /blogs/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
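Question: the results are not what we expect. Why?
How bool should computes the score: it runs every should clause and adds up the scores of the clauses that match. Document 1 matches "brown" in both title and body, so its summed score can outrank document 2, even though only document 2 contains both "brown" and "fox" in a single field.
In this example, title and body compete with each other. Their scores should not simply be added together; instead, the score of the single best-matching field should be used.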
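To make the difference concrete, the following minimal Python sketch compares summing the per-field scores, as bool should does, with keeping only the best field's score. The scores are hypothetical numbers chosen for illustration, not real Elasticsearch output, and the function names are invented for this sketch.

```python
# Hypothetical per-field scores for the query "Brown fox".
# These numbers are made up for illustration; they are not Elasticsearch output.
scores = {
    "doc1": {"title": 0.6, "body": 0.2},  # "brown" matches in both fields
    "doc2": {"title": 0.0, "body": 0.7},  # "brown" and "fox" both match in body
}


def bool_should_score(field_scores):
    """Sum the scores of every matching clause (score > 0)."""
    return sum(s for s in field_scores.values() if s > 0)


def best_field_score(field_scores):
    """Keep only the score of the best-matching clause."""
    return max(field_scores.values())


def rank(score_fn):
    """Return document ids ordered from highest to lowest score."""
    return sorted(scores, key=lambda doc_id: score_fn(scores[doc_id]), reverse=True)


bool_ranking = rank(bool_should_score)
best_field_ranking = rank(best_field_score)
print("sum of fields:", bool_ranking)
print("best field only:", best_field_ranking)
```

Under these assumed numbers, summing puts doc1 first, while taking the best field puts doc2 first, which is the ordering the search for a brown fox is after.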
Using the dis_max query
POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
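The scoring can be tuned with the tie_breaker parameter
tie_breaker is a floating-point number between 0 and 1. A value of 0 means only the best match counts; a value of 1 means all clauses are equally important.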
Final score = score of the best-matching field + (scores of the other matching fields) * tie_breaker
POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}

POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ],
      "tie_breaker": 0.1
    }
  }
}
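The tie_breaker formula can be checked by hand. The sketch below is a plain Python rendering of that formula, not Elasticsearch code; the function name and the per-clause scores are assumptions chosen so that the arithmetic is easy to follow.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other clause scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# Hypothetical [title, body] scores for two documents (illustrative only).
doc_a = [0.5, 0.0]      # matches in one field only
doc_b = [0.4375, 0.25]  # weaker best field, but a second field also matches

score_a_plain = dis_max_score(doc_a)                  # tie_breaker = 0
score_b_plain = dis_max_score(doc_b)
score_a_tied = dis_max_score(doc_a, tie_breaker=0.5)
score_b_tied = dis_max_score(doc_b, tie_breaker=0.5)
score_b_full = dis_max_score(doc_b, tie_breaker=1.0)  # same as summing all clauses

print(score_a_plain, score_b_plain)
print(score_a_tied, score_b_tied)
print(score_b_full)
```

With tie_breaker at 0, doc_a wins on its best field alone. With a non-zero tie_breaker, the second matching field in doc_b contributes to its score and, with these assumed numbers, moves it ahead.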
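Using the best_fields query
The best_fields strategy takes the score of the best-matching field: final_score = max(score of the best-matching field, scores of the other matching fields)
With best_fields and tie_breaker=0.1: final_score = scores of the other matching fields * 0.1 + score of the best-matching field
best_fields is the default type of multi_match, so it does not have to be specified. It is equivalent to a dis_max query.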
POST /blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Brown fox",
      "fields": ["title", "body"],
      "tie_breaker": 0.2
    }
  }
}
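Since best_fields is described as equivalent to dis_max, it can help to see the expansion written out. The following sketch builds the dis_max request body that corresponds to a best_fields multi_match, one match clause per field. The helper name is hypothetical, and the sketch ignores options such as per-field boosts; it only produces a Python dict and does not talk to a cluster.

```python
import json


def best_fields_as_dis_max(query_text, fields, tie_breaker=0.0):
    """Build a dis_max query body with one match clause per field."""
    return {
        "query": {
            "dis_max": {
                "queries": [{"match": {field: query_text}} for field in fields],
                "tie_breaker": tie_breaker,
            }
        }
    }


body = best_fields_as_dis_max("Brown fox", ["title", "body"], tie_breaker=0.2)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into the console as the body of POST /blogs/_search to compare its results with the multi_match version.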
Example
PUT /employee
{
  "settings": {
    "index": {
      "analysis.analyzer.default.type": "ik_max_word"
    }
  }
}

POST /employee/_bulk
{"index":{"_id":1}}
{"empId":"1","name":"员工001","age":20,"sex":"男","mobile":"19000001111","salary":23343,"deptName":"技术部","address":"湖北省武汉市洪山区光谷大厦","content":"i like to write best elasticsearch article"}
{"index":{"_id":2}}
{"empId":"2","name":"员工002","age":25,"sex":"男","mobile":"19000002222","salary":15963,"deptName":"销售部","address":"湖北省武汉市江汉路","content":"i think java is the best programming language"}
{"index":{"_id":3}}
{"empId":"3","name":"员工003","age":30,"sex":"男","mobile":"19000003333","salary":20000,"deptName":"技术部","address":"湖北省武汉市经济开发区","content":"i am only an elasticsearch beginner"}
{"index":{"_id":4}}
{"empId":"4","name":"员工004","age":20,"sex":"女","mobile":"19000004444","salary":15600,"deptName":"销售部","address":"湖北省武汉市沌口开发区","content":"elasticsearch and hadoop are all very good solution, i am a beginner"}
{"index":{"_id":5}}
{"empId":"5","name":"员工005","age":20,"sex":"男","mobile":"19000005555","salary":19665,"deptName":"测试部","address":"湖北省武汉市东湖隧道","content":"spark is best big data solution based on scala, an programming language similar to java"}
{"index":{"_id":6}}
{"empId":"6","name":"员工006","age":30,"sex":"女","mobile":"19000006666","salary":30000,"deptName":"技术部","address":"湖北省武汉市江汉路","content":"i like java developer"}
{"index":{"_id":7}}
{"empId":"7","name":"员工007","age":60,"sex":"女","mobile":"19000007777","salary":52130,"deptName":"测试部","address":"湖北省黄冈市边城区","content":"i like elasticsearch developer"}
{"index":{"_id":8}}
{"empId":"8","name":"员工008","age":19,"sex":"女","mobile":"19000008888","salary":60000,"deptName":"技术部","address":"湖北省武汉市江汉大学","content":"i like spark language"}
{"index":{"_id":9}}
{"empId":"9","name":"员工009","age":40,"sex":"男","mobile":"19000009999","salary":23000,"deptName":"销售部","address":"河南省郑州市郑州大学","content":"i like java developer"}
{"index":{"_id":10}}
{"empId":"10","name":"张湖北","age":35,"sex":"男","mobile":"19000001010","salary":18000,"deptName":"测试部","address":"湖北省武汉市东湖高新","content":"i like java developer, i also like elasticsearch"}
{"index":{"_id":11}}
{"empId":"11","name":"王河南","age":61,"sex":"男","mobile":"19000001011","salary":10000,"deptName":"销售部","address":"河南省开封市河南大学","content":"i am not like java"}
{"index":{"_id":12}}
{"empId":"12","name":"张大学","age":26,"sex":"女","mobile":"19000001012","salary":11321,"deptName":"测试部","address":"河南省开封市河南大学","content":"i am java developer, java is good"}
{"index":{"_id":13}}
{"empId":"13","name":"李江汉","age":36,"sex":"男","mobile":"19000001013","salary":11215,"deptName":"销售部","address":"河南省郑州市二七区","content":"i like java and java is very best, i like it, do you like java"}
{"index":{"_id":14}}
{"empId":"14","name":"王技术","age":45,"sex":"女","mobile":"19000001014","salary":16222,"deptName":"测试部","address":"河南省郑州市金水区","content":"i like c++"}
{"index":{"_id":15}}
{"empId":"15","name":"张测试","age":18,"sex":"男","mobile":"19000001015","salary":20000,"deptName":"技术部","address":"河南省郑州市高新开发区","content":"i think spark is good"}

GET /employee/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "best_fields",
      "fields": [
        "content",
        "address"
      ]
    }
  },
  "size": 15
}

# Inspect how the score of document 3 is computed
GET /employee/_explain/3
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "best_fields",
      "fields": [
        "content",
        "address"
      ]
    }
  }
}

GET /employee/_explain/3
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "best_fields",
      "fields": [
        "content",
        "address"
      ],
      "tie_breaker": 0.1
    }
  }
}
Most Fields Search
The most_fields strategy adds up the scores of all matching fields, which is equivalent to a bool should query.
GET /employee/_explain/3
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "most_fields",
      "fields": [
        "content",
        "address"
      ]
    }
  }
}
Example
DELETE /titles

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

GET /titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}
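The broadly matching title field pulls in as many documents as possible, which improves recall, while the title.std field acts as a signal that moves the more relevant documents to the top of the results.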
GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields",
      "fields": [
        "title",
        "title.std"
      ]
    }
  }
}
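The contribution of each field to the final score can be controlled with a custom boost value. For example, boosting the title field makes it more important and at the same time reduces the influence of the other signal fields: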
# Increase the weight of title
GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields",
      "fields": [
        "title^10",
        "title.std"
      ]
    }
  }
}
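The effect of a boost such as title^10 on a most_fields score can be sketched in a few lines of Python. This is a simplification that treats the boost as a plain multiplier on the field's score; the function names and the per-field scores are assumptions for illustration, not Elasticsearch output.

```python
def parse_field_spec(spec):
    """Split 'title^10' into ('title', 10.0); a bare field name gets boost 1.0."""
    name, sep, boost = spec.partition("^")
    return name, (float(boost) if sep else 1.0)


def most_fields_score(field_scores, field_specs):
    """Sum the per-field scores, each multiplied by its boost."""
    total = 0.0
    for spec in field_specs:
        name, boost = parse_field_spec(spec)
        total += boost * field_scores.get(name, 0.0)
    return total


# Hypothetical per-field scores for one document (illustrative only).
field_scores = {"title": 0.5, "title.std": 0.25}

plain = most_fields_score(field_scores, ["title", "title.std"])
boosted = most_fields_score(field_scores, ["title^10", "title.std"])
print(plain, boosted)
```

With the boost applied, title.std contributes a much smaller share of the total, which is the sense in which boosting title weakens the other signal fields.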
Cross Fields Search
The query terms are spread across several fields. The behavior is similar to a combination of bool and dis_max.
DELETE /address

PUT /address
{
  "settings": {
    "index": {
      "analysis.analyzer.default.type": "ik_max_word"
    }
  }
}

PUT /address/_bulk
{ "index": { "_id": "1"} }
{"province": "湖南","city": "长沙"}
{ "index": { "_id": "2"} }
{"province": "湖南","city": "常德"}
{ "index": { "_id": "3"} }
{"province": "广东","city": "广州"}
{ "index": { "_id": "4"} }
{"province": "湖南","city": "邵阳"}

# most_fields does not give the expected result here: operator is applied
# to each field separately, not across the fields as a group
GET /address/_search
{
  "query": {
    "multi_match": {
      "query": "湖南常德",
      "type": "most_fields",
      "fields": ["province", "city"]
    }
  }
}

# cross_fields applies operator across the fields as a group
# One advantage over copy_to is that individual fields can be boosted at search time.
GET /address/_search
{
  "query": {
    "multi_match": {
      "query": "湖南常德",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["province", "city"]
    }
  }
}
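The reason the "and" operator behaves differently under cross_fields can be shown with a small matching sketch. It contrasts a field-centric check (all terms must be found in the same field) with a term-centric check (each term may be found in any field). The token lists are assumptions about what the analyzer produces for these values, and the function names are invented for this sketch; real analysis with ik_max_word may emit additional tokens.

```python
# Assumed tokens per field for each document (illustrative only).
docs = {
    "1": {"province": ["湖南"], "city": ["长沙"]},
    "2": {"province": ["湖南"], "city": ["常德"]},
    "3": {"province": ["广东"], "city": ["广州"]},
    "4": {"province": ["湖南"], "city": ["邵阳"]},
}
query_terms = ["湖南", "常德"]  # assumed analyzer output for "湖南常德"


def field_centric_and(doc, terms):
    """operator=and per field: one single field must contain every term."""
    return any(all(t in tokens for t in terms) for tokens in doc.values())


def term_centric_and(doc, terms):
    """operator=and across fields: every term must appear in at least one field."""
    return all(any(t in tokens for tokens in doc.values()) for t in terms)


field_centric_hits = [i for i, d in docs.items() if field_centric_and(d, query_terms)]
term_centric_hits = [i for i, d in docs.items() if term_centric_and(d, query_terms)]
print(field_centric_hits)
print(term_centric_hits)
```

Under these assumptions, no single field holds both terms, so the field-centric check matches nothing, while the term-centric check matches only the document whose province and city together cover the query.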
The same problem can also be solved with copy_to, at the cost of extra storage.
DELETE /address

# The copy_to parameter copies the values of several fields into a group field,
# which can then be queried as a single field
PUT /address
{
  "mappings": {
    "properties": {
      "province": {
        "type": "keyword",
        "copy_to": "full_address"
      },
      "city": {
        "type": "text",
        "copy_to": "full_address"
      }
    }
  },
  "settings": {
    "index": {
      "analysis.analyzer.default.type": "ik_max_word"
    }
  }
}

PUT /address/_bulk
{ "index": { "_id": "1"} }
{"province": "湖南","city": "长沙"}
{ "index": { "_id": "2"} }
{"province": "湖南","city": "常德"}
{ "index": { "_id": "3"} }
{"province": "广东","city": "广州"}
{ "index": { "_id": "4"} }
{"province": "湖南","city": "邵阳"}

GET /address/_search
{
  "query": {
    "match": {
      "full_address": {
        "query": "湖南常德",
        "operator": "and"
      }
    }
  }
}
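The copy_to approach can be pictured as building one combined field at index time and then running an ordinary single-field "and" match against it. The sketch below imitates that idea in plain Python. The token lists and helper names are assumptions for illustration, and it does not model how Elasticsearch actually stores or analyzes the copied values.

```python
def with_full_address(doc):
    """Return a copy of doc with a combined full_address token list added."""
    combined = dict(doc)
    combined["full_address"] = doc["province"] + doc["city"]
    return combined


def match_and(tokens, terms):
    """Single-field match with operator=and: every term must be present."""
    return all(t in tokens for t in terms)


# Assumed tokens per field (illustrative only).
raw_docs = {
    "1": {"province": ["湖南"], "city": ["长沙"]},
    "2": {"province": ["湖南"], "city": ["常德"]},
}
indexed = {doc_id: with_full_address(d) for doc_id, d in raw_docs.items()}

terms = ["湖南", "常德"]  # assumed analyzer output for "湖南常德"
hits = [doc_id for doc_id, d in indexed.items() if match_and(d["full_address"], terms)]
print(hits)
```

Each indexed document now carries its tokens twice, once in the original fields and once in full_address, which is where the extra storage comes from.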