9. Optimizing Multi-Field Search in Elasticsearch

Author: 微信小助手

Published: 2025-01-01T21:43:23

Optimizing Multi-Field Search Scenarios

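Multi-field search covers three scenarios:

  • Best Fields: of all the fields searched, the score of the best-matching field is returned

Use this when fields compete with each other yet are related. For a blog's title and body fields, for example, the score should come from the field that matches best.

  • Most Fields: match several fields and return the sum of the per-field scores

A common technique for English content is to index the main field with stemming and synonyms (for example, with the English analyzer) so that it matches more documents, and to index the same text in a subfield with the Standard analyzer for more precise matching. The additional fields act as signals that raise the relevance of matching documents: the more fields match, the better.

  • Cross Fields: match across fields, where the terms being searched for are spread over several fields

For some entities, such as a person's name, an address, or book information, the information has to be identified across several fields, and each field is only one part of the whole. The goal is to find as many of the query terms as possible in any of the listed fields.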

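Best Fields Search

Return every document that matches any of the queries, and use the score of the best-matching field as the final score.

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/8.14/query-dsl-dis-max-query.html

Example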

DELETE /blogs

PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

# Search for a brown fox
POST /blogs/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}


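Question: the results are not what we expected. Why?

How bool should computes the score:

  • Run both queries in the should clause
  • Add up the scores of the two queries
  • Multiply by the number of matching clauses
  • Divide by the total number of clauses

In the example above, title and body compete with each other, so their scores should not simply be added together. Instead, the score of the single best-matching field should be used. Note that the last two steps are the coordination factor of the classic similarity, which recent Elasticsearch versions no longer apply; the per-clause scores are still summed, so the problem remains.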
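
The steps listed above can be sketched in a few lines of Python and compared with a best-field score. This is only an illustration: the function names are hypothetical and the per-field scores are made-up numbers, not real Elasticsearch output.

```python
def bool_should_score(clause_scores):
    """Sum the matching clause scores, multiply by the number of matching
    clauses, and divide by the total number of clauses.
    None marks a clause that did not match."""
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


def best_field_score(clause_scores):
    """Use only the highest score among the matching clauses."""
    matched = [s for s in clause_scores if s is not None]
    return max(matched, default=0.0)


# Assumed scores in the order [title, body]; purely illustrative.
doc1 = [0.6, 0.5]    # a weak match in both fields
doc2 = [None, 0.9]   # a strong match in body only

print(bool_should_score(doc1), bool_should_score(doc2))
print(best_field_score(doc1), best_field_score(doc2))
```

With these assumed numbers, adding the scores ranks the document with two weak matches first, while the best-field score ranks the document with the single strong match first.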


Using the dis_max query

POST /blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

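The result can be tuned with the tie_breaker parameter.

tie_breaker is a floating-point number between 0 and 1. A value of 0 uses only the best match; a value of 1 treats all clauses as equally important.

  1. Take the _score of the best-matching clause.
  2. Multiply the score of every other matching clause by tie_breaker.
  3. Sum the scores above and normalize.

Final score = score of the best-matching field + (sum of the scores of the other matching fields) * tie_breaker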
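
As a minimal sketch, the formula above can be written as a small Python function. The name dis_max_score and the input scores are assumptions made for illustration, and the sketch leaves out any normalization step.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other
    matching clause scores. None marks a clause that did not match."""
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    others = sum(matched) - best
    return best + tie_breaker * others


# Assumed per-field scores, not real Elasticsearch output.
scores = [0.6, 0.5]
for tb in (0.0, 0.1, 1.0):
    print(tb, dis_max_score(scores, tb))
```

With tie_breaker set to 0 only the best field counts, and with 1 the result is the plain sum of all matching fields.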

# Without tie_breaker
POST /blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

# With tie_breaker
POST /blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.1
        }
    }
}

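Using the best_fields query

The best_fields strategy takes the score of the best-matching field: final_score = max(score of the best-matching field, scores of the other matching fields)

With best_fields and the parameter tie_breaker=0.1: final_score = score of the best-matching field + (scores of the other matching fields) * 0.1

best_fields is the default type, so it does not have to be specified. It is equivalent to the dis_max query.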

POST /blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Brown fox",
      "fields": ["title", "body"],
      "tie_breaker": 0.2
    }
  }
}
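
To illustrate the equivalence between best_fields and dis_max, the following Python sketch builds the dis_max request body that corresponds to a multi_match query of type best_fields. The helper best_fields_as_dis_max is hypothetical, and the sketch assumes plain field names without boosts such as title^10.

```python
import json


def best_fields_as_dis_max(query, fields, tie_breaker=0.0):
    """Build a dis_max query with one match clause per field."""
    return {
        "dis_max": {
            "queries": [{"match": {field: query}} for field in fields],
            "tie_breaker": tie_breaker,
        }
    }


body = {"query": best_fields_as_dis_max("Brown fox", ["title", "body"], 0.2)}
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after POST /blogs/_search to compare its results with the multi_match request.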

Example

PUT /employee
{
    "settings" : {
        "index" : {
            "analysis.analyzer.default.type": "ik_max_word"
        }
    }
}

POST /employee/_bulk
{"index":{"_id":1}}
{"empId":"1","name":"员工001","age":20,"sex":"男","mobile":"19000001111","salary":23343,"deptName":"技术部","address":"湖北省武汉市洪山区光谷大厦","content":"i like to write best elasticsearch article"}
{"index":{"_id":2}}
{"empId":"2","name":"员工002","age":25,"sex":"男","mobile":"19000002222","salary":15963,"deptName":"销售部","address":"湖北省武汉市江汉路","content":"i think java is the best programming language"}
{"index":{"_id":3}}
{"empId":"3","name":"员工003","age":30,"sex":"男","mobile":"19000003333","salary":20000,"deptName":"技术部","address":"湖北省武汉市经济开发区","content":"i am only an elasticsearch beginner"}
{"index":{"_id":4}}
{"empId":"4","name":"员工004","age":20,"sex":"女","mobile":"19000004444","salary":15600,"deptName":"销售部","address":"湖北省武汉市沌口开发区","content":"elasticsearch and hadoop are all very good solution, i am a beginner"}
{"index":{"_id":5}}
{"empId":"5","name":"员工005","age":20,"sex":"男","mobile":"19000005555","salary":19665,"deptName":"测试部","address":"湖北省武汉市东湖隧道","content":"spark is best big data solution based on scala, an programming language similar to java"}
{"index":{"_id":6}}
{"empId":"6","name":"员工006","age":30,"sex":"女","mobile":"19000006666","salary":30000,"deptName":"技术部","address":"湖北省武汉市江汉路","content":"i like java developer"}
{"index":{"_id":7}}
{"empId":"7","name":"员工007","age":60,"sex":"女","mobile":"19000007777","salary":52130,"deptName":"测试部","address":"湖北省黄冈市边城区","content":"i like elasticsearch developer"}
{"index":{"_id":8}}
{"empId":"8","name":"员工008","age":19,"sex":"女","mobile":"19000008888","salary":60000,"deptName":"技术部","address":"湖北省武汉市江汉大学","content":"i like spark language"}
{"index":{"_id":9}}
{"empId":"9","name":"员工009","age":40,"sex":"男","mobile":"19000009999","salary":23000,"deptName":"销售部","address":"河南省郑州市郑州大学","content":"i like java developer"}
{"index":{"_id":10}}
{"empId":"10","name":"张湖北","age":35,"sex":"男","mobile":"19000001010","salary":18000,"deptName":"测试部","address":"湖北省武汉市东湖高新","content":"i like java developer, i also like elasticsearch"}
{"index":{"_id":11}}
{"empId":"11","name":"王河南","age":61,"sex":"男","mobile":"19000001011","salary":10000,"deptName":"销售部","address":"河南省开封市河南大学","content":"i am not like java"}
{"index":{"_id":12}}
{"empId":"12","name":"张大学","age":26,"sex":"女","mobile":"19000001012","salary":11321,"deptName":"测试部","address":"河南省开封市河南大学","content":"i am java developer, java is good"}
{"index":{"_id":13}}
{"empId":"13","name":"李江汉","age":36,"sex":"男","mobile":"19000001013","salary":11215,"deptName":"销售部","address":"河南省郑州市二七区","content":"i like java and java is very best, i like it, do you like java"}
{"index":{"_id":14}}
{"empId":"14","name":"王技术","age":45,"sex":"女","mobile":"19000001014","salary":16222,"deptName":"测试部","address":"河南省郑州市金水区","content":"i like c++"}
{"index":{"_id":15}}
{"empId":"15","name":"张测试","age":18,"sex":"男","mobile":"19000001015","salary":20000,"deptName":"技术部","address":"河南省郑州市高新开发区","content":"i think spark is good"}

GET /employee/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "best_fields",
      "fields": [
        "content",
        "address"
      ]
    }
  },
  "size": 15
}

# Inspect how the score of document 3 is computed
GET /employee/_explain/3
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "best_fields",
      "fields": [
        "content",
        "address"
      ]
    }
  }
}

# The same explanation with tie_breaker
GET /employee/_explain/3
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "best_fields",
      "fields": [
        "content",
        "address"
      ],
      "tie_breaker": 0.1
    }
  }
}


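Using Most Fields Search

The most_fields strategy uses the accumulated score of all matching fields (it combines the scores of every field that matches). It is equivalent to the bool should query.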

GET /employee/_explain/3
{
  "query": {
    "multi_match": {
      "query": "elasticsearch beginner 湖北省 开封市",
      "type": "most_fields",
      "fields": [
        "content",
        "address"
      ]
    }
  }
}

Example

DELETE /titles

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

POST /titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

# The result does not match expectations: the english analyzer stems
# "barks"/"barking" and "dog"/"dogs" to the same terms, so both documents match
GET /titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

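Use the broadly matching title field to include as many documents as possible, which improves recall, and at the same time use the title.std field as a signal that pushes the more relevant documents to the top of the results.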

GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields",
      "fields": [
        "title",
        "title.std"
      ]
    }
  }
}

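The contribution of each field to the final score can be controlled with a custom boost value. For example, boosting the title field makes it more important and at the same time reduces the effect of the other signal fields: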
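
The effect of a boost on a most_fields score can be sketched as a weighted sum. This is a simplified model under stated assumptions: the function most_fields_score is hypothetical, the scores are made-up numbers, and a boost is treated as a plain multiplier on the field's score.

```python
def most_fields_score(field_scores, boosts=None):
    """Sum the scores of all matching fields, each multiplied by its
    boost (1.0 when no boost is given)."""
    boosts = boosts or {}
    return sum(score * boosts.get(field, 1.0)
               for field, score in field_scores.items())


# Assumed scores for one document; only matching fields are listed.
doc = {"title": 0.4, "title.std": 0.5}

print(most_fields_score(doc))                  # both fields weigh the same
print(most_fields_score(doc, {"title": 10}))   # title dominates the total
```

In the unboosted case title.std contributes more than half of the total. With a boost of 10 on title, its share becomes small, which is what reducing the effect of a signal field means here.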

# Increase the weight of title
GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields",
      "fields": [
        "title^10",
        "title.std"
      ]
    }
  }
}

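Cross Fields Search

The terms being searched for are spread across several fields. The behavior is similar to a combination of bool and dis_max.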

DELETE /address

PUT /address
{
    "settings" : {
        "index" : {
            "analysis.analyzer.default.type": "ik_max_word"
        }
    }
}

PUT /address/_bulk
{ "index": { "_id": "1"} }
{"province": "湖南","city": "长沙"}
{ "index": { "_id": "2"} }
{"province": "湖南","city": "常德"}
{ "index": { "_id": "3"} }
{"province": "广东","city": "广州"}
{ "index": { "_id": "4"} }
{"province": "湖南","city": "邵阳"}

# With most_fields the result is not what we expect. The operator is applied
# to each field separately, so it cannot require the terms across fields.
GET /address/_search
{
  "query": {
    "multi_match": {
      "query": "湖南常德",
      "type": "most_fields",
      "fields": ["province", "city"]
    }
  }
}

# cross_fields can be used instead; it supports operator across the fields.
# One advantage over copy_to is that individual fields can be boosted at search time.
GET /address/_search
{
  "query": {
    "multi_match": {
      "query": "湖南常德",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["province", "city"]
    }
  }
}
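
The difference between applying the and operator per field and applying it across fields can be sketched in Python. This is a simplified model: the function names are hypothetical, and it assumes the analyzer splits 湖南常德 into the two terms 湖南 and 常德.

```python
def field_centric_and(terms, doc):
    """Per-field matching: some single field must contain every term."""
    return any(all(term in tokens for term in terms)
               for tokens in doc.values())


def term_centric_and(terms, doc):
    """Cross-field matching: every term must appear in at least one field."""
    return all(any(term in tokens for tokens in doc.values())
               for term in terms)


# Assumed tokens per field for two of the documents above.
changde = {"province": {"湖南"}, "city": {"常德"}}
changsha = {"province": {"湖南"}, "city": {"长沙"}}
terms = ["湖南", "常德"]

print(field_centric_and(terms, changde))   # no single field holds both terms
print(term_centric_and(terms, changde))    # each term is found in some field
print(term_centric_and(terms, changsha))   # 常德 is missing from every field
```

Under this model, the per-field check rejects even the 常德 document, while the cross-field check accepts it and still rejects 长沙.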

The problem can also be solved with copy_to, but that requires extra storage space.

DELETE /address

# The copy_to parameter copies the values of several fields into a group field,
# which can then be queried as a single field
PUT /address
{
  "mappings" : {
      "properties" : {
        "province" : {
          "type" : "keyword",
          "copy_to": "full_address"
        },
        "city" : {
          "type" : "text",
          "copy_to": "full_address"
        }
      }
    },
    "settings" : {
        "index" : {
            "analysis.analyzer.default.type": "ik_max_word"
        }
    }
}

PUT /address/_bulk
{ "index": { "_id": "1"} }
{"province": "湖南","city": "长沙"}
{ "index": { "_id": "2"} }
{"province": "湖南","city": "常德"}
{ "index": { "_id": "3"} }
{"province": "广东","city": "广州"}
{ "index": { "_id": "4"} }
{"province": "湖南","city": "邵阳"}

GET /address/_search
{
  "query": {
    "match": {
      "full_address": {
        "query": "湖南常德",
        "operator": "and"
      }
    }
  }
}