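This document covers ElasticSearch core concepts, query syntax, aggregations, cluster operations, and performance tuning. It is meant as a day-to-day reference for developers, SREs, and data analysts.
ElasticSearch (ES for short) is a distributed search and analytics engine built on Apache Lucene. Its core design goals include:
| Core term | Relational database analogy | Description |
|---|---|---|
| Index | Database | Logical namespace holding a set of similar documents |
| Type | Table | Deprecated in ES 7.x and removed in 8.x; an index holds a single type, _doc |
| Document | Row | One JSON record, the smallest unit of search |
| Field | Column | A key-value pair inside a document |
| Mapping | Schema | Defines field types, analyzers, whether a field is indexed, and so on |
| Shard | Partition | Physical slice holding a subset of an index's data |
| Replica | Replica Set | Copy of a shard, used for read scaling and failure recovery |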
ES gets its search speed from the inverted index:
Documents:
Doc1: "ElasticSearch is powerful"
Doc2: "ElasticSearch is fast"
Doc3: "Lucene powers ElasticSearch"
Inverted index:
elasticsearch -> [Doc1, Doc2, Doc3]
powerful -> [Doc1]
fast -> [Doc2]
lucene -> [Doc3]
powers -> [Doc3]
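The listing above can be reproduced with a few lines of Python. This is a minimal sketch, not how Lucene stores its index: it assumes whitespace tokenization, lowercasing, and that "is" is dropped as a stop word (which is what the listing implies). The function name build_inverted_index is made up for illustration.

```python
from collections import defaultdict

# Assumption: "is" is treated as a stop word, matching the listing above.
STOP_WORDS = {"is"}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    "Doc1": "ElasticSearch is powerful",
    "Doc2": "ElasticSearch is fast",
    "Doc3": "Lucene powers ElasticSearch",
}
index = build_inverted_index(docs)
print(index["elasticsearch"])  # ['Doc1', 'Doc2', 'Doc3']
print(index["fast"])           # ['Doc2']
```

A term lookup is then a single dictionary access instead of a scan over every document, which is the property the inverted index is built for.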
Key mechanisms:
┌─────────────────────────────────────────┐
│                 Cluster                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  │
│  │ Node 1  │  │ Node 2  │  │ Node 3  │  │
│  │ [P0,R1] │  │ [P1,R2] │  │ [P2,R0] │  │
│  └─────────┘  └─────────┘  └─────────┘  │
└─────────────────────────────────────────┘
P = Primary Shard, R = Replica Shard
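How a document ends up on one of the primary shards can be sketched as a hash of the routing value (the document ID by default) taken modulo the number of primary shards. The snippet below is only an illustration of that idea: ES uses a Murmur3 hash internally and newer versions add a routing factor, while this sketch substitutes CRC32 from the standard library. The function name and the sample IDs are hypothetical.

```python
import zlib


def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a primary shard number for a routing key.

    Stand-in hash: CRC32 is used only for this sketch; ES itself hashes with Murmur3.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs, routed across the 3 primaries of the diagram above.
doc_ids = ["user-1", "user-2", "user-3", "user-4"]
placement = {doc_id: route_to_shard(doc_id, 3) for doc_id in doc_ids}
print(placement)
```

Because the shard number depends on the primary shard count, changing that count would send existing documents to different shards, which is why number_of_shards is fixed when the index is created.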
Shard allocation strategy:
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"custom_ik": {
"tokenizer": "ik_max_word",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "custom_ik",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"price": {
"type": "float"
},
"category": {
"type": "keyword"
},
"created_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
},
"tags": {
"type": "keyword"
}
}
}
}
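Key design principles:
- text is for full-text search (values are analyzed into tokens); keyword is for exact matching, sorting, and aggregations.
- Multi-fields index one value both ways: name (text) and name.keyword (keyword) in the mapping above.
PUT _ilm/policy/logs_policy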
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "30d"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"freeze": {}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
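To read the policy above at a glance, the sketch below maps an index age to the phase it would be in, using the min_age values from logs_policy. It is a simplification under stated assumptions: ages are given in whole days, and the age is counted from rollover (when rollover is used, min_age is measured from the rollover time rather than from index creation). The names PHASE_MIN_AGE_DAYS and phase_for_age are invented for this example.

```python
# min_age thresholds in days, copied from the logs_policy example above.
PHASE_MIN_AGE_DAYS = [("hot", 0), ("warm", 7), ("cold", 30), ("delete", 90)]


def phase_for_age(age_days: float) -> str:
    """Return the latest phase whose min_age has been reached."""
    current = PHASE_MIN_AGE_DAYS[0][0]
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            current = phase
    return current


for age in (0, 10, 45, 120):
    print(age, phase_for_age(age))
```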
GET /git/_search?q=author_name:"Hugo Gu"
URI search is handy for quick debugging but limited in features; use request body queries in production.
GET /git/_search
{
"size": 0,
"query": {
"term": {
"author_name": "Hugo Gu"
}
},
"aggregations": {
"avg_files": {
"avg": {
"field": "files"
}
}
}
}
| Query type | Purpose | Example |
|---|---|---|
| match | Full-text search; the query text is analyzed | { "match": { "title": "elasticsearch" } } |
| term | Exact match; the query text is not analyzed | { "term": { "status": "published" } } |
| range | Range query | { "range": { "age": { "gte": 18, "lte": 60 } } } |
| bool | Compound query | must, should, must_not, filter |
| wildcard | Wildcard pattern | { "wildcard": { "name": "elasti*" } } |
| prefix | Prefix match | { "prefix": { "name": "elas" } } |
| exists | Field existence | { "exists": { "field": "email" } } |
{
"query": {
"bool": {
"must": [
{ "match": { "title": "elasticsearch" } },
{ "range": { "date": { "gte": "2024-01-01" } } }
],
"should": [
{ "match": { "tag": "tutorial" } },
{ "match": { "tag": "guide" } }
],
"must_not": [
{ "match": { "status": "deleted" } }
],
"filter": [
{ "term": { "category": "tech" } }
]
}
}
}
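Key differences:
- must / should: run in query context and contribute to the relevance score (_score).
- filter / must_not: run in filter context; they are not scored and their results can be cached, which makes them faster and a good fit for exact filtering conditions.
Aggregations are one of ES's most powerful features, covering everything from simple statistics to complex data analysis.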
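| Type | Purpose | Examples |
|---|---|---|
| metric | Numeric calculations | avg, sum, max, min, stats, cardinality |
| bucket | Grouping documents | terms, date_histogram, range, filters |
| pipeline | Calculations over the results of other aggregations | derivative, moving_fn (moving_avg before 8.0), bucket_selector |
The following production-style aggregation query analyzes commit trends in a code repository (taken from the GrimoireLab scenario in the original content):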
GET /git/_search
{
"aggs": {
"commits_over_time": {
"date_histogram": {
"field": "grimoire_creation_date",
"fixed_interval": "30d",
"time_zone": "Asia/Shanghai",
"min_doc_count": 1
},
"aggs": {
"unique_authors": {
"cardinality": {
"field": "author_uuid",
"precision_threshold": 3000
}
},
"total_lines_added": {
"sum": {
"field": "lines_added"
}
},
"total_lines_removed": {
"sum": {
"field": "lines_removed"
}
}
}
}
},
"size": 0,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "*",
"analyze_wildcard": true,
"default_field": "*"
}
},
{
"range": {
"grimoire_creation_date": {
"gte": "2016-03-06",
"lte": "2023-03-06",
"format": "yyyy-MM-dd"
}
}
}
]
}
},
"timeout": "30000ms"
}
{
"aggregations": {
"commits_over_time": {
"buckets": [
{
"key_as_string": "2023-01-01",
"key": 1672531200000,
"doc_count": 150,
"unique_authors": {
"value": 12
},
"total_lines_added": {
"value": 45000
},
"total_lines_removed": {
"value": 12000
}
}
]
}
}
}
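On the client side, a response of this shape is usually flattened into one row per bucket. The sketch below does that for the sample response above and derives a net line count (added minus removed). The function name and the row keys are assumptions chosen for illustration; only the response structure comes from the example.

```python
# Sample response copied from the example above.
response = {
    "aggregations": {
        "commits_over_time": {
            "buckets": [
                {
                    "key_as_string": "2023-01-01",
                    "key": 1672531200000,
                    "doc_count": 150,
                    "unique_authors": {"value": 12},
                    "total_lines_added": {"value": 45000},
                    "total_lines_removed": {"value": 12000},
                }
            ]
        }
    }
}


def summarize_buckets(resp):
    """Flatten each date_histogram bucket into a plain dict with a derived net_lines value."""
    rows = []
    for bucket in resp["aggregations"]["commits_over_time"]["buckets"]:
        added = bucket["total_lines_added"]["value"]
        removed = bucket["total_lines_removed"]["value"]
        rows.append({
            "period": bucket["key_as_string"],
            "commits": bucket["doc_count"],
            "authors": bucket["unique_authors"]["value"],
            "net_lines": added - removed,
        })
    return rows


rows = summarize_buckets(response)
print(rows[0]["net_lines"])  # 33000
```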
{
"aggs": {
"by_month": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "month"
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
},
"price_stats": {
"stats": {
"field": "price"
}
},
"moving_avg_price": {
"moving_fn": {
"buckets_path": "avg_price",
"window": 5,
"script": "MovingFunctions.linearWeightedAvg(values)"
}
}
}
}
}
}
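The idea behind a linear-weighted moving average over a window of 5 buckets can be sketched in plain Python. This is the textbook form, where the oldest value in the window gets weight 1 and the newest gets the largest weight; the built-in ES function may differ in details such as normalization and which buckets fall inside the window, so treat the numbers as illustrative only. Both function names are hypothetical.

```python
import math


def linear_weighted_avg(window):
    """Weighted mean where the i-th value (1-based, oldest first) has weight i."""
    if not window:
        return math.nan
    weighted_sum = sum(weight * value for weight, value in enumerate(window, start=1))
    total_weight = sum(range(1, len(window) + 1))
    return weighted_sum / total_weight


def moving_linear_avg(values, window=5):
    """For each position, average the up-to-`window` values that precede it."""
    return [
        linear_weighted_avg(values[max(0, i - window):i])
        for i in range(len(values))
    ]


print(moving_linear_avg([10, 20, 30], window=2))
```

The first position has no preceding values, so its result is NaN in this sketch.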
Use script fields when a field value has to be computed at query time; scripts are written in the Painless scripting language:
{
"script_fields": {
"inverted_lines_removed": {
"script": {
"source": "return doc['lines_removed'].value * -1",
"lang": "painless"
}
},
"contribution_ratio": {
"script": {
"source": """
double added = doc['lines_added'].value;
double removed = doc['lines_removed'].value;
if (removed == 0) return added;
return added / removed;
""",
"lang": "painless"
}
}
}
}
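For readers less familiar with Painless, the logic of the two script fields above is mirrored here in Python. It is a plain translation for illustration, with the same zero guard as the script; the Python function names are not part of any ES API.

```python
def inverted_lines_removed(lines_removed: float) -> float:
    """Mirror of the first script: flip the sign of lines_removed."""
    return lines_removed * -1


def contribution_ratio(lines_added: float, lines_removed: float) -> float:
    """Mirror of the second script: added / removed, guarding against division by zero."""
    if lines_removed == 0:
        return lines_added
    return lines_added / lines_removed


print(contribution_ratio(45000, 12000))  # 3.75
print(contribution_ratio(10, 0))         # 10
```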
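Painless scripting notes:
- doc['field'] returns a list-like doc-values object; use .value to read a single value.
Use docvalue_fields to format returned date fields and avoid the default epoch_millis format: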
{
"docvalue_fields": [
{
"field": "author_date",
"format": "yyyy-MM-dd HH:mm:ss"
},
{
"field": "commit_date",
"format": "date_time"
}
]
}
{
"query": {
"match": {
"content": "elasticsearch"
}
},
"highlight": {
"fields": {
"content": {
"fragment_size": 150,
"number_of_fragments": 3,
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"]
}
}
}
}
Parameterize frequently used queries as search templates so they can be reused:
POST _scripts/author_search
{
"script": {
"lang": "mustache",
"source": {
"query": {
"bool": {
"must": [
{ "term": { "author_name": "{{author_name}}" } },
{ "range": { "grimoire_creation_date": { "gte": "{{start_date}}", "lte": "{{end_date}}" } } }
]
}
}
}
}
}
GET /git/_search/template
{
"id": "author_search",
"params": {
"author_name": "Hugo Gu",
"start_date": "2023-01-01",
"end_date": "2023-12-31"
}
}
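What the template call does with its params can be approximated locally. The sketch below substitutes each {{name}} placeholder in the template source and returns the resulting query body. It is a naive stand-in for Mustache rendering, which supports much more (sections, defaults, JSON helpers); the function name render is hypothetical.

```python
import json

# Template source copied from the stored script above.
TEMPLATE_SOURCE = {
    "query": {
        "bool": {
            "must": [
                {"term": {"author_name": "{{author_name}}"}},
                {"range": {"grimoire_creation_date": {
                    "gte": "{{start_date}}",
                    "lte": "{{end_date}}",
                }}},
            ]
        }
    }
}


def render(source: dict, params: dict) -> dict:
    """Replace {{name}} placeholders with parameter values and return the query body."""
    text = json.dumps(source)
    for name, value in params.items():
        # json.dumps(...)[1:-1] escapes quotes and backslashes inside the value.
        escaped = json.dumps(str(value))[1:-1]
        text = text.replace("{{" + name + "}}", escaped)
    return json.loads(text)


query = render(TEMPLATE_SOURCE, {
    "author_name": "Hugo Gu",
    "start_date": "2023-01-01",
    "end_date": "2023-12-31",
})
print(json.dumps(query, indent=2))
```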
# Cluster health
GET _cluster/health
# Node list and resource usage
GET _cat/nodes?v
# Shard allocation
GET _cat/shards?v
# Index statistics
GET _cat/indices?v
# Cluster statistics
GET _cluster/stats
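| API | Purpose |
|---|---|
| GET _cat/health | Cluster health (green/yellow/red) |
| GET _cat/nodes | Node information, load, memory |
| GET _cat/indices | Index list, document counts, store size |
| GET _cat/shards | Shard allocation and relocating state |
| GET _cat/segments | Segment information, useful when deciding on a force merge |
| GET _cat/pending_tasks | Tasks pending on the master node |
| GET _cat/thread_pool | Thread pool state; reveals rejected requests |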
# Change the replica count on a live index
PUT /my_index/_settings
{
"number_of_replicas": 2
}
# Cluster-level settings
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "85%",
"cluster.routing.allocation.disk.watermark.high": "90%"
}
}
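| Strategy | Configuration | Effect |
|---|---|---|
| Bulk writes | bulk API, 5-15 MB per batch | Fewer network round trips |
| Refresh interval | index.refresh_interval: "30s" | Fewer refreshes, higher throughput |
| Replica count | Set to 0 during the load, restore afterwards | Less replication overhead |
| No _all field | "index.query.default_field": "*" (the default) | _all was deprecated in 6.0 and removed in 7.0, so its indexing overhead no longer applies |
| Prefer auto-generated IDs | Omit the ID, or avoid fully random IDs when you must supply one | Cheaper ID lookups during indexing, less merge pressure |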
PUT /logs/_settings
{
"index": {
"refresh_interval": "30s",
"number_of_replicas": 0,
"translog.durability": "async"
}
}
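The batch-size guidance for the bulk API can be applied on the client by cutting the NDJSON body whenever it would exceed a byte budget. The sketch below only builds the request bodies and sends nothing; the function name bulk_batches, the 10 MB default, and the sample documents are assumptions for illustration. Action lines carry no _id, so ES would generate the IDs.

```python
import json


def bulk_batches(index_name, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk bodies, starting a new one before max_bytes would be exceeded.

    A single document larger than max_bytes still gets a batch of its own.
    """
    batch, batch_size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc, ensure_ascii=False)
        pair = f"{action}\n{source}\n"  # the bulk format needs a trailing newline
        pair_size = len(pair.encode("utf-8"))
        if batch and batch_size + pair_size > max_bytes:
            yield "".join(batch)
            batch, batch_size = [], 0
        batch.append(pair)
        batch_size += pair_size
    if batch:
        yield "".join(batch)


# Hypothetical documents and a tiny budget so the split is visible.
sample_docs = [{"n": i} for i in range(10)]
batches = list(bulk_batches("logs", sample_docs, max_bytes=100))
print(len(batches), "batches")
print(batches[0])
```

Each yielded string would be the body of one POST /_bulk request with the Content-Type application/x-ndjson.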
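| Strategy | Notes |
|---|---|
| Use filter instead of must | No scoring plus caching means faster queries |
| Avoid deep pagination | from: 10000 performs very poorly, and from + size beyond 10000 is rejected by default (index.max_result_window); use search_after |
| Limit returned fields | _source: ["title", "date"] reduces I/O |
| Preload global ordinals | Enable eager_global_ordinals on fields used in aggregations |
| Use routing | Route by user ID to reduce the number of shards queried |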
// First page
GET /products/_search
{
"size": 100,
"query": { "match_all": {} },
"sort": [
{ "price": "asc" },
{ "_id": "asc" }
]
}
// Later pages: pass the sort values of the previous page's last hit as search_after
GET /products/_search
{
"size": 100,
"query": { "match_all": {} },
"search_after": [29.99, "product_12345"],
"sort": [
{ "price": "asc" },
{ "_id": "asc" }
]
}
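The paging loop around search_after is easy to get wrong, so here is a small sketch of it. To keep the example runnable without a cluster, the search function is an in-memory stand-in that sorts by (price, _id) and returns hits with a sort array, the same shape ES returns; in real code it would be replaced by a call to GET /products/_search. All names and the sample products are hypothetical.

```python
def search(docs, size, search_after=None):
    """In-memory stand-in for a search sorted by price, then _id."""
    ordered = sorted(docs, key=lambda d: (d["price"], d["_id"]))
    if search_after is not None:
        cursor = tuple(search_after)
        ordered = [d for d in ordered if (d["price"], d["_id"]) > cursor]
    return [{"_source": d, "sort": [d["price"], d["_id"]]} for d in ordered[:size]]


def scan_all(docs, size):
    """Page through every hit, feeding the last hit's sort values into the next request."""
    results, cursor = [], None
    while True:
        hits = search(docs, size, cursor)
        if not hits:
            break
        results.extend(hit["_source"] for hit in hits)
        cursor = hits[-1]["sort"]  # becomes search_after for the next page
    return results


# Hypothetical products; two share a price, so the _id tiebreaker matters.
products = [
    {"_id": "p3", "price": 9.99},
    {"_id": "p1", "price": 29.99},
    {"_id": "p2", "price": 29.99},
    {"_id": "p4", "price": 5.0},
    {"_id": "p5", "price": 49.0},
]
print([p["_id"] for p in scan_all(products, size=2)])
```

The tiebreaker in the sort is what keeps documents with equal prices from being skipped or repeated across pages.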
Problem: within one index, a field is inferred as text in some documents and as long in others, so later writes fail.
Solution:
PUT /my_index
{
"mappings": {
"dynamic": "strict",
"properties": {
"count": { "type": "integer" }
}
}
}
Or use a dynamic template:
{
"mappings": {
"dynamic_templates": [
{
"strings_as_keywords": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword",
"ignore_above": 256
}
}
}
]
}
}
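Problem: the index has only 1 primary shard and cannot scale horizontally as the data grows.
Solution:
Common causes:
Solution:
- precision_threshold on cardinality controls precision (higher values use more memory).
- Use keyword rather than text.
- indices.fielddata.cache.size: "30%"
Problem: a network partition leaves more than one node believing it is the master.
Solution: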
# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2 # ES 6.x
# ES 7.x+ manages the voting quorum itself; this setting is only used to bootstrap a new cluster:
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
Diagnosis:
GET _cat/thread_pool/write?v
# Check whether the rejected column keeps growing
Solution:
| Operation | Method + endpoint |
|---|---|
| Create an index | PUT /index_name |
| Delete an index | DELETE /index_name |
| Index a document | POST /index/_doc or PUT /index/_doc/{id} |
| Get a document | GET /index/_doc/{id} |
| Update a document | POST /index/_update/{id} |
| Delete a document | DELETE /index/_doc/{id} |
| Search | GET /index/_search |
| Bulk operations | POST /_bulk |
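| Status | Meaning |
|---|---|
| green | All primary and replica shards are assigned |
| yellow | All primary shards are assigned, but at least one replica is not |
| red | At least one primary shard is unassigned; some data is unavailable and may be lost |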
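The status rules in the table reduce to two checks, sketched below. The input format, a list of (is_primary, is_assigned) pairs, and the function name are assumptions made for this example; a real check would read the shard states from GET _cat/shards or GET _cluster/health.

```python
def cluster_health(shards):
    """Derive green/yellow/red from (is_primary, is_assigned) pairs, per the table above."""
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"      # at least one primary shard is unassigned
    if any(not is_primary and not assigned for is_primary, assigned in shards):
        return "yellow"   # primaries are fine, at least one replica is unassigned
    return "green"


# One primary with one replica, in three different states.
print(cluster_health([(True, True), (False, True)]))   # green
print(cluster_health([(True, True), (False, False)]))  # yellow
print(cluster_health([(True, False), (False, True)]))  # red
```

A single-node cluster with number_of_replicas set to 1 lands in the yellow case, because a replica is never placed on the same node as its primary.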
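| Version | Key changes |
|---|---|
| 5.x → 6.x | One type per index; _all field deprecated |
| 6.x → 7.x | Default of 1 primary shard per index; mapping types deprecated; ILM (added in late 6.x) |
| 7.x → 8.x | Mapping types removed; security enabled by default |
Maintenance note: this document is written against ElasticSearch 7.x/8.x, and some APIs may differ in older versions. Check the official documentation regularly for the latest information.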