Elasticsearch CRUD Operations and Analyzers

eddie · 2021-06-22

Elasticsearch Deployment

1. Download ES

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.2-linux-x86_64.tar.gz

2. Extract the archive

tar -zxvf elasticsearch-7.13.2-linux-x86_64.tar.gz 

3. Move it to /usr/local

mv elasticsearch-7.13.2 /usr/local/

4. Create a data directory

mkdir /usr/local/elasticsearch-7.13.2/data

5. Edit the ES config file

vim /usr/local/elasticsearch-7.13.2/config/elasticsearch.yml

# Everything is commented out by default; add or uncomment the following settings
cluster.name: my-elasticsearch
node.name: es-node1
path.data: /usr/local/elasticsearch-7.13.2/data
path.logs: /usr/local/elasticsearch-7.13.2/logs
network.host: 0.0.0.0
cluster.initial_master_nodes: ["es-node1"]
xpack.ml.enabled: false
bootstrap.memory_lock: false
bootstrap.system_call_filter: false

6. Adjust the JVM heap size

vim /usr/local/elasticsearch-7.13.2/config/jvm.options

# Size according to your machine's memory
-Xms1g
-Xmx1g

7. Change ownership of the ES directory

ES refuses to run as root, so create a dedicated user first if one does not already exist (e.g. useradd esuser), then:

[root@localhost elasticsearch-7.13.2]# chown -R esuser:esuser /usr/local/elasticsearch-7.13.2/

[root@localhost elasticsearch-7.13.2]# ll
total 608
drwxr-xr-x  2 esuser esuser   4096 Jun 11 05:06 bin
drwxr-xr-x  3 esuser esuser    199 Jun 22 19:32 config
drwxr-xr-x  3 esuser esuser     19 Jun 22 19:32 data
drwxr-xr-x  9 esuser esuser    107 Jun 11 05:06 jdk
drwxr-xr-x  3 esuser esuser   4096 Jun 11 05:06 lib
-rw-r--r--  1 esuser esuser   3860 Jun 11 04:59 LICENSE.txt
drwxr-xr-x  2 esuser esuser   4096 Jun 22 19:32 logs
drwxr-xr-x 59 esuser esuser   4096 Jun 11 05:06 modules
-rw-r--r--  1 esuser esuser 594096 Jun 11 05:04 NOTICE.txt
drwxr-xr-x  2 esuser esuser      6 Jun 11 05:04 plugins
-rw-r--r--  1 esuser esuser   2710 Jun 11 04:59 README.asciidoc

8. Raise the open-file and process limits

[esuser@localhost bin]$ su root
Password: 
[root@localhost bin]# vim /etc/security/limits.conf 

* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096

9. Set vm.max_map_count

[root@localhost bin]# vim /etc/sysctl.conf     # append: vm.max_map_count = 655360
[root@localhost bin]# sysctl -p
net.ipv4.ip_forward = 1
vm.max_map_count = 655360

10. Run in the foreground

su esuser
cd /usr/local/elasticsearch-7.13.2/bin
./elasticsearch

11. Open the firewall ports

firewall-cmd --permanent --add-port=9200-9300/tcp
firewall-cmd --reload

12. Check the node over HTTP

http://192.168.8.108:9200/

{
  "name" : "es-node1",
  "cluster_name" : "my-elasticsearch",
  "cluster_uuid" : "zEajxvlOSnOodvJCbPJsWg",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
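The fields of this response can be checked programmatically. A minimal sketch, parsing the sample response above with only the standard library (the JSON literal is copied from the output shown):

```python
import json

# Sample root-endpoint response, copied from the output above (abridged)
resp = json.loads("""
{
  "name": "es-node1",
  "cluster_name": "my-elasticsearch",
  "version": {"number": "7.13.2", "lucene_version": "8.8.2"},
  "tagline": "You Know, for Search"
}
""")

# Basic sanity checks on a fresh deployment
assert resp["cluster_name"] == "my-elasticsearch"
major = int(resp["version"]["number"].split(".")[0])
print(resp["name"], major)  # es-node1 7
```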

13. Run in the background

su esuser
cd /usr/local/elasticsearch-7.13.2/bin
./elasticsearch -d


# To shut down later, find the PID and kill it (a plain kill sends SIGTERM for a clean shutdown; kill -9 should be a last resort)
[esuser@localhost bin]$ jps
6357 Jps
6191 Elasticsearch

[esuser@localhost bin]$ kill -9 6191

Installing Elasticsearch-Head

The official ES-Head repository on GitHub lists four ways to run it:

  1. Running with built in server
  2. Running with docker
  3. Running as a Chrome extension
  4. Running as a plugin of Elasticsearch (deprecated)

1. Install Node.js

yum install centos-release-scl-rh
yum-config-manager --enable rhel-server-rhscl-7-rpms
yum install rh-nodejs10
scl enable rh-nodejs10 bash

https://www.softwarecollections.org/en/scls/?search=NodeJS

2. Install Head (running with the built-in server)

Per the es-head README, with the built-in server this amounts to:

git clone https://github.com/mobz/elasticsearch-head.git
cd elasticsearch-head
npm install
npm run start

3. Enable CORS in ES

 vim /usr/local/elasticsearch-7.13.2/config/elasticsearch.yml
 
###################### To monitor the cluster with head or similar plugins, enable the following ###########
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-credentials: true

4. Run and connect

  1. Kill ES and restart it so the CORS settings take effect
  2. Connect from the es-head page (served on port 9100 by default)

Elasticsearch Index Operations

  • elasticsearch-head
  • postman
  • code

Postman

Get cluster health

GET http://192.168.8.108:9200/_cluster/health

{
    "cluster_name": "my-elasticsearch",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 0,
    "active_shards": 0,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}
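For monitoring, the status field of this response is the part that matters. A hedged sketch of how it might be interpreted in code (the severity mapping and helper name are illustrative, not part of the ES API):

```python
# Map cluster status to a severity rank (illustrative helper, not an ES API)
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def needs_attention(health: dict, threshold: str = "yellow") -> bool:
    """Return True when cluster status is at or above the given severity."""
    return SEVERITY[health["status"]] >= SEVERITY[threshold]

health = {"cluster_name": "my-elasticsearch", "status": "green",
          "unassigned_shards": 0}
print(needs_attention(health))  # False for a green cluster
```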

Delete an index

DELETE http://192.168.8.108:9200/index_123

{
    "acknowledged": true
}

Create an index

PUT http://192.168.8.108:9200/index_temp

# Request body (JSON)
{
    "settings": {
        "index": {
            "number_of_shards": "2",
            "number_of_replicas": "0"
        }
    }
}

# Response
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "index_temp"
}

Get index settings

GET http://192.168.8.108:9200/index_temp

{
    "index_temp": {
        "aliases": {},
        "mappings": {},
        "settings": {
            "index": {
                "routing": {
                    "allocation": {
                        "include": {
                            "_tier_preference": "data_content"
                        }
                    }
                },
                "number_of_shards": "2",               ## 两个分片
                "provided_name": "index_temp",
                "creation_date": "1624419628430",
                "number_of_replicas": "0",             ## 零个副本
                "uuid": "58mTyZGOSzeXfcZ72jgBPw",
                "version": {
                    "created": "7130299"
                }
            }
        }
    }
}

List all indices

GET http://192.168.8.108:9200/_cat/indices?v

health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_temp 58mTyZGOSzeXfcZ72jgBPw   2   0          0            0       416b           416b

- pri: number of primary shards
- rep: number of replicas

Mappings

Similar to a schema definition in a database: a mapping maps documents into the flat format Lucene requires. A mapping belongs to an index's type, and each type has exactly one mapping definition; since 7.0 an index can hold only a single type, so the type no longer needs to be declared in the mapping. A mapping:

  • defines the index's fields and their names
  • defines each field's data type: string, boolean, number, ...
  • configures per-field inverted-index behavior, such as whether the field is analyzed

Create the index_mapping index together with its mapping:
PUT http://192.168.8.108:9200/index_mapping

# Request body
{
    "mappings": {
        "properties": {
            "realname": {
                "type": "text", // 相当于数据域的 varchar 或者 代码里面的 string, 可以做分词
                "index": true // realname 设置为true就会使用这个为索引 (默认:true)
            },
            "username": {
                "type": "keyword", // 精确匹配,不会被分词的
                "index": false // username 设置为true就会使用这个为索引 (默认:true)
            }
        }
    }
}

Once created, a field's type cannot be changed; you must delete and recreate the index. Appending new fields, however, is allowed.

Main data types
  • text, keyword (string is the legacy pre-5.x name)
  • long, integer, short, byte
  • double, float
  • boolean
  • date
  • object
  • arrays: no mixing, all elements must share one type
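As a sketch, an append-mapping payload using these types can be assembled and sanity-checked before sending it; the KNOWN_TYPES whitelist below is an illustrative subset, not the full ES type list:

```python
import json

# Core ES field types from the list above (subset, for illustration)
KNOWN_TYPES = {"text", "keyword", "long", "integer", "short", "byte",
               "double", "float", "boolean", "date", "object"}

fields = {"id": "long", "age": "integer", "is_teenger": "boolean",
          "birthday": "date", "relationship": "object"}

# Reject unknown types before sending the request
unknown = {t for t in fields.values() if t not in KNOWN_TYPES}
assert not unknown, f"unsupported types: {unknown}"

# Shape matches the append-fields request body shown below
payload = {"properties": {name: {"type": t} for name, t in fields.items()}}
print(json.dumps(payload, indent=2))
```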
POST http://192.168.8.108:9200/index_mapping/_mapping

# Fields to append
{
    "properties": {
        "id": {
            "type": "long"
        },
        "age": {
            "type": "integer"
        },
        "money1": { 
            "type": "double"
        },
        "money2": {
            "type": "float"
        },
        "sex": {
            "type": "byte"
        },
        "score": {
            "type": "short"
        },
        "is_teenger": {
            "type": "boolean"
        },
        "birthday": {
            "type": "date"
        },
        "relationship": {
            "type": "object"
        }
    }
}

# Response
{
    "acknowledged": true
}

You can check the index info in es-head to confirm the fields were appended.

text vs keyword analysis
text
GET http://192.168.8.108:9200/index_mapping/_analyze

# Analyze content with the text field's analyzer
{
    "field": "realname",
    "text": "i am eddie"
}


# Token result
{
    "tokens": [
        {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "am",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "eddie",
            "start_offset": 5,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}
keyword
GET http://192.168.8.108:9200/index_mapping/_analyze

# Analyze content with the keyword field's analyzer
{
    "field": "username",
    "text": "i am eddie"
}


# Token result: a single token
{
    "tokens": [
        {
            "token": "i am eddie",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0
        }
    ]
}
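The contrast above can be mimicked locally: the standard analyzer roughly lowercases and splits on word boundaries, while keyword emits the input unchanged. A rough stand-in (real analyzers do much more, e.g. Unicode word segmentation):

```python
import re

def analyze_text(s: str) -> list:
    # Rough stand-in for the standard analyzer: lowercase + alphanumeric split
    return re.findall(r"[a-z0-9]+", s.lower())

def analyze_keyword(s: str) -> list:
    # keyword fields are stored as one single, unanalyzed token
    return [s]

print(analyze_text("i am eddie"))     # ['i', 'am', 'eddie']
print(analyze_keyword("i am eddie"))  # ['i am eddie']
```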

These analyzers target English by default; Chinese text gets split into individual characters, so a dedicated plugin is needed (installed below).

Basic Document Operations

Create an index in es-head

Index name: my_index
Shards: 1
Replicas: 0

PostMan

With an explicit ID: /{index}/_doc/{id}

With an ES auto-generated ID: /{index}/_doc/

Add a document

POST http://192.168.8.108:9200/my_index/_doc/1

# Request body. Do not append // comments here, or es-head cannot display the result and you would have to GET the document yourself
{
    "id": 1001,
    "name": "eddie-1",
    "desc": "i am eddie",
    "create_date": "2021-06-23"
}

In es-head's data browser you can search the desc field for "eddie".

Delete a document

Format: /{index}/_doc/{id}
DELETE http://192.168.8.108:9200/my_index/_doc/5

# Response
{
    "_index": "my_index",
    "_type": "_doc",
    "_id": "5",
    "_version": 2,
    "result": "deleted",    ## 如果是已经不存在或者重覆删除,就会提示 "not_found"
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 18,
    "_primary_term": 1
}

This is not a physical delete: the document is only marked as deleted and remains on disk. ES purges these marked documents later, during background segment merges as data accumulates.

Update a document

Partial update (single field)
Format: /{index}/_doc/{id}/_update
POST http://192.168.8.108:9200/my_index/_doc/1/_update

# Send only the desc field to change; the response contains "result": "updated"
{
    "doc":{
        "desc":"我是eddie"
    }
}
Full update by document _id
PUT http://192.168.8.108:9200/my_index/_doc/1

# Request body; replaces every field of the document
{
    "id": 1,
    "name": "eddie-1-update",
    "desc": "i am eddie---更新咯",
    "create_date": "2021-06-23"
}

Each update increments _version.

Query documents

Get by document _id
GET http://192.168.8.108:9200/my_index/_doc/1

# Response
{
    "_index": "my_index",
    "_type": "_doc",
    "_id": "1",
    "_version": 6,
    "_seq_no": 5,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "id": 1001,
        "name": "eddie-1",
        "desc": "i am eddie",
        "create_date": "2021-06-23"
    }
}
Use _search to fetch all documents in an index
GET http://192.168.8.108:9200/my_index/_doc/_search
Return only selected fields
# Only the id field
GET http://192.168.8.108:9200/my_index/_doc/1?_source=id

# Return multiple fields, e.g. id and name
GET http://192.168.8.108:9200/my_index/_doc/1?_source=id,name

# Return id and name for every document
GET http://192.168.8.108:9200/my_index/_doc/_search?_source=id,name
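When calling these endpoints from code, the _source filter is just a query-string parameter. A sketch using only the standard library (the host and index are the ones used throughout this post; the helper name is illustrative):

```python
from urllib.parse import urlencode

BASE = "http://192.168.8.108:9200/my_index/_doc"

def doc_url(doc_id, fields=None):
    """Build a document GET URL, optionally restricting the returned fields."""
    url = f"{BASE}/{doc_id}"
    if fields:
        # urlencode percent-escapes the comma separating field names
        url += "?" + urlencode({"_source": ",".join(fields)})
    return url

print(doc_url(1))                  # http://192.168.8.108:9200/my_index/_doc/1
print(doc_url(1, ["id", "name"]))  # ...?_source=id%2Cname
```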
Check whether a document exists
# Returns 200 if it exists, 404 otherwise; checking only the status code keeps the response small, which saves resources

GET http://192.168.8.108:9200/my_index/_doc/1

Optimistic Locking for Documents

The key parameters:

  • version
  • if_seq_no
  • if_primary_term

_version increments on every modification, so concurrent writers can compare version numbers to detect conflicting updates.

Example

Create a document
POST http://192.168.8.108:9200/my_index/_doc/2021

# Request body
{
    "id": 2021,
    "name": "eddie-2021",
    "desc": "i am eddie---2021",
    "create_date": "2021-06-23"
}

# Response
{
    "_index": "my_index",
    "_type": "_doc",
    "_id": "2021",
    "_version": 1,           # version=1
    "result": "created",
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 23,
    "_primary_term": 1
}

Verify it was created:

GET http://192.168.8.108:9200/my_index/_doc/2021

Update the document

POST http://192.168.8.108:9200/my_index/_doc/2021

# Request body
{
    "doc":{
        "name":"lee wait city"
    }
}

# Response
{
    "_index": "my_index",
    "_type": "_doc",
    "_id": "2021",
    "_version": 2,           # 变成 version=2
    "result": "updated",
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 24,               # 24
    "_primary_term": 1
}

Optimistic locking: try updating by version number

POST http://192.168.8.108:9200/my_index/_doc/2021?version=2

# Request body
{
    "doc":{
        "name":"lee wait city1"
    }
}

# Error response
{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: internal versioning can not be used for optimistic concurrency control. Please use `if_seq_no` and `if_primary_term` instead;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: internal versioning can not be used for optimistic concurrency control. Please use `if_seq_no` and `if_primary_term` instead;"
    },
    "status": 400
}

ES tells us to use if_seq_no and if_primary_term; older versions used _version for this.

Fix the request parameters as instructed

POST http://192.168.8.108:9200/my_index/_doc/2021?if_seq_no=24&if_primary_term=1

# Request body
{
    "doc":{
        "name":"lee wait city1"
    }
}

# Response
{
    "_index": "my_index",
    "_type": "_doc",
    "_id": "2021",
    "_version": 3,
    "result": "updated",
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 25,
    "_primary_term": 1
}

根据 "_seq_no": 24 和 "_primary_term": 1 作为乐观锁的关键参数,然后更新完成后累加到=25,其实跟_version是一样的

Built-in Analyzers

  • standard : the default; keeps tokens like "I's" intact
  • simple : splits on non-letters, so "I's" becomes "i" and "s"
  • whitespace : splits on whitespace only
  • stop : removes meaningless stop words outright
  • keyword : treats the entire text as a single token; no splitting

English analysis

POST http://192.168.8.108:9200/_analyze

# Analyze a sentence
{
    "analyzer": "standard",
    "text": "I study in baidu.com"
}

--- Analyze using a field of a specific index
POST http://192.168.8.108:9200/my_index/_analyze

{
    "analyzer": "standard",
    "field": "desc",
    "text": "I study in baidu.com"
}

Chinese analysis

POST http://192.168.8.108:9200/_analyze

# Analyze a sentence
{
    "analyzer": "standard",
    "text": "我在百度学习"
}

The standard analyzer has no Chinese word segmentation; the result is one token per character.

The IK Chinese Analyzer

IK analyzer on GitHub

The plugin version must match your ES version.

Extract it into the plugins directory

unzip elasticsearch-analysis-ik-7.13.2.zip -d /usr/local/elasticsearch-7.13.2/plugins/ik

Restart ES for the plugin to take effect.

Example: finest granularity

POST http://192.168.8.108:9200/_analyze

{
    "analyzer": "ik_max_word", 
    "text": "我在百度学习,JAVA技术!"
}

The example above tokenizes to: "我,在,百度,百,度,学习,JAVA,技术"

Example: coarsest granularity

POST http://192.168.8.108:9200/_analyze

{
    "analyzer": "ik_smart",
    "text": "我在百度学习,JAVA技术!"
}

The example above tokenizes to: "我,在,百度,学习,JAVA,技术"

Custom Dictionary

Point IK at a custom dictionary file:

vim /usr/local/elasticsearch-7.13.2/plugins/ik/config/IKAnalyzer.cfg.xml 

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension config</comment>
        <!-- configure your own extension dictionary here -->
        <entry key="ext_dict">custom.dic</entry>                      <!-- This -->
         <!-- configure your own extension stopword dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

custom.dic holds your custom terms, one per line; it must be saved as UTF-8.

vim custom.dic

埃德网
埃德
德网
埃
德
网
骚货

:wq! ++enc=utf8      # force vim to save the file as UTF-8
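Alternatively the dictionary file can be generated by a script, which sidesteps editor encoding mistakes entirely. A sketch (it writes to the current directory here; the real file lives under plugins/ik/config/ as configured above):

```python
# Write an IK custom dictionary as UTF-8, one term per line
terms = ["埃德网", "埃德", "德网"]

path = "custom.dic"  # real setup: /usr/local/elasticsearch-7.13.2/plugins/ik/config/custom.dic
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(terms) + "\n")

# Verify the file round-trips cleanly as UTF-8
with open(path, encoding="utf-8") as f:
    assert f.read().splitlines() == terms
print("wrote", path)
```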

Restart ES again for the dictionary changes to take effect.


# Elasticsearch