I found that the html_strip char filter configuration had no effect: Elasticsearch did not strip HTML tags when the text was analyzed through the ik tokenizer.
Test code:
PUT l2qq
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "@version": {
        "type": "keyword"
      },
      "allcontent": {
        "type": "text",
        "analyzer": "my_custom_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "id": {
        "type": "long"
      },
      "last_index_time": {
        "type": "date"
      },
      "lasttimeat": {
        "type": "date"
      },
      "title": {
        "type": "keyword"
      }
    }
  }
}
The custom analyzer defined under settings above is supposed to run html_strip, but in my tests it did not take effect, so the extra setup is pointless. The mapping can simply use ik_smart directly:
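As a sanity check, the output of a custom analyzer can be inspected directly with the _analyze API before indexing anything. A quick test against the index above (the sample text is made up):

```
GET l2qq/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>some <b>bold</b> text</p>"
}
```

If html_strip were being applied, the returned token list would contain no tag fragments.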
PUT /l2qq
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "@version": {
        "type": "keyword"
      },
      "allcontent": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "content": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "id": {
        "type": "long"
      },
      "last_index_time": {
        "type": "date"
      },
      "lasttimeat": {
        "type": "date"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
The correct way to handle this is to clean the data during the Logstash import: use a filter to strip the HTML fragments from the field before it reaches Elasticsearch.
filter {
  # Ruby regexes are line-based by default; the (?m) flag lets . match
  # newlines so that multi-line blocks are removed as well.
  mutate {
    gsub => [ "content", "(?m)<script(.*?)</script>", "" ]
  }
  mutate {
    gsub => [ "content", "(?m)<iframe(.*?)</iframe>", "" ]
  }
  mutate {
    gsub => [ "content", "(?m)<style(.*?)</style>", "" ]
  }
  # strip any remaining tags
  mutate {
    gsub => [ "content", "(?m)<(.*?)>", "" ]
  }
  # drop leftover space characters (e.g. rendered &nbsp; entities)
  mutate {
    gsub => [ "content", " ", "" ]
  }
}
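Since the mapping indexes both content and allcontent, the same cleanup presumably needs to run on both fields. mutate's gsub accepts repeated field/pattern/replacement triplets, so the separate blocks can also be collapsed into one; a sketch, with the allcontent field name taken from the mapping above (iframe/style patterns omitted for brevity, add them the same way):

```
filter {
  mutate {
    # field, pattern, replacement triplets; (?m) lets . span newlines
    gsub => [
      "content",    "(?m)<script(.*?)</script>", "",
      "content",    "(?m)<(.*?)>",               "",
      "allcontent", "(?m)<script(.*?)</script>", "",
      "allcontent", "(?m)<(.*?)>",               ""
    ]
  }
}
```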