Sometimes we just don’t have control over the source of data coming into our Elasticsearch indices. In such cases it helps to clean the data, removing unwanted content such as HTML tags, before it is put into your Elasticsearch index. This prevents unwanted and unpredictable search behaviour.

For instance, given the text below:

[bash]<a href="http://somedomain.com>">website</a>[/bash]

If the above is indexed without cleaning the HTML, a search for “somedomain” will match documents containing the above link. That might be what you want, but in most cases it is not what users expect. To prevent this, you can use a custom analyser to clean your data.
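To see why the raw markup matters, here is a minimal Python sketch that roughly imitates the two cases outside of Elasticsearch. The regular expressions and the helper names `strip_html` and `tokenize` are assumptions made for illustration only; the real html_strip char filter and standard tokenizer are more thorough. The sample string is a simplified version of the link above.

```python
import html
import re
from typing import List

# Assumption: a crude stand-in for html_strip (tags become a space).
TAG_RE = re.compile(r"<[^>]*>")
# Assumption: a crude stand-in for a word tokenizer that keeps dotted names together.
WORD_RE = re.compile(r"\w+(?:\.\w+)*")


def strip_html(text: str) -> str:
    """Remove tags, then decode entities such as &eacute;."""
    return html.unescape(TAG_RE.sub(" ", text))


def tokenize(text: str) -> List[str]:
    """Split text into word-like tokens."""
    return WORD_RE.findall(text)


raw = 'Some d&eacute;j&agrave; vu <a href="http://somedomain.com">website</a>'

raw_tokens = tokenize(raw)                # markup leaks into the tokens
clean_tokens = tokenize(strip_html(raw))  # only the visible text remains

print(raw_tokens)
print(clean_tokens)
```

Without stripping, pieces of the markup such as `href` and `somedomain.com` end up as searchable tokens, which is the behaviour the custom analyser is meant to avoid.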

Below is an example solution, along with techniques to debug and inspect your analyser, such as querying the actual tokens stored in your index. Note that this is not the same as the Elasticsearch document _source field, which always holds the raw data exactly as it was sent to Elasticsearch.

Cleaning Elasticsearch Data

Create a new index with the required html_strip char filter configured

[bash]
PUT /html_poc_v3
{
"settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
},
"mappings": {
    "html_poc_type": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_html_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "standard"
        },
        "title": {
          "type": "string",
          "index_analyzer": "my_html_analyzer"
        },
        "urlTitle": {
          "type": "string"
        }
      }
    }
}
}
[/bash]

Post Some Data

[bash]
POST /html_poc_v3/html_poc_type/02
{
"description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
"title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
"body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}
[/bash]

Now retrieve indexed data

This bypasses the _source field and fetches the actual indexed tokens for each field.

[bash]
GET /html_poc_v3/html_poc_type/_search?pretty=true
{
"query": {
    "match_all": {}
},
"script_fields": {
    "title": {
      "script": "doc[field].values",
      "params": {
        "field": "title"
      }
    },
    "description": {
      "script": "doc[field].values",
      "params": {
        "field": "description"
      }
    },
    "body": {
      "script": "doc[field].values",
      "params": {
        "field": "body"
      }
    }
}
}
[/bash]
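The request body above repeats the same script for every field, so it can be generated. The following is a small Python sketch, assuming you want the same `doc[field].values` script for each field; the function name `build_token_query` is hypothetical, and the script syntax is simply copied from the request above.

```python
import json
from typing import Dict, List


def build_token_query(fields: List[str]) -> Dict:
    """Build a match_all search body with one script field per field name."""
    return {
        "query": {"match_all": {}},
        "script_fields": {
            name: {
                # Same script and params as in the hand-written request.
                "script": "doc[field].values",
                "params": {"field": name},
            }
            for name in fields
        },
    }


body = build_token_query(["title", "description", "body"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the `_search` request, which makes it easier to inspect more fields without copying the block by hand.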

Example Response

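Note the difference between title, body and description: title and body are indexed with my_html_analyzer, so the tags are stripped and the entities decoded (déjà), while description uses the standard analyzer and ends up with markup fragments such as a, href, http, p, eacute and agrave as tokens.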

[bash]
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "html_poc_v3",
            "_type": "html_poc_type",
            "_id": "02",
            "_score": 1,
            "fields": {
               "title": [
                  [
                     "Some",
                     "Title",
                     "déjà",
                     "vu",
                     "website"
                  ]
               ],
               "body": [
                  [
                     "Body",
                     "Some",
                     "déjà",
                     "vu",
                     "website"
                  ]
               ],
               "description": [
                  [
                     "a",
                     "agrave",
                     "d",
                     "description",
                     "eacute",
                     "href",
                     "http",
                     "j",
                     "p",
                     "some",
                     "somedomain.com",
                     "vu",
                     "website"
                  ]
               ]
            }
         }
      ]
   }
}
[/bash]
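Spotting leftover markup by eye gets tedious with more fields, so the check can be scripted. Below is a minimal Python sketch that scans a response shaped like the one above and reports, per field, any tokens that look like HTML residue. The `SUSPECT` set and the function names are assumptions chosen for this example data, not an exhaustive list, and the response is a trimmed copy of the one shown.

```python
import json

# Trimmed copy of the example response (only the parts the check needs).
response = json.loads("""
{
  "hits": {
    "hits": [
      {
        "_id": "02",
        "fields": {
          "title": [["Some", "Title", "déjà", "vu", "website"]],
          "body": [["Body", "Some", "déjà", "vu", "website"]],
          "description": [["a", "agrave", "d", "description", "eacute",
                           "href", "http", "j", "p", "some",
                           "somedomain.com", "vu", "website"]]
        }
      }
    ]
  }
}
""")

# Assumption: tokens that suggest tags or entities leaked into the index.
SUSPECT = {"a", "p", "href", "http", "eacute", "agrave"}


def flatten(values):
    """Yield tokens from the (possibly nested) lists returned per field."""
    for value in values:
        if isinstance(value, list):
            yield from flatten(value)
        else:
            yield value


def residue_by_field(resp):
    """Map each field name to the sorted suspect tokens found in it."""
    report = {}
    for hit in resp["hits"]["hits"]:
        for field, values in hit.get("fields", {}).items():
            found = set(flatten(values)) & SUSPECT
            report[field] = sorted(set(report.get(field, [])) | found)
    return report


report = residue_by_field(response)
for field, tokens in report.items():
    print(field, tokens)
```

For the example data, only description reports residue, which matches the mapping: it is the one field analysed without html_strip.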