CRUD apps make up a huge part of what we interact with on the internet and on our personal devices. You'll probably wind up making one sooner or later!
But what about searching in a CRUD app? To find what you want to read/update/delete, you'll probably have to implement search, right? Does that make an app SCRUD? Or maybe CRUDS? What's a good way?
Well, the easiest way would probably be a simple wildcard search. Maybe a query to an SQL database like so:
> SELECT * FROM places WHERE name LIKE '%Houston%';
This is pretty good. Now if I type things like "Hou", I also get search-as-you-type! Pretty neat, eh?
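In practice, search-as-you-type just means interpolating whatever the user has typed so far into the wildcard. A quick sketch against the same places table:
> SELECT * FROM places WHERE name LIKE '%Hou%';
One caveat: a leading % means the database can't use an ordinary index on name, so if matching from the start of the name is good enough, LIKE 'Hou%' is much cheaper.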
Okay. Well, what if someone mistypes, and starts typing out "Houstin", what do? Well, there's SOUNDEX().
> SELECT SOUNDEX('Houston');
+--------------------+
| SOUNDEX('Houston') |
+--------------------+
| H235               |
+--------------------+
1 row in set (0.000 sec)
> SELECT SOUNDEX('Houstin');
+--------------------+
| SOUNDEX('Houstin') |
+--------------------+
| H235               |
+--------------------+
1 row in set (0.000 sec)
That works! Now we can store location name SOUNDEX() tokens in the database, and search on a SOUNDEX() call. Great!
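A minimal sketch of that, assuming we add a name_soundex column (the column name is my invention) to the places table, sized generously since MySQL's SOUNDEX() can return more than four characters:
> ALTER TABLE places ADD COLUMN name_soundex VARCHAR(20);
> UPDATE places SET name_soundex = SOUNDEX(name);
> SELECT * FROM places WHERE name_soundex = SOUNDEX('Houstin');
MySQL and MariaDB also have a SOUNDS LIKE shorthand (WHERE name SOUNDS LIKE 'Houstin') that does the two SOUNDEX() calls for you, at the cost of not using the precomputed column.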
But wait... we just lost search-as-you-type. The tokens are only built on the completed name, so if I search "Hou", it'll wind up with a different SOUNDEX() token. We could build a table of SOUNDEX() tokens for each prefix, starting with "H", then "Ho", then "Hou", but that is starting to get into the realm of "Wow, this is absolutely awful." Is there a better way?
Elasticsearch has something that can do this for you, called edge N-grams. An N-gram is just a contiguous run of N characters from a piece of text. An edge N-gram is one anchored to the beginning of a word token. Elasticsearch can break our locations up into a pile of edge N-grams, and associate them all with a single record, like so:
curl -X GET "23.251.150.194:9200/places/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "autocomplete_store", "text":"Houston"}'
{
  "tokens": [
    {
      "token": "hou",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "hous",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "houst",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "housto",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 3
    },
    {
      "token": "houston",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 4
    }
  ]
}
So if I search "Hou", it should match one of the many N-grams in our awesome index, and produce the relevant result!
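You can sanity-check the query side the same way. The autocomplete_search analyzer just lowercases its input, so "Hou" should come out as the single token "hou", which is exactly what's sitting in the index:
curl -X GET "23.251.150.194:9200/places/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "autocomplete_search", "text":"Hou"}'
{
  "tokens": [
    {
      "token": "hou",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}
(This assumes the index from the next step already exists; _analyze needs the index to resolve a custom analyzer by name.)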
So what does this require? Well, first we need to create an index in Elasticsearch:
$ curl -X PUT "23.251.150.194:9200/places" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_store": {
          "tokenizer": "autocomplete_tokenization",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete_tokenization": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 50,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      },
      "name": {
        "type": "text",
        "analyzer": "autocomplete_store",
        "search_analyzer": "autocomplete_search"
      },
      "population": {
        "type": "integer"
      },
      "state": {
        "type": "text",
        "analyzer": "autocomplete_store",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}'
That looks complicated, but it's really not too bad. All it's saying is:
- Create an index called places in Elasticsearch, with a mapping that has a text name, a text state, a geo_point location, and an integer population.
- Store values for "name" and "state" using the autocomplete_store analyzer, which breaks the string into edge_ngram tokens with the autocomplete_tokenization tokenizer and then lowercases each token with the lowercase filter.
- Analyze queries using the autocomplete_search analyzer, which just lowercases the input string.
We set a minimum N-gram length of 3, and a maximum of 50. We don't want to go too small: one, we don't want to store N-grams we don't need; two, no user will ever get relevant results from a single character of input; and three, if you ever use synonyms, you'll find out that N-grams definitely match on synonyms. Most abbreviations are 2 characters long, which can cause havoc if you're trying to store addresses, only to find out that Elasticsearch swapped out the "St" in "Staten Island" with "Saint", making "Staten Island" show up as a relevant result when typing out the word "Saint".
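One step I'm glossing over: the index needs documents in it before there's anything to search. A sketch of indexing a single place (POSTing to _doc lets Elasticsearch generate IDs like the ones you'll see in the results below):
curl -X POST "23.251.150.194:9200/places/_doc" -H 'Content-Type: application/json' -d '
{
  "name": "Houston",
  "state": "Texas",
  "location": "29.7860,-95.3885",
  "population": 5724418
}'
For a real dataset you'd batch these up with the _bulk API instead of making one request per document.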
And to query, it's a simple match operation:
curl -X GET "127.0.0.1:9200/places/_search?pretty=true&search_type=dfs_query_then_fetch" -H 'Content-Type: application/json' -d '
{
"query": {
"match": {
"name": "Hou"
}
}
}
'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 40,
      "relation" : "eq"
    },
    "max_score" : 8.342884,
    "hits" : [
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "eh-W7oMBwIds9pfTh-z8",
        "_score" : 8.342884,
        "_source" : {
          "name" : "Houma",
          "state" : "Louisiana",
          "location" : "29.5800,-90.7059",
          "population" : "146665"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "5SCb7oMBwIds9pfTkyi7",
        "_score" : 8.342884,
        "_source" : {
          "name" : "Houck",
          "state" : "Arizona",
          "location" : "35.2714,-109.2237",
          "population" : "935"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "tyCg7oMBwIds9pfTFVqZ",
        "_score" : 8.342884,
        "_source" : {
          "name" : "House",
          "state" : "New Mexico",
          "location" : "34.6492,-103.9038",
          "population" : "69"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "OCCg7oMBwIds9pfTp2Eo",
        "_score" : 8.342884,
        "_source" : {
          "name" : "Hough",
          "state" : "Oklahoma",
          "location" : "36.8720,-101.5747",
          "population" : "12"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "Th-W7oMBwIds9pfTcet2",
        "_score" : 7.128949,
        "_source" : {
          "name" : "Houston",
          "state" : "Texas",
          "location" : "29.7860,-95.3885",
          "population" : "5724418"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "QyCY7oMBwIds9pfT_QuP",
        "_score" : 7.128949,
        "_source" : {
          "name" : "Houston",
          "state" : "Mississippi",
          "location" : "33.8963,-89.0031",
          "population" : "3426"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "SCCZ7oMBwIds9pfTPg7D",
        "_score" : 7.128949,
        "_source" : {
          "name" : "Houston",
          "state" : "Missouri",
          "location" : "37.3212,-91.9610",
          "population" : "2927"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "UCCa7oMBwIds9pfTCBeq",
        "_score" : 7.128949,
        "_source" : {
          "name" : "Houston",
          "state" : "Alaska",
          "location" : "61.6159,-149.8003",
          "population" : "1952"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "ViCa7oMBwIds9pfT6SGN",
        "_score" : 7.128949,
        "_source" : {
          "name" : "Houston",
          "state" : "Pennsylvania",
          "location" : "40.2494,-80.2110",
          "population" : "1260"
        }
      },
      {
        "_index" : "places",
        "_type" : "_doc",
        "_id" : "hyCb7oMBwIds9pfTdSc8",
        "_score" : 7.128949,
        "_source" : {
          "name" : "Houston",
          "state" : "Minnesota",
          "location" : "43.7583,-91.5706",
          "population" : "984"
        }
      }
    ]
  }
}
This will return all cities with an N-gram that matches "Hou". That's our search suggestions! There's a lot of tuning and tweaking I can't cover here, such as determining relevancy (e.g. using our "population" field to bubble up more relevant cities; see the sketch below), or searching across multiple indices (say we wanted separate city and neighborhood indices, but to search both at once; here's something you can use to start: curl -X GET "127.0.0.1:9200/places,neighborhoods/_search?pretty=true&search_type=dfs_query_then_fetch" -H 'Content-Type: application/json').
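For the relevancy piece, here's one hedged sketch: Elasticsearch's function_score query can fold a numeric field into the score. Something like this (the parameters are my own guesses, not from the setup above) would multiply the match score by log(1 + population), so Houston, Texas outranks the smaller Houstons:
curl -X GET "127.0.0.1:9200/places/_search?pretty=true" -H 'Content-Type: application/json' -d '
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "name": "Hou"
        }
      },
      "field_value_factor": {
        "field": "population",
        "modifier": "log1p"
      }
    }
  }
}
'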
The Elasticsearch documentation is really good, and I can't recommend it enough. Also, drop me a line if this is confusing, and I might be able to help out.