Full-text search using text indexes
Text indexes in ClickHouse (also known as "inverted indexes") provide fast full-text search capabilities on string data. The index maps each token in the column to the rows which contain that token. Tokens are generated by a process called tokenization. For example, ClickHouse tokenizes the English sentence "All cats like mice." by default as ["All", "cats", "like", "mice"] (note that the trailing dot is ignored). More advanced tokenizers are available, for example for log data.
Creating a Text Index
To create a text index, first enable the corresponding experimental setting:
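```sql
-- setting name assumed here; it may differ between ClickHouse versions
SET allow_experimental_full_text_index = 1;
```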
A text index can be defined on String and FixedString columns using the following syntax:
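```sql
-- a sketch with placeholder table, column, and index names
CREATE TABLE tab
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(tokenizer = 'default') -- 'default', 'ngram', 'split', or 'no_op'
)
ENGINE = MergeTree
ORDER BY key;
```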
The tokenizer argument specifies the tokenizer:
- default splits strings along non-alphanumeric ASCII characters.
- ngram splits strings into equally sized n-grams.
- split splits strings along certain user-defined separator strings.
- no_op performs no tokenization, i.e. every row value is a token.
To test how the tokenizers split the input string, you can use ClickHouse's tokens function:
As an example,
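```sql
SELECT tokens('All cats like mice.') AS tokens;
```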
returns
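```text
['All','cats','like','mice']
```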
If you choose the ngram tokenizer, you can set the ngram length using the (optional) parameter ngram_size.
If ngram_size is not specified, the default ngram size is 3.
The smallest and largest possible ngram sizes are 2 and 8, respectively.
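For example, the tokens function (assuming it accepts the ngram size as a third argument) shows how an ngram tokenizer of size 3 splits a string:

```sql
SELECT tokens('abc def', 'ngram', 3) AS tokens;
-- ['abc','bc ','c d',' de','def']
```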
If you chose the split tokenizer, you can set the separators using the (optional) parameter separators.
The parameter expects a list of strings, for example, separators = [', ', '; ', '\n', '\\'].
Note that each string can consist of multiple characters (', ' in the example).
If parameter separators is not specified, a single whitespace [' '] is used by default.
The split tokenizer applies the split separators left-to-right.
This can create ambiguities.
For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching the order of the separator strings to ['%', '%21'] will output ['21abc'].
In most cases, you want matching to prefer longer separators.
This can generally be done by passing the separator strings in order of descending length.
If the separator strings happen to form a prefix code, they can be passed in arbitrary order.
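As a sketch (placeholder names), an index with the split tokenizer and multi-character separators could be declared like this:

```sql
ALTER TABLE tab ADD INDEX split_idx(str) TYPE text(tokenizer = 'split', separators = [', ', '; ', '\n', '\\']);
```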
Text indexes in ClickHouse are implemented as secondary indexes. However, unlike other skipping indexes, text indexes have a default index GRANULARITY of 64. This value was chosen empirically and provides a good trade-off between speed and index size for most use cases. Advanced users can specify a different index granularity (we do not recommend doing so).
Advanced parameters
The default values of the following advanced parameters will work well in virtually all situations. We do not recommend changing them.
- Optional parameter dictionary_block_size (default: 128) specifies the size of dictionary blocks in rows.
- Optional parameter dictionary_block_frontcoding_compression (default: 1) specifies whether the dictionary blocks use front coding as compression.
- Optional parameter max_cardinality_for_embedded_postings (default: 16) specifies the cardinality threshold below which posting lists are embedded into dictionary blocks.
- Optional parameter bloom_filter_false_positive_rate (default: 0.1) specifies the false-positive rate of the dictionary bloom filter.
Text indexes can be added to or removed from a column after the table has been created:
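```sql
-- add the index and build it for existing data (placeholder names)
ALTER TABLE tab ADD INDEX text_idx(str) TYPE text(tokenizer = 'default');
ALTER TABLE tab MATERIALIZE INDEX text_idx;

-- remove the index
ALTER TABLE tab DROP INDEX text_idx;
```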
Using a Text Index
Using a text index in SELECT queries is straightforward, as common string search functions leverage the index automatically.
Supported functions
The text index can be used if the following functions appear in the WHERE clause of a SELECT query:
= and !=
= (equals) and != (notEquals) match the entire given search term.
Example:
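```sql
-- a sketch assuming the placeholder table tab with a no_op-tokenized index on str
SELECT count() FROM tab WHERE str = 'Hello, ClickHouse!';
```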
The text index supports = and !=, yet equality and inequality search only make sense with the no_op tokenizer (which causes the index to store entire row values).
IN and NOT IN
IN (in) and NOT IN (notIn) are similar to functions equals and notEquals, but they match any (IN) or none (NOT IN) of the search terms.
Example:
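```sql
-- hypothetical search terms
SELECT count() FROM tab WHERE str IN ('Hello, ClickHouse!', 'Hello, world!');
```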
The same restrictions as for = and != apply, i.e. IN and NOT IN only make sense in conjunction with the no_op tokenizer.
LIKE, NOT LIKE and match
These functions currently use the text index for filtering only if the index tokenizer is either default or ngram.
In order to use LIKE (like), NOT LIKE (notLike), and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term.
Example:
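```sql
-- a substring pattern (hypothetical data); see the note below
SELECT count() FROM tab WHERE str LIKE '%support%';
```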
support in the example could match support, supports, supporting, etc.
This kind of query is a substring query and it cannot be sped up by a text index.
To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:
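```sql
SELECT count() FROM tab WHERE str LIKE '% support %';
```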
The spaces left and right of support make sure that the term can be extracted as a token.
startsWith and endsWith
Similar to LIKE, functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term.
Example:
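```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse support');
```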
In the example, only clickhouse is considered a token.
support is not a token because it could match support, supports, supporting, etc.
To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:
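```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse supports ');
```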
Similarly, endsWith should be used with a leading space:
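```sql
-- hypothetical search term
SELECT count() FROM tab WHERE endsWith(str, ' olap engine');
```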
hasToken and hasTokenOrNull
Functions hasToken and hasTokenOrNull match against a single given token.
Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).
Example:
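```sql
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```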
Functions hasToken and hasTokenOrNull are the most performant functions to use with the text index.
searchAny and searchAll
Functions searchAny and searchAll match rows that contain at least one (searchAny) or all (searchAll) of the given tokens.
As with hasToken, no tokenization of the search terms takes place.
Example:
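```sql
-- at least one of the tokens must match
SELECT count() FROM tab WHERE searchAny(str, ['clickhouse', 'olap']);

-- all of the tokens must match
SELECT count() FROM tab WHERE searchAll(str, ['clickhouse', 'olap']);
```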
Implementation
Index layout
Each text index consists of two (abstract) data structures:
- a dictionary which maps each token to a postings list, and
- a set of postings lists, each representing a set of row numbers.
Since a text index is a skip index, these data structures exist logically per index granule.
During index creation, three files are created (per part):
Dictionary blocks file (.dct)
The tokens in an index granule are sorted and stored in dictionary blocks of 128 tokens each (the block size is configurable by parameter dictionary_block_size).
A dictionary blocks file (.dct) consists of all the dictionary blocks of all index granules in a part.
Index granules file (.idx)
The index granules file contains, for each dictionary block, the block's first token, its relative offset in the dictionary blocks file, and a bloom filter for all tokens in the block. This sparse index structure is similar to ClickHouse's sparse primary key index. The bloom filter allows skipping dictionary blocks early if the searched token is not contained in a dictionary block.
Postings lists file (.pst)
The posting lists for all tokens are laid out sequentially in the postings list file.
To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps.
If the cardinality of a posting list is less than 16 (configurable by parameter max_cardinality_for_embedded_postings), it is embedded into the dictionary.
Direct read
Certain types of text queries can be sped up significantly by an optimization called "direct read". More specifically, the optimization can be applied if the SELECT query does not project from the text column, i.e. the text column is used for filtering only.
Example:
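```sql
-- str appears only in the WHERE clause, not in the SELECT list,
-- so the query can be answered from the index alone
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```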
Supported functions
Example: Hackernews dataset
Let's look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments from the popular Hacker News website. Here is the table without a text index:
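```sql
-- a sketch showing only the columns used below; the real dataset has more
CREATE TABLE hackernews
(
    id UInt64,
    author String,
    timestamp DateTime,
    comment String
)
ENGINE = MergeTree
ORDER BY id;
```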
The 28.7M rows are in a Parquet file in S3 - let's insert them into the hackernews table:
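```sql
-- the URL is a placeholder for the actual dataset location
INSERT INTO hackernews
SELECT id, author, timestamp, comment
FROM s3('https://<bucket>/hackernews.parquet', 'Parquet');
```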
Consider the following simple search for the term ClickHouse (and its varied upper and lower cases) in the comment column:
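```sql
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse');
```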
Notice it takes 3 seconds to execute the query:
We will use ALTER TABLE to add a text index on the lowercase of the comment column, then materialize it (which can take a while - wait for it to finish):
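```sql
-- index name is a placeholder
ALTER TABLE hackernews ADD INDEX comment_idx(lower(comment)) TYPE text(tokenizer = 'default');
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;
```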
We run the same query...
...and notice the query executes 4x faster:
We can also search for one or all of multiple terms, i.e., disjunctions or conjunctions:
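```sql
-- hypothetical search terms; disjunction: at least one term matches
SELECT count() FROM hackernews WHERE searchAny(lower(comment), ['avx', 'sve']);

-- conjunction: all terms match
SELECT count() FROM hackernews WHERE searchAll(lower(comment), ['avx', 'sve']);
```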