Full-text Search is a PostgreSQL® feature that facilitates the indexing of natural language text documents and the identification of indexed documents that match a given query. Matching documents can be sorted by their relevance to the query, and document excerpts can be generated with the matching terms highlighted. A set of SQL data types, operators, and functions is provided to assist with the indexing, querying, and ranking of documents.
PostgreSQL uses the term document to mean any fragment of natural language text: essentially, strings containing human-readable words separated by whitespace and punctuation. Documents are often stored as text columns, but can also be generated dynamically, such as by concatenating multiple columns together (even from multiple tables).
```sql
-- concatenating multiple columns to form a document.
-- the table contains chapters from eBooks, one row per chapter
SELECT book_title || ' ' || book_author || ' ' || title || ' ' || body_html AS document
FROM chapters
```
Some examples of natural language text are blog posts (such as the one you’re reading now), books, essays, user comments, forum posts, social media messages, emails, newsgroup and chat messages, newspaper and magazine articles, and product descriptions in a catalog. These types of documents are typical candidates for full-text search indexing.
Note that not all human-readable strings contain natural language. For example, usernames, passwords, and URLs are often human-readable, but don’t typically contain natural language.
What’s Not to Like About LIKE?
The text data type has several operators for performing basic string pattern matching, notably LIKE/ILIKE (SQL wildcard matches, case-sensitive and case-insensitive, respectively), SIMILAR TO (SQL regex), and ~ (POSIX regex).
While it is possible to use these operators to perform very basic searches, pattern matching has several limitations that make it less than ideal for implementing useful searches. These operators lack linguistic support—such as understanding the structure of text (including punctuation), recognizing word variants and synonyms, and ignoring frequently used words. They have no ability to rank results based on relevance to the query, and, critically, they can be slow due to limited indexing support.
To explore some of these limitations, let’s look at some typical requirements for a comprehensive search feature of an application/website:
- Users enter search terms that will be converted into queries against the database, and the results are displayed back to the user.
- Search should be case-insensitive.
- Words in the query should match variants (such as suffixes) of that word in the document, e.g. ‘cat’ should match ‘cats’ (and vice-versa).
- Documents with related words/synonyms should be found, e.g. documents containing ‘feline’ or ‘kitten’ should be found when searching for ‘cat’.
- Phrases can be searched for (often surrounded by double-quotes, e.g. "the fat black cat").
- Users can flag certain words to be excluded (e.g., by prefixing the word with a dash: cat -fat).
- The results are ordered by some sort of relevance metric relating to the user’s query, e.g. if one document contains the word ‘cat’ (or variants thereof) multiple times, and another document only mentions ‘cat’ once, the first document is ranked higher.
These requirements have been kept intentionally vague, as they often depend on the specifics of the application.
For the example queries below, the following table definition and data were used:
```sql
CREATE TABLE example AS
SELECT * FROM (
    VALUES
        ('the fat black cat chased after the rat'),
        ('one cat, two cats, many cats!'),
        ('the kitten and the dog played together'),
        ('that is one fine feline'),
        ('it is raining cats and dogs'),
        ('after eating the whole lasagne, he was a fat cat.'),
        ('don''t go into the catacombs after dark'),
        ('the bobcat has a spotted coat'),
        ('check the library catalog'),
        ('for a filesystem with deduplication look at zfs'),
        ('add one or more predicates to the query')
) AS t (document);
```
Were the application to perform searches using the ILIKE operator (which does case-insensitive matching), a user’s query for ‘cat’ could be rewritten in SQL to resemble ILIKE '%cat%'. This would find every document containing the string ‘cat’. However, this query will also find documents with words containing the substring ‘cat’. Some matches, such as ‘cats’ would be considered relevant to the query. Others, such as ‘cathode’, ‘catalog’, ‘deduplication’, and ‘predicate’ are probably irrelevant results.
```sql
-- this finds too many irrelevant results.
SELECT document, document ILIKE '%cat%' AS matches FROM example;
```
```
+-------------------------------------------------+-------+
|document                                         |matches|
+-------------------------------------------------+-------+
|the fat black cat chased after the rat           |true   |
|one cat, two cats, many cats!                    |true   |
|the kitten and the dog played together           |false  |
|that is one fine feline                          |false  |
|it is raining cats and dogs                      |true   |
|after eating the whole lasagne, he was a fat cat.|true   |
|don't go into the catacombs after dark           |true   |
|the bobcat has a spotted coat                    |true   |
|check the library catalog                        |true   |
|for a filesystem with deduplication look at zfs  |true   |
|add one or more predicates to the query          |true   |
+-------------------------------------------------+-------+
```
Trying to narrow down the query with ILIKE '% cat %' doesn’t improve things either. While words such as ‘catacombs’ and ‘deduplication’ will no longer match, ‘cats’ won’t match either; sentences starting or ending with ‘cat’ are missed, and punctuation other than spaces causes problems too: ‘cat!’, ‘cat,’, and ‘cat.’ are all ignored.
```sql
-- too specific. doesn't even match 'cats'.
SELECT document, document ILIKE '% cat %' AS matches FROM example;
```
```
+-------------------------------------------------+-------+
|document                                         |matches|
+-------------------------------------------------+-------+
|the fat black cat chased after the rat           |true   |
|one cat, two cats, many cats!                    |false  |
|the kitten and the dog played together           |false  |
|that is one fine feline                          |false  |
|it is raining cats and dogs                      |false  |
|after eating the whole lasagne, he was a fat cat.|false  |
|don't go into the catacombs after dark           |false  |
|the bobcat has a spotted coat                    |false  |
|check the library catalog                        |false  |
|for a filesystem with deduplication look at zfs  |false  |
|add one or more predicates to the query          |false  |
+-------------------------------------------------+-------+
```
ILIKE also doesn’t provide any assistance for matching synonyms or related words. If the application wanted to find documents containing ‘feline’, ‘kitten’, and ‘cat’ for the query ‘cat’, additional ILIKE operators would need to be OR’d together in the SQL. Some words can have many variants, which makes this a laborious approach.
```sql
-- this still has all the other shortcomings too, such as '%cat%' matching too many results
SELECT document,
       (document ILIKE '%cat%' OR document ILIKE '%kitten%' OR document ILIKE '%feline%') AS matches
FROM example;
```
Lastly, ILIKE only returns a boolean indicating if a string matches the pattern. It does not provide a scoring metric that can be used to rank results by relevance.
Similar problems exist with the regex operators. While they are more powerful than LIKE, and certain shortcomings of LIKE can be fixed with creative regex patterns, they still perform pattern matching, and have the same fundamental limitations.
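As a sketch of how far creative patterns can take you, PostgreSQL’s POSIX regexes support the word-boundary escapes \m (beginning of a word) and \M (end of a word), which remove the substring noise, but stemming, synonyms, and ranking are still missing (this runs against the example table defined above):

```sql
-- word-boundary anchors: 'catacombs', 'catalog', and 'deduplication'
-- no longer match, but neither do 'kitten' or 'feline', and 'cats'
-- only matches where 'cat' also appears as a whole word
SELECT document, document ~* '\mcat\M' AS matches FROM example;
```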
To work around the limitations of these operators, you would likely end up tediously reimplementing large parts of PostgreSQL’s built-in full-text search! Instead of doing that, let’s explore what PostgreSQL has to offer.
Full Text Search Fundamentals
Getting Documents Ready for Search
PostgreSQL’s full-text search does not operate directly on documents stored using the text data type. Instead, documents need to be first converted into the tsvector data type, which is a format that is optimized for search.
To convert a document stored as text into a tsvector, the to_tsvector function should be used:
to_tsvector([ config regconfig, ] document text) → tsvector
The to_tsvector function performs a number of processing steps on the document text. At a very high level to_tsvector first breaks the document down into words using a parser. Each word is then looked up against one or more dictionaries. A dictionary is a mapping of words to their normalized forms. These normalized forms are called lexemes. If the word matches a dictionary entry, that entry’s lexeme is added to the tsvector.
The resulting tsvector is an alphabetically sorted set of lexemes present in the source document. Each lexeme in the tsvector also includes position information. This is a list of integers representing the location of each source word. This position information is required for phrase searching and is useful for ordering matches based on proximity ranking.
The process of normalizing words into lexemes is dictionary-specific, but this almost always includes case-folding (converting UPPERCASE and Title-Case words into lowercase) and the removal of suffixes to reduce words to their root form (e.g. ‘cats’ becomes ‘cat’). Application-specific dictionaries can be created, and these can be used to customize the normalization process (e.g. to map domain-specific synonyms and phrases to a common term, such as ‘kitten’ and ‘feline’ to ‘cat’). This normalization process makes it fast and easy to find all variants of a word without needing to specify each variant in the query.
Dictionaries also typically don’t contain entries for extremely common words (such as ‘the’ and ‘and’). These common words are called stop words and aren’t terribly useful for searching. By skipping them, the generated tsvector will be smaller, which improves performance.
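Both behaviours are easy to observe directly. As a quick sketch, the built-in ts_lexize function applies a single dictionary to a single word, and a document consisting only of stop words produces an empty tsvector:

```sql
-- the english stemmer reduces 'cats' to its root lexeme
SELECT ts_lexize('english_stem', 'cats');   -- {cat}

-- stop words are dropped entirely
SELECT to_tsvector('english', 'The And A'); -- empty tsvector
```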
The choice of parser and dictionaries is called a configuration, and it can be specified via the config parameter of to_tsvector. This parameter is optional, but it is recommended to always specify it to ensure the intended configuration is used. If omitted, the global default_text_search_config value will be used. Various default configurations exist (\dF in psql will show a list), though you will likely want to customize the configuration.
See Section 20.11.2 and Section 12.1.3 of the PostgreSQL documentation for more details on the default_text_search_config option and text search configuration in general.
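Outside of psql, the available configurations and the current default can also be inspected with plain SQL (output varies by installation):

```sql
-- the active default configuration
SHOW default_text_search_config;

-- all configurations known to this database
SELECT cfgname FROM pg_ts_config;
```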
Here is an example of calling to_tsvector on a fragment of text using the built-in english configuration:
```sql
SELECT to_tsvector('english',
    'Hello world. ' ||
    'The quick brown fox jumped over the lazy dog. ' ||
    'Cats and kittens (and dogs too) love to chase mice. ' ||
    'Hello world!')
```
This returns the following tsvector:
```
'brown':5 'cat':12 'chase':20 'dog':11,16 'fox':6 'hello':1,22 'jump':7 'kitten':14 'lazi':10 'love':18 'mice':21 'quick':4 'world':2,23
```
All the lexemes have been transformed to lower case, and several words have been reduced to their root (‘cats’ → ‘cat’). The punctuation marks (‘!’ and ‘.’) and stop words (‘the’, ‘and’, and ‘to’) have been removed. Words that exist multiple times in the input (‘dog’, ‘hello’, and ‘world’) have more than one position entry.
Creating custom configurations and dictionaries is outside the scope of this article.
Writing Queries
PostgreSQL’s full-text search uses a specialized mini query language (DSL) that permits writing complex and advanced queries combining boolean logic, grouping, and phrase/proximity search. A query written in this DSL is stored using the tsquery data type. Queries stored as tsquery can be evaluated against a document’s tsvector to determine if the document matches the query.
The DSL consists of single lexemes, separated using operators. Parentheses can be used to group operators and lexemes, to force precedence/binding. See Section 8.11.2 of the PostgreSQL documentation for detailed documentation on the tsquery DSL.
| Symbol | Operator | Arity | Precedence |
|--------|----------|-------|------------|
| `&` | Boolean AND | Binary | Low |
| `\|` | Boolean OR | Binary | Lowest |
| `!` | Boolean NOT | Unary | Highest |
| `<N>` | Followed By (where N is a positive number) | Binary | High |
| `<->` | Followed By (equivalent to `<1>`) | Binary | High |
Boolean AND tests that both lexemes are present in the document. A match will be found if both lexemes exist.
Boolean OR tests that 1 or both lexemes are present in the document. A match will be found if either lexeme exists.
Boolean NOT tests that the lexeme is not present in the document. A match will be found if the lexeme does not exist.
Followed By tests that both lexemes are present in the document, and that the distance between the 2 lexemes matches the given value. A match will be found when the 2 lexemes exist and are exactly N words apart.
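The operators above can be sketched against a single document. Note that the ::tsquery casts below assume already-normalized lexemes (see the casting gotcha later in this article), which is why lowercase root forms are used:

```sql
-- with the built-in english configuration, all four columns return true
SELECT v @@ 'cat & rat'::tsquery     AS and_match,    -- both lexemes present
       v @@ 'cat | dog'::tsquery     AS or_match,     -- 'cat' is present
       v @@ 'cat & !dog'::tsquery    AS not_match,    -- 'dog' is absent
       v @@ 'fat <-> black'::tsquery AS phrase_match  -- adjacent words
FROM to_tsvector('english', 'the fat black cat chased after the rat') AS v;
```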
A selection of functions exist to convert query text into a tsquery: to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery.
Each of these functions will perform normalization and stop word removal on the input query text, using the same process as to_tsvector. Parsing of the query text depends on the function.
to_tsquery
to_tsquery([ config regconfig, ] querytext text) → tsquery
to_tsquery understands the full DSL. It expects the input to be well-formed and follow the tsquery syntax. Any syntax errors will become SQL exceptions.
```sql
SELECT to_tsquery('english', 'The & Quick & Brown & Foxes')
```
```
'quick' & 'brown' & 'fox'
```
plainto_tsquery
plainto_tsquery([ config regconfig, ] querytext text) → tsquery
plainto_tsquery returns a query that tests if all the non-stop-word lexemes are present in the document. It parses the query text using the same process as to_tsvector. The resulting lexemes are combined using the Boolean AND operator.
```sql
SELECT plainto_tsquery('english', 'Jumped Over The Lazy Dogs')
```
```
'jump' & 'lazi' & 'dog'
```
phraseto_tsquery
phraseto_tsquery([ config regconfig, ] querytext text) → tsquery
phraseto_tsquery returns a query that tests if the given phrase exists in the document. This is useful for “phrase search”. It behaves similarly to plainto_tsquery, except that the lexemes are combined using the Followed By operator. If stop words exist between lexemes in the document, the Followed By operator will account for this.
```sql
SELECT phraseto_tsquery('english', 'Jumped Over The Lazy Dogs')
```
```
'jump' <3> 'lazi' <-> 'dog'
```
In this example, ‘Over’ and ‘The’ are considered stop words. ‘Lazy’ is the 3rd word after ‘Jumped’, hence the Followed By operator between lexemes ‘jump’ and ‘lazi’ has a distance value of 3. ‘Dogs’ immediately follows ‘Lazy’, so the Followed By operator between lexemes ‘lazi’ and ‘dog’ has a distance of 1.
websearch_to_tsquery
websearch_to_tsquery([ config regconfig, ] querytext text) → tsquery
websearch_to_tsquery works a bit differently from the other to_tsquery functions. It understands an alternate syntax, a basic variant of what is commonly available on web search engines. It combines parts of plainto_tsquery and phraseto_tsquery, while also understanding some basic operators.
Unquoted phrases will be converted to lexemes combined using Boolean AND (as in plainto_tsquery). Double-quoted phrases will be converted to lexemes combined using Followed By (as in phraseto_tsquery). or and - (a dash) are converted to Boolean OR and Boolean NOT, respectively.
```sql
SELECT query_text, websearch_to_tsquery(query_text)
FROM (VALUES
    ('the quick brown fox'),
    ('"the quick brown fox"'),
    ('brown or fox'),
    ('"quick brown" fox -jumped')
) AS t(query_text)
```
```
+-------------------------+-------------------------------------+
|query_text               |websearch_to_tsquery                 |
+-------------------------+-------------------------------------+
|the quick brown fox      |'quick' & 'brown' & 'fox'            |
|"the quick brown fox"    |'quick' <-> 'brown' <-> 'fox'        |
|brown or fox             |'brown' | 'fox'                      |
|"quick brown" fox -jumped|'quick' <-> 'brown' & 'fox' & !'jump'|
+-------------------------+-------------------------------------+
```
websearch_to_tsquery is likely the best of the built-in functions for generating a tsquery from user input.
Note that “phrase search” queries that use Followed By operators, such as those generated by phraseto_tsquery and websearch_to_tsquery, can sometimes return unexpected results. See Inexact Phrase Search below for details.
Evaluating Queries
A tsquery can be evaluated against a tsvector, testing if the document matches the query. Evaluation is done using the @@ SQL operator. This operator takes a tsquery and a tsvector as operands (either order), and returns a Boolean indicating whether a match occurred.
tsvector @@ tsquery → boolean
tsquery @@ tsvector → boolean
For example, here a document is searched for a word using a single-lexeme tsquery:
```sql
WITH input AS (
    SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.') AS document,
           to_tsquery('english', 'Dogs') AS query
)
SELECT document, query, document @@ query AS matches FROM input;
```
```
+-----------------------------------------------------+-----+-------+
|document                                             |query|matches|
+-----------------------------------------------------+-----+-------+
|'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2|'dog'|true   |
+-----------------------------------------------------+-----+-------+
```
Here, ‘Dogs’ matched ‘dog’ because ‘dogs’ in the query is normalized to the lexeme ‘dog’, which also exists in the document.
As expected, if the tsquery doesn’t match the tsvector, the @@ operator will return false.
```sql
WITH input AS (
    SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.') AS document,
           websearch_to_tsquery('english', 'cats or kittens') AS query
)
SELECT document, query, document @@ query AS matches FROM input;
```
```
+-----------------------------------------------------+----------------+-------+
|document                                             |query           |matches|
+-----------------------------------------------------+----------------+-------+
|'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2|'cat' | 'kitten'|false  |
+-----------------------------------------------------+----------------+-------+
```
Switching the query to something slightly more complex, here a document is searched for a phrase:
```sql
WITH input AS (
    SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.') AS document,
           phraseto_tsquery('english', 'Jumped Over The Lazy') AS query
)
SELECT document, query, document @@ query AS matches FROM input;
```
```
+-----------------------------------------------------+-----------------+-------+
|document                                             |query            |matches|
+-----------------------------------------------------+-----------------+-------+
|'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2|'jump' <3> 'lazi'|true   |
+-----------------------------------------------------+-----------------+-------+
```
And a match is found, as expected.
Storing and Indexing Documents
So far, we’ve seen how to convert documents into tsvectors, query text into tsqueries, and how to combine the two to determine if a document matches a query. However, the examples have only worked with a single document, and always included the document text inline.
The more typical use case for full-text search is to search an entire table worth of documents in one go:
```sql
SELECT title FROM documents WHERE to_tsvector(body) @@ to_tsquery('...')
```
Without the appropriate indexes, this will run to_tsvector over the contents of the column on every SELECT. Running it over every row of a large table for every search will result in abysmal performance.
Adding an expression index is one way to improve performance (see Section 12.2.2. Creating Indexes of the PostgreSQL documentation). However, we consider the approach outlined below, storing the tsvector in a separate, generated column, to have a number of benefits over a pure expression-index approach, chiefly better performance: with an expression index, to_tsvector may need to be re-run on candidate rows to verify index matches, whereas a stored column is computed once when the row is written.
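For completeness, here is a minimal sketch of the expression-index alternative (the index name is illustrative, and the table matches the hypothetical documents table used above). The index stores the result of the expression, and a query must repeat the exact same expression, including the configuration, for the planner to consider the index:

```sql
-- expression index: no extra column required
CREATE INDEX documents_body_tsv_idx
    ON documents USING GIN (to_tsvector('english', body));

-- the identical expression must appear in the query to use the index
SELECT title FROM documents
WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', 'cat');
```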
Since tsvector is a data type, a column of type tsvector can be created on a table, and the results of to_tsvector can be stored in this column. Hence, it is typical to store each document’s tsvector alongside the original:
```sql
CREATE TABLE documents (
    name text PRIMARY KEY,
    body text NOT NULL,
    tsv tsvector NOT NULL
);
```
Marking the tsvector column as GENERATED (PostgreSQL 12 and later) simplifies things for client applications when they perform an INSERT or UPDATE. For older versions of PostgreSQL without generated column support, two convenience trigger functions are included that are designed to generate the tsvector from other columns in a table (see Section 12.4.3. Triggers for Automatic Updates of the PostgreSQL documentation). Of course, a custom trigger function can be used as well. The specific pros and cons of GENERATED columns compared to BEFORE INSERT OR UPDATE triggers are outside the scope of this article. However, we consider generated columns easier to grok, and they should be used unless their limitations force the use of trigger functions.
```sql
-- using GENERATED
CREATE TABLE documents_generated (
    name text PRIMARY KEY,
    body text NOT NULL,
    tsv tsvector NOT NULL
        GENERATED ALWAYS AS (to_tsvector('english', name || ' ' || body)) STORED
);
```
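On PostgreSQL versions before 12, the built-in tsvector_update_trigger function can maintain the column instead. A sketch against the documents table from above (on PostgreSQL 10 and older, write EXECUTE PROCEDURE instead of EXECUTE FUNCTION):

```sql
-- recompute tsv from name and body on every INSERT or UPDATE
CREATE TRIGGER documents_tsv_update
    BEFORE INSERT OR UPDATE ON documents
    FOR EACH ROW
    EXECUTE FUNCTION tsvector_update_trigger(tsv, 'pg_catalog.english', name, body);
```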
An index can then be created over the tsvector column to speed up queries:
```sql
CREATE INDEX textsearch_idx ON documents USING GIN (tsv);
```
Once the tsvector column is set up, it can be used in queries like normal:
```sql
SELECT name FROM documents WHERE tsv @@ to_tsquery('...')
```
And that’s all there is to it for basic text search. To summarize, there are 3 main steps involved:
- Convert each document into a tsvector using to_tsvector. Typically, this is done once on INSERT/UPDATE and the resulting tsvector is stored in a column alongside the original document. Indexes on this column speed up searches.
- Query text is converted to a tsquery using one of the ...to_tsquery functions.
- Evaluate the tsquery against the document tsvectors using the @@ operator.
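Putting the three steps together, here is a sketch of a complete search against the documents_generated table defined above, using ts_rank (a built-in ranking function that scores a tsvector against a tsquery) to order results by relevance:

```sql
SELECT name,
       ts_rank(tsv, query) AS rank
FROM documents_generated,
     websearch_to_tsquery('english', 'cat -dog') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```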
Some Gotchas
Casting Text to tsvector/tsquery
It is possible to cast text directly to tsvector and tsquery. However, this probably won’t do what you expect!
Like most PostgreSQL data types, tsvector and tsquery have a compact binary format and a textual representation. The cast from text expects a string in the tsvector/tsquery textual representation format (see Section 8.11. Text Search Types of the PostgreSQL documentation). If the string is well-formed, the cast will succeed, but no stop word removal, word normalization, or other preprocessing steps will occur; the lexemes in the input are assumed to have already been normalized.
```sql
WITH input AS (
    SELECT 'the QuIcK brown!! fox'::tsvector AS document,
           plainto_tsquery('quick brown fox') AS query
)
SELECT document, query, document @@ query AS matches FROM input
```
```
+-----------------------------+-------------------------+-------+
|document                     |query                    |matches|
+-----------------------------+-------------------------+-------+
|'QuIcK' 'brown!!' 'fox' 'the'|'quick' & 'brown' & 'fox'|false  |
+-----------------------------+-------------------------+-------+
```
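Running the same document through to_tsvector instead of the cast produces a properly normalized tsvector, and the query then matches as expected:

```sql
WITH input AS (
    SELECT to_tsvector('english', 'the QuIcK brown!! fox') AS document,
           plainto_tsquery('english', 'quick brown fox') AS query
)
SELECT document, query, document @@ query AS matches FROM input;
-- document is now 'brown':3 'fox':4 'quick':2, and matches is true
```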
Inexact Phrase Search
Phrase searches may behave unexpectedly when the query text includes stop words.
In the following example, the built-in english configuration considers ‘Over’ and ‘The’ to be stop words, and phraseto_tsquery replaces them with a Followed By with Distance 3 (<3>) operator. As a result, this query will match any document that has the lexeme ‘jump’ followed by any two words, followed by ‘lazi’, for example:
```sql
WITH input AS (
    SELECT to_tsvector('english', 'The quick brown fox jumped past a lazy dog. ') AS document,
           phraseto_tsquery('english', 'Jumped Over The Lazy') AS query
)
SELECT document, query, document @@ query AS matches FROM input;
```
```
+--------------------------------------------------------------+-----------------+-------+
|document                                                      |query            |matches|
+--------------------------------------------------------------+-----------------+-------+
|'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'past':6 'quick':2|'jump' <3> 'lazi'|true   |
+--------------------------------------------------------------+-----------------+-------+
```
Here, “Jumped Over The Lazy” matches “jumped past a lazy dog”. The match is successful because the tsvector contains the lexemes ‘jump’ and ‘lazi’, at positions 5 and 8, respectively – a distance of 3 words.
Limitations
There are a number of limitations to be aware of when using PostgreSQL full-text search.

| Limit | Value | Behaviour |
|-------|-------|-----------|
| Size of each individual lexeme | 2047 bytes | SQL ERROR raised |
| Total size of a tsvector, including lexemes and positional information (binary/on-disk representation; for details see ts_type.h) | 1048575 bytes (~1MiB) | SQL ERROR raised |
| Total number of lexemes per tsvector | 2^64 | Unknown (size limit is reached first) |
| Total number of position entries per lexeme per tsvector | 255 | Position entries after the 255th are discarded |
| Maximum value of a lexeme position entry | 16383 | Values greater than 16383 are rounded to 16383 |
| Total number of lexemes and operators (nodes) per tsquery | 32768 | SQL ERROR raised |
Unless you’re trying to index an article about the world’s longest English word (the chemical name for Titin, which is ~189819 letters long!), it’s unlikely that you’ll hit the lexeme size limit. However, indexing large fragments of text may hit the tsvector size limit of ~1MiB. Note that this limit is based on the size of the binary representation (used to store a tsvector in memory and on disk). A typical workaround is to split the document into chunks (e.g. for books, split by chapter).
If your application intends to use phrase search, pay careful attention to the tsvector limits around lexeme positions. No error or warning is emitted when either of these limits is reached while generating the tsvector. A tsvector can only store 255 position entries per lexeme; if a lexeme occurs more than 255 times in a document, only the first 255 positions will be stored. Each position entry also has a maximum value, which is capped at 16383; if a lexeme occurs after the 16383rd word in the document, its position will be rounded down to 16383. Note that stop words, while not included in the tsvector, still influence lexeme positions.
Total Number of Position Entries per Lexeme
In the following example, the word ‘fox’ is repeated 300 times in the source document, yet only 255 positions are recorded in the tsvector. (unnest is a built-in utility function that expands a tsvector into rows so that various queries can be run against it.)
```sql
WITH document AS (
    SELECT repeat('fox ', 300) AS body
), lexemes AS (
    SELECT (unnest(to_tsvector(document.body))).* FROM document
), positions AS (
    SELECT lexeme, unnest(positions) AS position FROM lexemes
)
SELECT * FROM positions
```
```
+------+--------+
|lexeme|position|
+------+--------+
|fox   |1       |
|fox   |2       |
|fox   |3       |
|fox   |4       |
<snip>
|fox   |254     |
|fox   |255     |
+------+--------+
```
Maximum Value of a Lexeme Position Entry
Here, a document consisting of ‘hello’, followed by 20000 copies of ‘the’, and ending with ‘world’, is converted to a tsvector. ‘the’ was chosen as the filler because it is a stop word. While a lexeme for ‘the’ is not present in the tsvector, it still influences the lexeme positions of ‘hello’ and ‘world’. ‘world’ is the 20,002nd word in the document, yet because the maximum position value is exceeded, PostgreSQL rounds its position down to 16383:
```sql
WITH document AS (
    SELECT 'hello ' || repeat('the ', 20000) || 'world' AS body
), lexemes AS (
    SELECT (unnest(to_tsvector(document.body))).* FROM document
), positions AS (
    SELECT lexeme, unnest(positions) AS position FROM lexemes
)
SELECT * FROM positions
```
```
+------+--------+
|lexeme|position|
+------+--------+
|hello |1       |
|world |16383   |
+------+--------+
```
Phrase Search and Large Document Limitations Example
Phrase search can work unreliably on large fragments of text because lexeme position information will either be wrong (rounded to 16383) or missing entirely, as shown in the following example:
```sql
SELECT name,
       to_tsvector(t.body) @@ phraseto_tsquery('hello world') AS phrase_matches,
       to_tsvector(t.body) @@ plainto_tsquery('hello world') AS plain_matches
FROM (VALUES
    ('phrase', 'hello world'),
    ('positions discarded', repeat('hello ', 300) || ' hello world'),
    ('positions rounded', repeat('the ', 20000) || ' hello world')
) AS t(name, body)
```
```
+-------------------+--------------+-------------+
|name               |phrase_matches|plain_matches|
+-------------------+--------------+-------------+
|phrase             |true          |true         |
|positions discarded|false         |true         |
|positions rounded  |false         |true         |
+-------------------+--------------+-------------+
```
In both the positions discarded and positions rounded tests, phrase_matches is false because the Followed By operator (as generated by phraseto_tsquery) will check the tsvector and, based on available information, incorrectly conclude that there is no instance where a ‘world’ lexeme directly follows a ‘hello’ (i.e. has a position difference of 1).
The positions discarded phrase test fails because only the first 255 position entries of the ‘hello’ lexeme are retained in the tsvector. The last recorded ‘hello’ position is 255, while ‘world’ has a position of 302, a difference of 47.
The positions rounded phrase test fails because the 20000 instances of the ‘the’ stop word influence the position values of the ‘hello’ and ‘world’ lexemes. Both position values overflow and get rounded to 16383, and hence have a difference of 0.
Give PostgreSQL Full-Text Search a Try
The full-text search features built into PostgreSQL are more than capable of providing basic text search functionality to an application. Developers often reach first for heavyweight solutions, such as OpenSearch. If you’re already using PostgreSQL as your application data store, try the built-in text search before adding additional dependencies to your application. It is easier to keep the indexed documents in sync with the originals when they’re stored alongside each other, and generated columns simplify this even further. Full-text search queries can be combined with other SQL operators, enabling powerful querying, filtering, and sorting capabilities, all within one database.
If you are interested in trying out PostgreSQL’s full-text search then check out Instaclustr’s managed PostgreSQL, which has full support for full-text search. You can try it without entering a credit card with our free trial today by visiting our console.