Plugin PT15 (Positional Token 15)
PT15, short for Positional Token 15, is an indexing and search algorithm that leverages token position within a document to enhance search relevance.
It is inspired by Thomas Wilkerling’s work on Flexsearch.
The algorithm splits a document into tokens and stores them in 15 predefined positional buckets.
Each bucket represents a relative position within the text, scaling based on the document length.
When searching, the position of matching tokens is considered, giving higher scores to tokens found in earlier positions.
This approach prioritizes the placement of key terms, making the search results more relevant for queries where token order and position matter.
Installation
You can install the plugin using any major Node.js package manager.
Usage
This plugin will replace the default scoring algorithm (BM25) with the PT15 algorithm, making search faster and the index size smaller.
And that’s it! The Orama plugin will do the rest for you.
PT15 vs BM25, pros and cons
QPS and BM25 are two very different scoring algorithms, and they offer pros and cons depending on the use case.
PT15 Pros
-
Position-Aware Scoring: PT15 takes into account the position of tokens in a document, allowing for more contextually relevant search results, particularly in cases where the order and placement of words matter (e.g., titles or headlines).
-
Efficient Token Search: By storing tokens in predefined positional buckets, PT15 can quickly search through documents based on both token presence and position, providing fast and efficient search results.
-
Boosts Key Positions: Tokens in earlier or more prominent positions (e.g., the start of a sentence or document) can be given higher importance, which is useful for scoring queries where the position of a term plays a role in relevance.
-
Handling of Large Documents: PT15 scales token positions in documents longer than 15 tokens, ensuring that even long documents are mapped into the 15-position scheme, maintaining efficient indexing regardless of document size.
-
Partial Token Matching: PT15 can handle partial matches by storing token parts (prefixes) within each positional bucket, enabling the algorithm to support queries even if only a part of a token is present.
PT15 Cons
-
Limited to 15 Positions: With a fixed maximum of 15 positional buckets, PT15 may oversimplify token positioning for longer documents, potentially losing precision for content where token positions beyond the 15th are important.
-
Position Bias: The algorithm inherently gives higher weight to tokens that appear earlier in the document. While beneficial in some cases, this can bias the search results toward the beginning of documents, which may not always reflect true relevance for longer or complex documents.
How to choose
When deciding between PT15 and BM25, consider the nature of your content and the importance of token position in your search queries.
Use PT15 when:
- You need to prioritize the position of tokens in your search results.
- Token order and placement are crucial for relevance.
- You want to boost the importance of tokens in specific positions.
- You are working with shorter documents or queries where token position is a key factor.
BM25 may be more suitable when:
- Token position is less critical for relevance.
- You are dealing with long-form content or technical documentation.
- You need a more balanced approach to scoring that considers multiple relevance factors.
- You want to maintain a more traditional scoring algorithm for search results.