Commit aeae291

Merge pull request #3384 from liyun95/v2.5.x
update fts
2 parents 21a5f36 + 4b04b6c commit aeae291

File tree

2 files changed

+35
-26
lines changed

assets/full-text-search.png

17.6 KB

site/en/userGuide/search-query-get/full-text-search.md

Lines changed: 35 additions & 26 deletions
@@ -16,43 +16,45 @@ By integrating full text search with semantic-based dense vector search, you can
1616

1717
</div>
1818

19-
## Overview
19+
## BM25 implementation
2020

21-
Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:
21+
Milvus provides full text search powered by BM25, a scoring function widely used in information retrieval systems. Milvus integrates BM25 into the search workflow so that text results are returned ranked by relevance.
2222

23-
1. **Text input**: You insert raw text documents or provide query text without any need for manual embedding.
23+
Full text search in Milvus follows the workflow below:
2424

25-
1. **Text analysis**: Milvus uses an [analyzer](analyzer-overview.md) to tokenize input text into individual, searchable terms.
25+
1. **Raw text input**: You insert text documents or provide a query as plain text; no embedding model is required.
2626

27-
1. **Function processing**: The built-in function receives tokenized terms and converts them into sparse vector representations.
27+
1. **Text analysis**: Milvus uses an [analyzer](analyzer-overview.md) to process your text into meaningful terms that can be indexed and searched.
2828

29-
1. **Collection store**: Milvus stores these sparse embeddings in a collection for efficient retrieval.
29+
1. **BM25 function processing**: A built-in function transforms these terms into sparse vector representations optimized for BM25 scoring.
3030

31-
1. **BM25 scoring**: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on relevance to the query text.
31+
1. **Collection store**: Milvus stores the resulting sparse embeddings in a collection for fast retrieval and ranking.
32+
33+
1. **BM25 relevance scoring**: At search time, Milvus applies the BM25 scoring function to compute document relevance and return ranked results that best match the query terms.
3234

3335
![Full Text Search](../../../../assets/full-text-search.png)
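
The scoring step of this workflow can be illustrated with a minimal, plain-Python sketch. It is not Milvus's implementation: the whitespace `tokenize` helper is a hypothetical stand-in for an analyzer, the `k1` and `b` values are commonly used BM25 defaults chosen here as assumptions, and the IDF term uses one common smoothed variant of the formula.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query text.

    Assumes `docs` is a non-empty list containing at least one non-empty text.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "information retrieval is a field of study",
    "the weather is sunny today",
    "retrieval of information with vector search",
]
scores = bm25_scores("information retrieval", docs)
```

In this sketch, the second document shares no terms with the query and therefore scores zero, while the two matching documents receive positive scores, with the shorter one ranked higher because of length normalization.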
3436

3537
To use full text search, follow these main steps:
3638

37-
1. [Create a collection](full-text-search.md#Create-a-collection-for-full-text-search): Set up a collection with necessary fields and define a function to convert raw text into sparse embeddings.
39+
1. [Create a collection](full-text-search.md#Create-a-collection-for-BM25-full-text-search): Set up the required fields and define a BM25 function that converts raw text into sparse embeddings.
3840

3941
1. [Insert data](full-text-search.md#Insert-text-data): Ingest your raw text documents to the collection.
4042

41-
1. [Perform searches](full-text-search.md#Perform-full-text-search): Use query texts to search through your collection and retrieve relevant results.
43+
1. [Perform searches](full-text-search.md#Perform-full-text-search): Use natural-language query text to retrieve ranked results based on BM25 relevance.
4244

43-
## Create a collection for full text search
45+
## Create a collection for BM25 full text search
4446

45-
To enable full text search, create a collection with a specific schema. This schema must include three necessary fields:
47+
To enable BM25-powered full text search, you must prepare a collection with the required fields, define a BM25 function to generate sparse vectors, configure an index, and then create the collection.
4648

47-
- The primary field that uniquely identifies each entity in a collection.
49+
### Define schema fields
4850

49-
- A `VARCHAR` field that stores raw text documents, with the `enable_analyzer` attribute set to `True`. This allows Milvus to tokenize text into specific terms for function processing.
51+
Your collection schema must include at least three required fields:
5052

51-
- A `SPARSE_FLOAT_VECTOR` field reserved to store sparse embeddings that Milvus will automatically generate for the `VARCHAR` field.
53+
- **Primary field**: Uniquely identifies each entity in the collection.
5254

53-
### Define the collection schema
55+
- **Text field** (`VARCHAR`): Stores raw text documents. You must set `enable_analyzer=True` so Milvus can process the text for BM25 relevance ranking. By default, Milvus uses the [`standard` analyzer](standard-analyzer.md) for text analysis. To configure a different analyzer, refer to [Analyzer Overview](analyzer-overview.md).
5456

55-
First, create the schema and add the necessary fields:
57+
- **Sparse vector field** (`SPARSE_FLOAT_VECTOR`): Stores sparse embeddings automatically generated by the BM25 function.
5658

5759
<div class="multipleCode">
5860
<a href="#python">Python</a>
@@ -72,9 +74,11 @@ client = MilvusClient(
7274

7375
schema = client.create_schema()
7476

75-
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
76-
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
77-
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
77+
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True) # Primary field
78+
# highlight-start
79+
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True) # Text field
80+
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR) # Sparse vector field; no dim required for sparse vectors
81+
# highlight-end
7882
```
7983

8084
```java
@@ -197,15 +201,19 @@ export schema='{
197201
}'
198202
```
199203

200-
In this configuration,
204+
In the preceding configuration,
201205

202206
- `id`: serves as the primary key and is automatically generated with `auto_id=True`.
203207

204-
- `text`: stores your raw text data for full text search operations. The data type must be `VARCHAR`, as `VARCHAR` is Milvus string data type for text storage. Set `enable_analyzer=True` to allow Milvus to tokenize the text. By default, Milvus uses the `standard`[ analyzer](standard-analyzer.md) for text analysis. To configure a different analyzer, refer to [Analyzer Overview](analyzer-overview.md).
208+
- `text`: stores your raw text data for full text search operations. The data type must be `VARCHAR`, which is the Milvus string data type for text storage.
205209

206210
- `sparse`: a vector field reserved to store internally generated sparse embeddings for full text search operations. The data type must be `SPARSE_FLOAT_VECTOR`.
207211

208-
Now, define a function that will convert your text into sparse vector representations and then add it to the schema:
212+
### Define the BM25 function
213+
214+
The BM25 function converts tokenized text into sparse vectors that support BM25 scoring.
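
To make the idea of a sparse vector concrete, the following is an illustrative sketch only. It assumes a simple vocabulary that maps each term to an integer dimension and uses raw term counts as values; the helper names `build_vocab` and `to_sparse` are hypothetical, and the representation Milvus generates internally may differ.

```python
from collections import Counter


def build_vocab(token_lists):
    # Assign each distinct term the next free integer dimension.
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse(tokens, vocab):
    # Keep only non-zero entries: {dimension index: term count}.
    counts = Counter(tokens)
    return {vocab[term]: float(count) for term, count in counts.items() if term in vocab}


token_lists = [
    ["milvus", "stores", "sparse", "vectors"],
    ["sparse", "sparse", "index"],
]
vocab = build_vocab(token_lists)
sparse_vec = to_sparse(token_lists[1], vocab)
```

Only the dimensions for terms that actually occur in a document are stored, which is why a sparse vector field does not need a fixed dimension.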
215+
216+
Define the function and add it to your schema:
209217

210218
<div class="multipleCode">
211219
<a href="#python">Python</a>
@@ -220,6 +228,7 @@ bm25_function = Function(
220228
name="text_bm25_emb", # Function name
221229
input_field_names=["text"], # Name of the VARCHAR field containing raw text data
222230
output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
231+
# highlight-next-line
223232
function_type=FunctionType.BM25, # Set to `BM25`
224233
)
225234

@@ -304,7 +313,7 @@ export schema='{
304313
</tr>
305314
<tr>
306315
<td><p><code>name</code></p></td>
307-
<td><p>The name of the function. This function converts your raw text from the <code>text</code> field into searchable vectors that will be stored in the <code>sparse</code> field.</p></td>
316+
<td><p>The name of the function. This function converts your raw text from the <code>text</code> field into BM25-compatible sparse vectors that will be stored in the <code>sparse</code> field.</p></td>
308317
</tr>
309318
<tr>
310319
<td><p><code>input_field_names</code></p></td>
@@ -316,19 +325,19 @@ export schema='{
316325
</tr>
317326
<tr>
318327
<td><p><code>function_type</code></p></td>
319-
<td><p>The type of the function to use. Set the value to <code>FunctionType.BM25</code>.</p></td>
328+
<td><p>The type of the function to use. Must be <code>FunctionType.BM25</code>.</p></td>
320329
</tr>
321330
</table>
322331

323332
<div class="alert note">
324333

325-
For collections with multiple `VARCHAR` fields requiring text-to-sparse-vector conversion, add separate functions to the collection schema, ensuring each function has a unique name and `output_field_names` value.
334+
If multiple `VARCHAR` fields require BM25 processing, define **one BM25 function per field**, each with a unique name and output field.
326335

327336
</div>
328337

329338
### Configure the index
330339

331-
After defining the schema with necessary fields and the built-in function, set up the index for your collection. To simplify this process, use `AUTOINDEX` as the `index_type`, an option that allows Milvus to choose and configure the most suitable index type based on the structure of your data.
340+
After defining the schema with necessary fields and the built-in function, set up the index for your collection.
332341

333342
<div class="multipleCode">
334343
<a href="#python">Python</a>
