site/en/userGuide/search-query-get/full-text-search.md
Lines changed: 35 additions & 26 deletions
Original file line number
Diff line number
Diff line change
@@ -16,43 +16,45 @@ By integrating full text search with semantic-based dense vector search, you can
16
16
17
17
</div>
18
18
19
-
## Overview
19
+
## BM25 implementation
20
20
21
-
Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:
21
+
Milvus provides full text search powered by BM25, a relevance-scoring function widely used in information retrieval systems. Milvus integrates BM25 into the search workflow to return relevance-ranked text results.
22
22
23
-
1. **Text input**: You insert raw text documents or provide query text without any need for manual embedding.
23
+
Full text search in Milvus follows the workflow below:
24
24
25
-
1. **Text analysis**: Milvus uses an [analyzer](analyzer-overview.md) to tokenize input text into individual, searchable terms.
25
+
1. **Raw text input**: You insert text documents or provide a query as plain text; no embedding model is required.
26
26
27
-
1. **Function processing**: The built-in function receives tokenized terms and converts them into sparse vector representations.
27
+
1. **Text analysis**: Milvus uses an [analyzer](analyzer-overview.md) to process your text into meaningful terms that can be indexed and searched.
28
28
29
-
1. **Collection store**: Milvus stores these sparse embeddings in a collection for efficient retrieval.
29
+
1. **BM25 function processing**: A built-in function transforms these terms into sparse vector representations optimized for BM25 scoring.
30
30
31
-
1. **BM25 scoring**: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on relevance to the query text.
31
+
1. **Collection store**: Milvus stores the resulting sparse embeddings in a collection for fast retrieval and ranking.
32
+
33
+
1. **BM25 relevance scoring**: At search time, Milvus applies the BM25 scoring function to compute each document's relevance to the query terms and returns the results ranked by that score.
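
To make the final scoring step concrete, the following is a minimal sketch of textbook BM25 in plain Python. It is not Milvus's internal implementation: the whitespace `tokenize` helper is a stand-in for an analyzer, the IDF formula is the common Lucene-style variant, and `k1=1.2` and `b=0.75` are conventional defaults assumed here for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for a Milvus analyzer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one textbook BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for terms in tokenized:
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is length-normalized with b.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "information retrieval is a field of study",
    "information retrieval focuses on finding relevant information",
    "data mining and information retrieval overlap in research",
]
scores = bm25_scores("relevant information", docs)
# Document indexes ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy corpus the second document ranks first, because it contains the rarer term `relevant` as well as two occurrences of `information`.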
32
34
33
35

34
36
35
37
To use full text search, follow these main steps:
36
38
37
-
1. [Create a collection](full-text-search.md#Create-a-collection-for-full-text-search): Set up a collection with necessary fields and define a function to convert raw text into sparse embeddings.
39
+
1. [Create a collection](full-text-search.md#Create-a-collection-for-BM25-full-text-search): Set up the required fields and define a BM25 function that converts raw text into sparse embeddings.
38
40
39
41
1. [Insert data](full-text-search.md#Insert-text-data): Ingest your raw text documents into the collection.
40
42
41
-
1. [Perform searches](full-text-search.md#Perform-full-text-search): Use query texts to search through your collection and retrieve relevant results.
43
+
1. [Perform searches](full-text-search.md#Perform-full-text-search): Use natural-language query text to retrieve ranked results based on BM25 relevance.
42
44
43
-
## Create a collection for full text search
45
+
## Create a collection for BM25 full text search
44
46
45
-
To enable full text search, create a collection with a specific schema. This schema must include three necessary fields:
47
+
To enable BM25-powered full text search, you must prepare a collection with the required fields, define a BM25 function to generate sparse vectors, configure an index, and then create the collection.
46
48
47
-
- The primary field that uniquely identifies each entity in a collection.
49
+
### Define schema fields
48
50
49
-
- A `VARCHAR` field that stores raw text documents, with the `enable_analyzer` attribute set to `True`. This allows Milvus to tokenize text into specific terms for function processing.
51
+
Your collection schema must include at least three required fields:
50
52
51
-
- A `SPARSE_FLOAT_VECTOR` field reserved to store sparse embeddings that Milvus will automatically generate for the `VARCHAR` field.
53
+
- **Primary field**: Uniquely identifies each entity in the collection.
52
54
53
-
### Define the collection schema
55
+
- **Text field** (`VARCHAR`): Stores raw text documents. Set `enable_analyzer=True` so that Milvus can process the text for BM25 relevance ranking. By default, Milvus uses the [`standard` analyzer](standard-analyzer.md) for text analysis. To configure a different analyzer, refer to [Analyzer Overview](analyzer-overview.md).
54
56
55
-
First, create the schema and add the necessary fields:
57
+
- **Sparse vector field** (`SPARSE_FLOAT_VECTOR`): Stores sparse embeddings automatically generated by the BM25 function.
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True) # Primary field
78
+
# highlight-start
79
+
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True) # Text field
80
+
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR) # Sparse vector field; no dim required for sparse vectors
81
+
# highlight-end
78
82
```
79
83
80
84
```java
@@ -197,15 +201,19 @@ export schema='{
197
201
}'
198
202
```
199
203
200
-
In this configuration,
204
+
In the preceding configuration,
201
205
202
206
- `id`: serves as the primary key and is automatically generated with `auto_id=True`.
203
207
204
-
- `text`: stores your raw text data for full text search operations. The data type must be `VARCHAR`, the Milvus string data type for text storage. Set `enable_analyzer=True` to allow Milvus to tokenize the text. By default, Milvus uses the [`standard` analyzer](standard-analyzer.md) for text analysis. To configure a different analyzer, refer to [Analyzer Overview](analyzer-overview.md).
208
+
- `text`: stores your raw text data for full text search operations. The data type must be `VARCHAR`, the Milvus string data type for text storage.
205
209
206
210
- `sparse`: a vector field reserved to store internally generated sparse embeddings for full text search operations. The data type must be `SPARSE_FLOAT_VECTOR`.
207
211
208
-
Now, define a function that will convert your text into sparse vector representations and then add it to the schema:
212
+
### Define the BM25 function
213
+
214
+
The BM25 function converts tokenized text into sparse vectors that support BM25 scoring.
215
+
216
+
Define the function and add it to your schema:
209
217
210
218
<div class="multipleCode">
211
219
<a href="#python">Python</a>
@@ -220,6 +228,7 @@ bm25_function = Function(
220
228
name="text_bm25_emb", # Function name
221
229
input_field_names=["text"], # Name of the VARCHAR field containing raw text data
222
230
output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
231
+
# highlight-next-line
223
232
function_type=FunctionType.BM25, # Set to `BM25`
224
233
)
225
234
@@ -304,7 +313,7 @@ export schema='{
304
313
</tr>
305
314
<tr>
306
315
<td><p><code>name</code></p></td>
307
-
<td><p>The name of the function. This function converts your raw text from the <code>text</code> field into searchable vectors that will be stored in the <code>sparse</code> field.</p></td>
316
+
<td><p>The name of the function. This function converts your raw text from the <code>text</code> field into BM25-compatible sparse vectors that will be stored in the <code>sparse</code> field.</p></td>
308
317
</tr>
309
318
<tr>
310
319
<td><p><code>input_field_names</code></p></td>
@@ -316,19 +325,19 @@ export schema='{
316
325
</tr>
317
326
<tr>
318
327
<td><p><code>function_type</code></p></td>
319
-
<td><p>The type of the function to use. Set the value to <code>FunctionType.BM25</code>.</p></td>
328
+
<td><p>The type of the function to use. Must be <code>FunctionType.BM25</code>.</p></td>
320
329
</tr>
321
330
</table>
322
331
323
332
<div class="alert note">
324
333
325
-
For collections with multiple `VARCHAR` fields requiring text-to-sparse-vector conversion, add separate functions to the collection schema, ensuring each function has a unique name and `output_field_names` value.
334
+
If multiple `VARCHAR` fields require BM25 processing, define **one BM25 function per field**, each with a unique name and output field.
326
335
327
336
</div>
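
As an illustration of the one-function-per-field rule in the note above, here is a small plain-Python planning sketch. `Bm25FunctionSpec` and `plan_bm25_functions` are hypothetical helpers that are not part of pymilvus, and the `_bm25_emb` and `_sparse` naming scheme is only an assumption. Each spec mirrors the `name`, `input_field_names`, and `output_field_names` arguments you would pass to a `Function` with `function_type=FunctionType.BM25`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bm25FunctionSpec:
    # Hypothetical plain-Python mirror of the Function(...) arguments.
    name: str
    input_field: str
    output_field: str


def plan_bm25_functions(text_fields):
    """Build one spec per VARCHAR field and check that names do not collide."""
    specs = [
        Bm25FunctionSpec(
            name=f"{field}_bm25_emb",        # assumed naming scheme
            input_field=field,
            output_field=f"{field}_sparse",  # assumed naming scheme
        )
        for field in text_fields
    ]
    names = [s.name for s in specs]
    outputs = [s.output_field for s in specs]
    if len(set(names)) != len(names) or len(set(outputs)) != len(outputs):
        raise ValueError("each BM25 function needs a unique name and output field")
    return specs


specs = plan_bm25_functions(["title", "body"])
```

Each output field in the plan also needs its own `SPARSE_FLOAT_VECTOR` field in the schema, alongside the `VARCHAR` field it is generated from.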
328
337
329
338
### Configure the index
330
339
331
-
After defining the schema with necessary fields and the built-in function, set up the index for your collection. To simplify this process, use `AUTOINDEX` as the `index_type`, an option that allows Milvus to choose and configure the most suitable index type based on the structure of your data.
340
+
After defining the schema with necessary fields and the built-in function, set up the index for your collection.