Quick Start
DashText is the sparse vector encoder (Sparse Vector Encoder) recommended by the vector retrieval service DashVector. DashText converts raw text into a sparse vector (Sparse Vector) representation using the BM25 algorithm, which greatly simplifies the use of DashVector's keyword-aware retrieval capability.
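To give an intuition for what a BM25-based sparse encoding looks like, here is a toy sketch: each token in a document is mapped to a hashed term id with a BM25 weight. This is only an illustration of the idea, not DashText's actual implementation; the hash function, tokenization, and BM25 parameters here are assumptions.

```python
import math
from collections import Counter

def bm25_encode(doc_tokens, corpus, k1=1.5, b=0.75):
    """Toy BM25 encoder: map each token to {hashed term id: BM25 weight}."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    df = Counter(t for d in corpus for t in set(d))   # document frequency per term
    weights = {}
    for token, freq in Counter(doc_tokens).items():
        idf = math.log((n - df[token] + 0.5) / (df[token] + 0.5) + 1.0)
        tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        weights[hash(token) & 0xFFFFFFFF] = idf * tf  # 32-bit hashed term id
    return weights

corpus = [
    ["vector", "retrieval", "service"],
    ["fully", "managed", "service"],
    ["vector", "engine", "kernel"],
]
sparse = bm25_encode(corpus[0], corpus)
print(sparse)  # three hashed term ids, each with a positive BM25 weight
```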
clarification
- You need to replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with your Cluster Endpoint for the code to run properly.
- This example only demonstrates Sparse Vector functionality; set the vector dimension according to your actual needs. For simplicity, the dense vector dimension is set to 4 in this example code.
Step1. Create a Collection that supports Sparse Vector.
Python Examples
import dashvector
client = dashvector.Client(api_key='YOUR_API_KEY', endpoint='YOUR_CLUSTER_ENDPOINT')
assert client
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')
assert ret
collection = client.get('hybrid_collection')
assert collection
Java Samples
import com.aliyun.dashvector.DashVectorClient;
import com.aliyun.dashvector.DashVectorCollection;
import com.aliyun.dashvector.models.requests.CreateCollectionRequest;
import com.aliyun.dashvector.models.responses.Response;
import com.aliyun.dashvector.proto.CollectionInfo.Metric;
import com.aliyun.dashvector.proto.CollectionInfo.DataType;
DashVectorClient client =
    new DashVectorClient("YOUR_API_KEY", "YOUR_CLUSTER_ENDPOINT");
CreateCollectionRequest request = CreateCollectionRequest.builder()
    .name("hybrid_collection")
    .dimension(4)
    .metric(Metric.dotproduct)
    .dataType(DataType.FLOAT)
    .build();
Response<Void> response = client.create(request);
System.out.println(response);
DashVectorCollection collection = client.get("hybrid_collection");
critical
Only the inner product metric (metric='dotproduct') supports the Sparse Vector function.
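The reason is that under the inner-product metric, the dense and sparse contributions can simply be added: the overall relevance of a Doc is the dense dot product plus the sparse dot product over matching term ids. A minimal sketch of this scoring follows; it is illustrative only, not DashVector's internal implementation, and the example weights are made-up values.

```python
def hybrid_dot_score(dense_q, dense_d, sparse_q, sparse_d):
    """Combined score = dense dot product + sparse dot product over shared term ids."""
    dense_score = sum(q * d for q, d in zip(dense_q, dense_d))
    sparse_score = sum(w * sparse_d[t] for t, w in sparse_q.items() if t in sparse_d)
    return dense_score + sparse_score

score = hybrid_dot_score(
    dense_q=[0.1, 0.1, 0.1, 0.1],
    dense_d=[0.1, 0.2, 0.3, 0.4],
    sparse_q={1169440797: 0.2947, 2045788977: 0.7053},
    sparse_d={1169440797: 0.8884, 2045788977: 0.8414},
)
print(score)
```

Term ids present in the query but absent from the document contribute nothing, which is what makes the sparse part keyword-aware.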
Step2. Create SparseVectorEncoder
Using the built-in Encoder
Python Examples
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder.default()
Java Samples
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
SparseVectorEncoder encoder = SparseVectorEncoder.getDefaultInstance();
clarification
The built-in Encoder is trained on a Chinese Wiki corpus and uses Jieba for Chinese word segmentation.
Create Encoder based on own corpus
Python Examples
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder()
# Your own (entire) corpus
corpus = [
    "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, and provides a cloud-native, fully managed vector retrieval service with horizontal expansion capability.",
    "DashVector delivers its powerful vector management, vector query, and other diverse capabilities through a simple and easy-to-use SDK/API interface for rapid integration by upper-tier AI applications.",
    "It provides efficient vector retrieval capabilities for a wide range of application scenarios, including large model ecology, multimodal AI search, and molecular structure analysis.",
    "Simple, flexible, and out-of-the-box, it is easy to integrate into upper-level AI applications.",
    "Schema-free design, using Schema to realize filtered queries under any combination of conditions."
]
# Train the Encoder on your own corpus
encoder.train(corpus)
Java Samples
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import java.util.*;
SparseVectorEncoder encoder = new SparseVectorEncoder();
// Your own (entire) corpus
List<String> corpus = Arrays.asList(
    "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, providing a cloud-native, fully managed vector retrieval service with horizontal expansion capabilities.",
    "DashVector delivers its powerful vector management, vector query, and other diverse capabilities through a simple and easy-to-use SDK/API interface for rapid integration by upper-tier AI applications.",
    "It provides efficient vector retrieval capabilities for a wide range of application scenarios, including large model ecology, multimodal AI search, and molecular structure analysis.",
    "Simple, flexible, and out-of-the-box, it is easy to integrate into upper-level AI applications.",
    "Schema-free design, using Schema to realize filtered queries under any combination of conditions."
);
// Train the Encoder on your own corpus
encoder.train(corpus);
clarification
- The built-in Encoder can be used directly without training on an original corpus, which makes it more convenient and gives it stronger generalization ability. However, its accuracy is weaker in scenarios where the corpus contains many domain-specific terms.
- An Encoder created from your own corpus must be pre-trained on the (entire) original corpus and offers higher accuracy. For details on using an Encoder created from your own corpus, see Advanced Use.
- Choose an Encoder according to your business scenario and needs. For domain-specific scenarios (with a large amount of exclusive vocabulary), creating an Encoder from your own corpus is recommended.
Step3. Insert Doc with Sparse Vector
Python Examples
from dashvector import Doc
document = "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, providing cloud-native, fully managed vector retrieval services with horizontal expansion capabilities."
doc_sparse_vector = encoder.encode_documents(document)
print(doc_sparse_vector)
# Output based on built-in Encoder.
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
collection.insert(Doc(
    id='A',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector=doc_sparse_vector
))
Java Samples
String document = "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, providing cloud-native, fully managed vector retrieval services with horizontal expansion capabilities.";
Map<Long, Float> sparseVector = encoder.encodeDocuments(document);
System.out.println(sparseVector);
// Output based on built-in Encoder:
// {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build Doc with Sparse Vector
Doc doc = Doc.builder()
    .id("28")
    .sparseVector(sparseVector)
    .vector(vector)
    .build();
// Insert Doc with Sparse Vector
Response<Void> response = collection.insert(InsertDocRequest.builder().doc(doc).build());
Step4. Keyword-aware vector retrieval
Python Examples
query = "What is a vector retrieval service?"
sparse_vector = encoder.encode_queries(query)
print(sparse_vector)
# Output based on built-in Encoder.
# {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
docs = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector=sparse_vector
)
Java Samples
String query = "What is Vector Search Service?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on built-in Encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Construct QueryDocRequest
QueryDocRequest request = QueryDocRequest.builder()
    .vector(vector)
    .sparseVector(sparseVector)
    .topk(100)
    .includeVector(true)
    .build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
Step5. Weight-based keyword + semantic retrieval
Python Examples
from dashtext import combine_dense_and_sparse
query = "What is Vector Search Service?"
sparse_vector = encoder.encode_queries(query)
# weighting factor
alpha = 0.7
dense_vector = [0.1, 0.1, 0.1, 0.1]
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, alpha)
docs = collection.query(
    vector=scaled_dense_vector,
    sparse_vector=scaled_sparse_vector
)
Java Samples
String query = "What is Vector Search Service?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on built-in Encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
Vector denseVector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Weight the dense and sparse vectors by the alpha factor
float alpha = 0.1f;
sparseVector.replaceAll((key, value) -> value * (1 - alpha));
denseVector = Vector.builder().value(
    denseVector.getValue().stream().map(number -> number.floatValue() * alpha).collect(Collectors.toList())
).build();
// Construct QueryDocRequest
QueryDocRequest request = QueryDocRequest.builder()
    .vector(denseVector)
    .sparseVector(sparseVector)
    .topk(100)
    .includeVector(true)
    .build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
clarification
The parameter alpha is a weighting factor that controls the weighting of the dense vector distance and the sparse vector distance. alpha=0.0 means that only sparse vectors are used for the distance metric, and alpha=1.0 means that only dense vectors are used for the distance metric.
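Behaviorally, this weighting can be thought of as scaling the dense vector by alpha and the sparse weights by (1 - alpha), so the dot-product score becomes a convex combination of the dense and sparse scores. The sketch below shows that behavior (for real use, call dashtext.combine_dense_and_sparse; this standalone version is only an assumed-equivalent illustration):

```python
def combine(dense_vector, sparse_vector, alpha):
    """Scale dense values by alpha and sparse weights by (1 - alpha)."""
    scaled_dense = [v * alpha for v in dense_vector]
    scaled_sparse = {term_id: w * (1 - alpha) for term_id, w in sparse_vector.items()}
    return scaled_dense, scaled_sparse

sparse = {1169440797: 0.2947, 2045788977: 0.7053}
dense, scaled_sparse = combine([0.1, 0.1, 0.1, 0.1], sparse, alpha=0.7)
print(dense)          # each dense value scaled by 0.7
print(scaled_sparse)  # each sparse weight scaled by (1 - 0.7)

# alpha=0.0 zeroes out the dense part and keeps the sparse weights intact
zero_dense, full_sparse = combine([0.1, 0.1, 0.1, 0.1], sparse, alpha=0.0)
```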
API Reference
More DashText API details are available:
- Python SDK:/project/dashtext/