Quick Start
DashText is the sparse vector encoder (Sparse Vector Encoder) recommended by the vector retrieval service DashVector. DashText converts raw text into a sparse vector (Sparse Vector) representation using the BM25 algorithm, which greatly simplifies the use of DashVector's keyword-aware retrieval capability.
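To give an intuition for what a BM25-based sparse encoding looks like, here is a toy sketch: each token in a document is mapped to a hashed term id with a BM25 weight. This is only an illustration of the idea, not DashText's actual implementation; the hash function, tokenization, and BM25 parameters here are assumptions.

```python
import math
from collections import Counter

def bm25_encode(doc_tokens, corpus, k1=1.5, b=0.75):
    """Toy BM25 encoder: map each token to {hashed term id: BM25 weight}."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    df = Counter(t for d in corpus for t in set(d))   # document frequency per term
    weights = {}
    for token, freq in Counter(doc_tokens).items():
        idf = math.log((n - df[token] + 0.5) / (df[token] + 0.5) + 1.0)
        tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        weights[hash(token) & 0xFFFFFFFF] = idf * tf  # 32-bit hashed term id
    return weights

corpus = [
    ["vector", "retrieval", "service"],
    ["fully", "managed", "service"],
    ["vector", "engine", "kernel"],
]
sparse = bm25_encode(corpus[0], corpus)
print(sparse)  # three hashed term ids, each with a positive BM25 weight
```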
clarification
- You need to replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with your Cluster Endpoint for the code to run properly.
- This example only demonstrates Sparse Vector functionality; set the vector dimension according to your actual needs. For simplicity, the dense vector dimension is set to 4 in this example code.
Step1. Create a Collection that supports Sparse Vector.
Python Examples
import dashvector
client = dashvector.Client(api_key='YOUR_API_KEY', endpoint='YOUR_CLUSTER_ENDPOINT')
assert client
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')
assert ret
collection = client.get('hybrid_collection')
assert collection
Java Samples
import com.aliyun.dashvector.DashVectorClient;
import com.aliyun.dashvector.DashVectorCollection;
import com.aliyun.dashvector.models.requests.CreateCollectionRequest;
import com.aliyun.dashvector.models.responses.Response;
import com.aliyun.dashvector.proto.CollectionInfo.Metric;
import com.aliyun.dashvector.proto.CollectionInfo.DataType;
DashVectorClient client =
    new DashVectorClient("YOUR_API_KEY", "YOUR_CLUSTER_ENDPOINT");
CreateCollectionRequest request = CreateCollectionRequest.builder()
    .name("hybrid_collection")
    .dimension(4)
    .metric(Metric.dotproduct)
    .dataType(DataType.FLOAT)
    .build();
Response<Void> response = client.create(request);
System.out.println(response);
DashVectorCollection collection = client.get("hybrid_collection");
critical
Only the inner product metric (metric='dotproduct') supports the Sparse Vector function.
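The reason is that under the inner-product metric, the dense and sparse contributions can simply be added: the overall relevance of a Doc is the dense dot product plus the sparse dot product over matching term ids. A minimal sketch of this scoring follows; it is illustrative only, not DashVector's internal implementation, and the example weights are made-up values.

```python
def hybrid_dot_score(dense_q, dense_d, sparse_q, sparse_d):
    """Combined score = dense dot product + sparse dot product over shared term ids."""
    dense_score = sum(q * d for q, d in zip(dense_q, dense_d))
    sparse_score = sum(w * sparse_d[t] for t, w in sparse_q.items() if t in sparse_d)
    return dense_score + sparse_score

score = hybrid_dot_score(
    dense_q=[0.1, 0.1, 0.1, 0.1],
    dense_d=[0.1, 0.2, 0.3, 0.4],
    sparse_q={1169440797: 0.2947, 2045788977: 0.7053},
    sparse_d={1169440797: 0.8884, 2045788977: 0.8414},
)
print(score)
```

Term ids present in the query but absent from the document contribute nothing, which is what makes the sparse part keyword-aware.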
Step2. Create SparseVectorEncoder
Using the built-in Encoder
Python Examples
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder.default()
Java Samples
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
SparseVectorEncoder encoder = SparseVectorEncoder.getDefaultInstance();
clarification
The built-in Encoder is trained on a Chinese Wiki corpus and uses Jieba for Chinese word segmentation.
Create Encoder based on own corpus
Python Examples
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder()
# Your own (entire) corpus
corpus = [
    "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, and provides a cloud-native, fully managed vector retrieval service with horizontal expansion capability.",
    "DashVector delivers its powerful vector management, vector query, and other diverse capabilities through a simple and easy-to-use SDK/API interface for rapid integration by upper-tier AI applications.",
    "It provides efficient vector retrieval capabilities for a wide range of application scenarios, including large model ecology, multimodal AI search, and molecular structure analysis.",
    "Simple, flexible, and out-of-the-box, it is easy to integrate into upper-level AI applications.",
    "Schema-free design, using Schema to realize filtered queries under any combination of conditions."
]
# Train the Encoder on your own corpus
encoder.train(corpus)
Java Samples
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import java.util.*;
SparseVectorEncoder encoder = new SparseVectorEncoder();
// Your own (entire) corpus
List<String> corpus = Arrays.asList(
    "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, providing a cloud-native, fully managed vector retrieval service with horizontal expansion capabilities.",
    "DashVector delivers its powerful vector management, vector query, and other diverse capabilities through a simple and easy-to-use SDK/API interface for rapid integration by upper-tier AI applications.",
    "It provides efficient vector retrieval capabilities for a wide range of application scenarios, including large model ecology, multimodal AI search, and molecular structure analysis.",
    "Simple, flexible, and out-of-the-box, it is easy to integrate into upper-level AI applications.",
    "Schema-free design, using Schema to realize filtered queries under any combination of conditions."
);
// Train the Encoder on your own corpus
encoder.train(corpus);
clarification
- The built-in Encoder can be used directly without training on an original corpus, which makes it more convenient and gives it stronger generalization ability. However, its accuracy is weaker in scenarios where the corpus contains many domain-specific terms.
- An Encoder created from your own corpus must be pre-trained on the (entire) original corpus and offers higher accuracy. For details on using an Encoder created from your own corpus, see Advanced Use.
- Choose an Encoder according to your business scenario and needs. For domain-specific scenarios (with a large amount of exclusive vocabulary), creating an Encoder from your own corpus is recommended.
Step3. Insert Doc with Sparse Vector
Python Examples
from dashvector import Doc
document = "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, providing cloud-native, fully managed vector retrieval services with horizontal expansion capabilities."
doc_sparse_vector = encoder.encode_documents(document)
print(doc_sparse_vector)
# Output based on built-in Encoder.
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
collection.insert(Doc(
    id='A',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector=doc_sparse_vector
))
Java Samples
String document = "Vector retrieval service DashVector is based on AliCloud's self-developed high-efficiency vector engine Proxima kernel, providing cloud-native, fully managed vector retrieval services with horizontal expansion capabilities.";
Map<Long, Float> sparseVector = encoder.encodeDocuments(document);
System.out.println(sparseVector);
// Output based on built-in Encoder:
// {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build Doc with Sparse Vector
Doc doc = Doc.builder()
    .id("28")
    .sparseVector(sparseVector)
    .vector(vector)
    .build();
// Insert Doc with Sparse Vector
Response<Void> response = collection.insert(InsertDocRequest.builder().doc(doc).build());
Step4. Keyword-aware vector retrieval
Python Examples
query = "What is a vector retrieval service?"
sparse_vector = encoder.encode_queries(query)
print(sparse_vector)
# Output based on built-in Encoder.
# {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
docs = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector=sparse_vector
)
Java Samples
String query = "What is Vector Search Service?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on built-in Encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Construct QueryDocRequest
QueryDocRequest request = QueryDocRequest.builder()
    .vector(vector)
    .sparseVector(sparseVector)
    .topk(100)
    .includeVector(true)
    .build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
Step5. Weight-based keyword + semantic retrieval
Python Examples
from dashtext import combine_dense_and_sparse
query = "What is Vector Search Service?"
sparse_vector = encoder.encode_queries(query)
# weighting factor
alpha = 0.7
dense_vector = [0.1, 0.1, 0.1, 0.1]
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, alpha)
docs = collection.query(
    vector=scaled_dense_vector,
    sparse_vector=scaled_sparse_vector
)
Java Samples
String query = "What is Vector Search Service?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on built-in Encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
Vector denseVector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Weight the dense and sparse vectors by the alpha factor
float alpha = 0.1f;
sparseVector.replaceAll((key, value) -> value * (1 - alpha));
denseVector = Vector.builder().value(
    denseVector.getValue().stream().map(number -> number.floatValue() * alpha).collect(Collectors.toList())
).build();
// Construct QueryDocRequest
QueryDocRequest request = QueryDocRequest.builder()
    .vector(denseVector)
    .sparseVector(sparseVector)
    .topk(100)
    .includeVector(true)
    .build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
clarification
The parameter alpha is a weighting factor that controls the weighting of the dense vector distance and the sparse vector distance. alpha=0.0 means that only sparse vectors are used for the distance metric, and alpha=1.0 means that only dense vectors are used for the distance metric.
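Behaviorally, this weighting can be thought of as scaling the dense vector by alpha and the sparse weights by (1 - alpha), so the dot-product score becomes a convex combination of the dense and sparse scores. The sketch below shows that behavior (for real use, call dashtext.combine_dense_and_sparse; this standalone version is only an assumed-equivalent illustration):

```python
def combine(dense_vector, sparse_vector, alpha):
    """Scale dense values by alpha and sparse weights by (1 - alpha)."""
    scaled_dense = [v * alpha for v in dense_vector]
    scaled_sparse = {term_id: w * (1 - alpha) for term_id, w in sparse_vector.items()}
    return scaled_dense, scaled_sparse

sparse = {1169440797: 0.2947, 2045788977: 0.7053}
dense, scaled_sparse = combine([0.1, 0.1, 0.1, 0.1], sparse, alpha=0.7)
print(dense)          # each dense value scaled by 0.7
print(scaled_sparse)  # each sparse weight scaled by (1 - 0.7)

# alpha=0.0 zeroes out the dense part and keeps the sparse weights intact
zero_dense, full_sparse = combine([0.1, 0.1, 0.1, 0.1], sparse, alpha=0.0)
```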
API Reference
More DashText API details are available:
- Python SDK:/project/dashtext/