Search index
FAISS and ElasticSearch enable searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on an Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
This guide will show you how to build an index for your dataset that will allow you to search it.
FAISS
FAISS retrieves documents based on the similarity of their vector representations. In this example, you will generate the vector representations with the DPR model.
Download the DPR model from 🤗 Transformers:
>>> from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
>>> import torch
>>> torch.set_grad_enabled(False)
>>> ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
>>> ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
Load your dataset and compute the vector representations:
>>> from datasets import load_dataset
>>> ds = load_dataset('crime_and_punish', split='train[:100]')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
Create the index with Dataset.add_faiss_index():
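For example, using the embeddings column computed in the previous step:
>>> ds_with_embeddings.add_faiss_index(column='embeddings')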
Now you can query your dataset with the embeddings index. Load the DPR Question Encoder, and search for a question with Dataset.get_nearest_examples():
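A sketch of this step; the question text is only an example, and the question-encoder checkpoint is the counterpart of the context encoder loaded above:
>>> from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
>>> q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
>>> q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

>>> question = "Is it serious?"  # example question
>>> question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].numpy()
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=10)
>>> retrieved_examples["line"][0]  # the most relevant line of the book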
You can access the index with Dataset.get_index() and use it for special operations, e.g. query it using range_search:
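For instance, assuming the question_embedding computed above, a range_search on the underlying FAISS index could look like this (the threshold value is illustrative):
>>> faiss_index = ds_with_embeddings.get_index('embeddings').faiss_index  # the raw FAISS index object
>>> limits, distances, indices = faiss_index.range_search(x=question_embedding.reshape(1, -1), thresh=0.95)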
When you are done querying, save the index on disk with Dataset.save_faiss_index():
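For example (the file name is arbitrary):
>>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')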
Reload it at a later time with Dataset.load_faiss_index():
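For example, reloading the same dataset and attaching the index saved above:
>>> ds = load_dataset('crime_and_punish', split='train[:100]')
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')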
ElasticSearch
Unlike FAISS, ElasticSearch retrieves documents based on exact matches.
Start ElasticSearch on your machine, or see the ElasticSearch installation guide if you don't already have it installed.
Load the dataset you want to index:
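Any dataset with a text column works; as an example, the SQuAD validation split:
>>> from datasets import load_dataset
>>> squad = load_dataset('squad', split='validation')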
Build the index with Dataset.add_elasticsearch_index():
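For example, indexing the context column, assuming an ElasticSearch instance running locally on the default port:
>>> squad.add_elasticsearch_index("context", host="localhost", port="9200")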
Then you can query the context index with Dataset.get_nearest_examples():
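For example, with an illustrative query string:
>>> query = "machine"  # example query
>>> scores, retrieved_examples = squad.get_nearest_examples("context", query, k=10)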
If you want to reuse the index, define the es_index_name parameter when you build the index:
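For example (the index name is arbitrary):
>>> squad = load_dataset('squad', split='validation')
>>> squad.add_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context")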
Reload it later with the index name when you call Dataset.load_elasticsearch_index():
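For example, using the index name chosen above:
>>> from datasets import load_dataset
>>> squad = load_dataset('squad', split='validation')
>>> squad.load_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context")
>>> scores, retrieved_examples = squad.get_nearest_examples("context", "machine", k=10)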
For more advanced ElasticSearch usage, you can specify your own configuration with custom settings:
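A sketch of a custom setup; the client, analyzer settings, and index name below are illustrative and assume a local ElasticSearch instance:
>>> from elasticsearch import Elasticsearch

>>> es_client = Elasticsearch([{"host": "localhost", "port": "9200"}])  # custom ElasticSearch client
>>> es_config = {
...     "settings": {
...         "number_of_shards": 1,
...         "analysis": {"analyzer": {"stop_standard": {"type": "standard", "stopwords": "_english_"}}},
...     },
...     "mappings": {"properties": {"text": {"type": "text", "analyzer": "standard", "similarity": "BM25"}}},
... }  # custom index settings and mappings
>>> squad.add_elasticsearch_index("context", es_client=es_client, es_config=es_config, es_index_name="hf_squad_val_context")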