Hello everyone

At a glance

A community member is struggling to combine two indices built from different data loaders, competitor_index (built with BeautifulSoupWebReader) and index (built with YoutubeTranscriptReader), so that both can be queried at the same time. Community members suggest two approaches:

1. Put all the data into a single index, either by appending the documents into one list first or by inserting them one by one:

index = GPTSimpleVectorIndex([])
for doc in documents:
    index.insert(doc)
for doc in competitor_documents:
    index.insert(doc)

2. Use a graph index, which is designed to be "an Index on top of Indices". This allows keeping the indices separate and using a summary on top of each index, querying the separate indices when needed, rather than having all the data in one index.

The community members discuss the advantages of the graph index approach, noting that it provides a better structure than having all the data in one index.

mmeeffe

Hello everyone,

I'm struggling to combine 2 data loaders into one index. How do I merge competitor_index with index so I can query both at the same time? competitor_index uses the bs4 data connector, index uses the YouTube data connector.

from llama_index import (
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    GPTSimpleVectorIndex,
    download_loader
)
from langchain.chat_models import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = 'xxx'

max_input_size = 4096
num_output = 512
max_chunk_overlap = 200
temperature = 0

# define prompt helper
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# define LLM
llm_predictor = LLMPredictor(
    llm=ChatOpenAI(temperature=temperature, model_name="gpt-3.5-turbo", max_tokens=num_output))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")
competitor_loader = BeautifulSoupWebReader()
competitor_documents = competitor_loader.load_data(
    urls=['https://url1.com', 'https://url2.com', 'https://url3.com'])
competitor_index = GPTSimpleVectorIndex.from_documents(competitor_documents, service_context=service_context)

YoutubeTranscriptReader = download_loader("YoutubeTranscriptReader")
loader = YoutubeTranscriptReader()
documents = loader.load_data(ytlinks=['https://www.youtube.com/watch?v=xxx',
                                      'https://www.youtube.com/watch?v=xxx',
                                      'https://www.youtube.com/watch?v=xxx'])
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

combined_competitor_index_and_index = ???

7 comments

ppikachu8887867

Hi! Maybe what you want is to first append all your data into one list and then create a single index?
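The suggestion above can be sketched as follows. This is a minimal illustration, not the llama_index API: `Doc` and `merge_documents` are hypothetical stand-ins, and the sample texts are made up. In the question's setup, the merged list would presumably be passed to `GPTSimpleVectorIndex.from_documents`, as the original code already does for each loader separately.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document object returned by a loader."""
    text: str
    source: str


def merge_documents(*doc_lists):
    """Concatenate the outputs of several loaders into one flat list."""
    merged = []
    for docs in doc_lists:
        merged.extend(docs)
    return merged


# Made-up sample data standing in for the two loaders' results.
documents = [Doc("transcript A", "youtube"), Doc("transcript B", "youtube")]
competitor_documents = [Doc("page text", "web")]

all_documents = merge_documents(documents, competitor_documents)

# Assumption: with the question's setup, the next step would be
# GPTSimpleVectorIndex.from_documents(all_documents, service_context=service_context)
```

Keeping a `source` field on each document is one way to still tell the two origins apart after merging.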

mmeeffe

or perhaps this?

index = GPTSimpleVectorIndex([])
for doc in documents:
    index.insert(doc)
for doc in competitor_documents:
    index.insert(doc)

ppikachu8887867

Yes, something like this. Not sure if you can call the insert function directly on GPTSimpleVectorIndex, though.

ppikachu8887867

another approach might be using a graph. It is designed to be "an Index on top of Indices". So you can create separate indices (like you are doing now) and then combine them inside a graph index

mmeeffe

What are the advantages over the current solution?

ppikachu8887867

With a graph, you can keep your indices separate, put a summary on top of each index, and add a description on the graph itself. You then use the graph only when you need to query across the indices, and query a single index directly otherwise. I think this is a better structure than having all your data in one index.
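To make the "index on top of indices" idea concrete, here is a toy sketch in plain Python. It is not the llama_index graph API: `ToyIndex`, `ToyGraph`, the word-overlap matching, and the sample texts are all hypothetical, chosen only to show how per-index summaries can route a question to the relevant index.

```python
from dataclasses import dataclass


def _words(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


@dataclass
class ToyIndex:
    """Hypothetical stand-in for one index: a summary plus its texts."""
    summary: str
    texts: list

    def query(self, question):
        # Return every text that shares at least one word with the question.
        q = _words(question)
        return [t for t in self.texts if q & _words(t)]


@dataclass
class ToyGraph:
    """Toy 'index on top of indices' that routes by summary overlap."""
    children: list

    def query(self, question):
        q = _words(question)
        # Select child indices whose summary shares a word with the question.
        selected = [c for c in self.children if q & _words(c.summary)]
        # If no summary matches, fall back to querying every child.
        selected = selected or self.children
        results = []
        for child in selected:
            results.extend(child.query(question))
        return results


video_index = ToyIndex(
    "youtube video transcripts",
    ["our video explains pricing", "video about onboarding"],
)
competitor_index = ToyIndex(
    "competitor web pages",
    ["competitor pricing page", "competitor blog post"],
)
graph = ToyGraph([video_index, competitor_index])

routed = graph.query("competitor pricing")  # only the competitor index is consulted
broad = graph.query("pricing")              # no summary matches, so both are consulted
```

Each index stays usable on its own (`video_index.query(...)`), while the graph handles questions that span both sources.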

mmeeffe

good point! thanks
