Updated 4 months ago

Hello everyone

At a glance

The community member is struggling to combine two data loaders, competitor_index (using BeautifulSoupWebReader) and index (using YoutubeTranscriptReader), into a single index to be able to query both at the same time. The community members suggest two approaches:

1. Append all the data into one list and create a single index:

index = GPTSimpleVectorIndex([])
for doc in documents:
    index.insert(doc)
for doc in competitor_documents:
    index.insert(doc)

2. Use a graph index, which is designed to be "an Index on top of Indices". This allows keeping the indices separate and using a summary on top of each index, querying the separate indices when needed, rather than having all the data in one index.

The community members discuss the advantages of the graph index approach, noting that it provides a better structure than having all the data in one index.

Hello everyone,

I'm struggling to combine two data loaders into one index. How can I merge competitor_index with index so that I can query both at the same time? competitor_index uses the bs4 data connector; index uses the YouTube data connector.

from llama_index import (
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    GPTSimpleVectorIndex,
    download_loader
)
from langchain.chat_models import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = 'xxx'

max_input_size = 4096
num_output = 512
max_chunk_overlap = 200
temperature = 0

# define prompt helper
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# define LLM
llm_predictor = LLMPredictor(
    llm=ChatOpenAI(temperature=temperature, model_name="gpt-3.5-turbo", max_tokens=num_output))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

# load competitor web pages and index them
BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")
competitor_loader = BeautifulSoupWebReader()
competitor_documents = competitor_loader.load_data(
    urls=['https://url1.com', 'https://url2.com', 'https://url3.com'])
competitor_index = GPTSimpleVectorIndex.from_documents(competitor_documents, service_context=service_context)

# load YouTube transcripts and index them
YoutubeTranscriptReader = download_loader("YoutubeTranscriptReader")
loader = YoutubeTranscriptReader()
documents = loader.load_data(ytlinks=['https://www.youtube.com/watch?v=xxx',
                                      'https://www.youtube.com/watch?v=xxx',
                                      'https://www.youtube.com/watch?v=xxx'])
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

combined_competitor_index_and_index = ???
7 comments
Hi! Maybe what you want is to first append all your data into one list and then create a single index?
or perhaps this?

index = GPTSimpleVectorIndex([])
for doc in documents:
    index.insert(doc)
for doc in competitor_documents:
    index.insert(doc)
Yes, something like this. Not sure if you can call the insert function directly on GPTSimpleVectorIndex, though.
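The first suggestion (append everything into one list, then build a single index) could be sketched roughly as below. The helper name build_combined_index is hypothetical and not part of llama_index; it only assumes that the index class exposes a from_documents classmethod, as GPTSimpleVectorIndex does in the code from the question.

```python
from typing import Any, Iterable, List


def build_combined_index(index_cls: Any, *document_lists: Iterable[Any], **kwargs: Any) -> Any:
    """Flatten several document lists and build ONE index from them.

    Hypothetical helper: it relies only on `index_cls.from_documents`,
    the same classmethod the question already calls.
    """
    # Concatenate all lists, preserving their order.
    all_documents: List[Any] = [doc for docs in document_lists for doc in docs]
    return index_cls.from_documents(all_documents, **kwargs)


# Intended usage with the names from the question (not executed here):
# combined_index = build_combined_index(
#     GPTSimpleVectorIndex, documents, competitor_documents,
#     service_context=service_context,
# )
```

This avoids the question of whether insert can be called on an empty index, since the index is built once from the merged list.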
Another approach might be using a graph. It is designed to be "an Index on top of Indices". So you can create separate indices (like you are doing now) and then combine them inside a graph index.
What are the advantages over the current solution?
In a graph, you can keep your indices separate, put a summary on top of each index, and add a description on the graph itself. You then use the graph only when you need to query across the indices, and query a single index directly otherwise. I think this is a better structure than having all your data in one index.
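To make the "index on top of indices" idea concrete, here is a toy, pure-Python sketch. It is not the llama_index graph API: SubIndex, ToyGraph, and the word-overlap matching are all assumptions made up for illustration. It only shows the structure described above, where each sub-index carries a summary and the top level uses those summaries to decide which sub-index to query.

```python
import re
from dataclasses import dataclass
from typing import List, Set


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class SubIndex:
    """Toy stand-in for one index: a summary plus its documents."""
    name: str
    summary: str
    documents: List[str]

    def query(self, question: str) -> List[str]:
        # Return documents sharing at least one word with the question.
        q = _words(question)
        return [d for d in self.documents if q & _words(d)]


@dataclass
class ToyGraph:
    """Toy 'index on top of indices': routes a question by summary overlap."""
    children: List[SubIndex]

    def route(self, question: str) -> SubIndex:
        # Pick the child whose summary shares the most words with the question.
        q = _words(question)
        return max(self.children, key=lambda c: len(q & _words(c.summary)))

    def query(self, question: str) -> List[str]:
        return self.route(question).query(question)


videos = SubIndex(
    "videos",
    "transcripts of our youtube videos",
    ["video one covers onboarding", "video two covers pricing"],
)
competitors = SubIndex(
    "competitors",
    "web pages from competitor sites",
    ["competitor A lists pricing tiers", "competitor B blog on features"],
)
graph = ToyGraph([videos, competitors])
```

Each SubIndex can still be queried on its own, and the graph is only consulted when the question does not obviously belong to one source. A real setup would use embeddings and LLM-written summaries instead of word overlap.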
good point! thanks