I'm not a contributor here, but I've used gpt-index quite a bit for large knowledge bases, so maybe I can provide some insight.
What sort of things are you looking to get out of the news articles?
Today my goal is to get anything out of it I can.
Long term, I want a self-writing newspaper.
So my goal is to get an output that is a news article focused on the differences between sources, so that it reads as glass half full vs. half empty, mainly showing how differently the news is presented between sources.
Your end goal is to write an article over these existing articles?
So you want this app to use the existing 100+ news articles that you have indexed to generate new news content?
So this is my teach-myself AI project from a year ago, when all I really knew were some buzzwords, so I'm not sure how achievable all of it is yet, but it looks close. If you look at a news topic between, say, FOX and CNN, they will both say there is a glass of water in the room, but FOX feels it's half empty because of ABC and CNN thinks it's half full because of XYZ.
That was the initial idea I was thinking of at the start. There are other use cases IF I can get a "Smart News DB" to work: fact checking, accuracy stats, a knock-knock joke ticker based on current news headlines per John Oliver's request.
No, that's just a small sample I use to see if the code works. Long term it gets hooked up to a news API or a scraper.
Hmm, to be honest I'm not sure an index-based engine is the best underlying approach for something like content writing. Indexes like these are excellent for knowledge-base tasks, but not great for creativity or for rewriting after sampling the index.
Ultimately, your app has to understand what a news article looks like from those various samples, but when you query an index, there's only so much information you can get at query time.
But what you said afterwards sounds like a good starting point to continue with this
Fact checking, accuracy stats,
This is pretty much exactly what GPTSimpleVectorIndex is great for: given the indexed news articles, you can query the index to answer questions about them.
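To make the "index the articles, then ask questions" idea concrete, here is a rough sketch of what a vector index does at query time. It is NOT how GPTSimpleVectorIndex is implemented: `ToyVectorIndex`, `embed`, and the sample articles are made up for illustration, and word counts stand in for the learned embeddings a real index would get from an embedding model. A real setup would also pass the retrieved text to an LLM to write the answer.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index: stores one vector per article."""

    def __init__(self, articles):
        # articles is a list of (source, text) pairs
        self.entries = [(source, text, embed(text)) for source, text in articles]

    def retrieve(self, question, top_k=2):
        """Return the top_k articles most similar to the question."""
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(source, text) for source, text, _ in ranked[:top_k]]


# Made-up sample data, not real news
articles = [
    ("Outlet A", "The city council approved the new water budget on Tuesday."),
    ("Outlet B", "Critics say the water budget approved by the council ignores rural areas."),
    ("Outlet C", "The local football team won its third game in a row."),
]

index = ToyVectorIndex(articles)
top = index.retrieve("What happened with the water budget?", top_k=2)
for source, text in top:
    print(f"{source}: {text}")
```

For fact checking, the useful part is that retrieval pulls back the matching passages from each source, so you can put them side by side before asking a model anything.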
What would the difference be between GPTSimpleVectorIndex and langchain's VectorStoreIndex at this point for my project?
@Kren GPTSimpleVectorIndex is happening in memory here, where you are chunking your data and storing it. I guess with VectorStoreIndex in langchain you will be storing the chunks in external vector stores such as Pinecone or FAISS. Alternatively, GPTIndex (LlamaIndex) also has connectors. Correct me if I am wrong here @jerryjliu0
yeah vector store abstractions are supported in langchain + other toolkits too. creating a vector index in gpt index allows you to plug it in with the rest of the index ecosystem (defining different types of indices, composability, etc.). we also optimize the experience to be really simple - just load in a bunch of documents, we can chunk it up under the hood and give you responses given your query
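For anyone wondering what "chunk it up under the hood" roughly means, here is a minimal sketch of overlapping chunking. The function name `chunk_text`, the word-based splitting, and the sizes are all assumptions for illustration; the library's own splitter may work on tokens and use different defaults.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Made-up 120-word "article" so the chunk boundaries are easy to see
article = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(article, chunk_size=50, overlap=10)
print(len(chunks), "chunks")
```

The overlap is there so a sentence that straddles a boundary still shows up whole in at least one chunk.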
What format should my data be in for best results? As of now it's CSV for convenience. Would this work better if it was individual documents, a KG, or JSON objects?
you could try two things:
1) use the csv reader, load into an unstructured index (e.g. vector index)
2) load data into a sql db, use our sql index
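A rough sketch of how the same CSV could feed both options, using only the Python standard library. The column names (`source`, `headline`, `body`) and the rows are made up, and this does not use the library's CSV reader or SQL index: it only shows the data shape each route would start from.

```python
import csv
import io
import sqlite3

# Made-up sample CSV; the column names are an assumption about the data
CSV_DATA = """source,headline,body
Outlet A,Budget passes,The council approved the water budget.
Outlet B,Budget under fire,Critics say the water budget ignores rural areas.
Outlet A,Team wins,The local team won again.
"""

rows = list(csv.DictReader(io.StringIO(CSV_DATA)))

# Option 1: flatten each row into one text document for an unstructured (vector) index
documents = [f"{r['source']}: {r['headline']}\n{r['body']}" for r in rows]

# Option 2: load the rows into a SQL table so structured questions become queries
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (source TEXT, headline TEXT, body TEXT)")
conn.executemany("INSERT INTO articles VALUES (:source, :headline, :body)", rows)
counts = conn.execute(
    "SELECT source, COUNT(*) FROM articles GROUP BY source ORDER BY source"
).fetchall()

print(documents[0])
print(counts)
```

Option 1 suits open-ended questions about what the articles say; option 2 suits things like accuracy stats, where you want counts and filters per source.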