Updated 3 months ago

Modify documents

hey team - would love some help here...

the airtable loader is giving my documents like this.

', 'Areas of Improvement': ['Making and changing plans'], 'Source': 'Elon Musk by Ashlee Vance\n\n', 'Quotes': 'Musk also trained employees to make the right trade-offs between spending money and productivity… ‘He would say that everything we did was a function of our burn rate and that we were burning through a hundred thousand dollars per day… Sometimes he wouldn’t let you buy a part for two thousand dollars because he expected you to find it cheaper or invent something cheaper. Other times, he wouldn’t flinch at renting a plane for ninety thousand dollars to get something to Kwaj because it saved an entire workday, so it was worth it. He would place this urgency that he expected the revenue in ten years to be ten million dollars a day and that every day we were slower to achieve our goals was a day of missing out on that money.’\n', 'People (Raw)': ['Elon Musk']}},

I want each node to be just the quote. Anyone got any idea how to do that in Python? I am trying to do it but am being told documents is not subscriptable.
22 comments
If you have a list of documents, you can do something like this

import json

for doc in documents:
  text = json.loads(doc.text)
  doc.text = text['Quotes']


Not 100% sure that will work, but something like that maybe? If the json.loads doesn't work you might have to do some raw string parsing instead, using split(), or regex
Hmm, I get this when trying:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
Yeaaaa so it's not valid json then 🥲
So what is it ..?
Just a string haha
So you'll need to use either a bunch of splits (kinda risky) or a regex (probably safer) to extract the section you want from the string
Gosh OK. So the documents variable is just a string?
document.text is just a string yea 😅

if it was formatted as proper json, json.loads() would convert the string, but it seems like it's not proper json format
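A note on why `json.loads` fails here: the text looks like the printed form of a Python dict (single-quoted keys), which is not JSON. If that holds for the whole string, `ast.literal_eval` from the standard library may be able to parse it. A minimal sketch, assuming the text is one complete dict literal; the `raw` value below is a made-up stand-in, not real loader output:

```python
import ast
import json

# Hypothetical stand-in for document.text: a Python-style dict literal
# with single-quoted keys, similar in shape to the Airtable output above.
raw = "{'Source': 'Elon Musk by Ashlee Vance\\n\\n', 'Quotes': 'Everything we did was a function of our burn rate.\\n', 'People (Raw)': ['Elon Musk']}"

# json.loads rejects single-quoted property names.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval evaluates Python literals only (dicts, lists, strings, ...).
record = ast.literal_eval(raw)
quote = record["Quotes"].strip()

print(parsed_as_json)
print(quote)
```

If the text is only a fragment of a dict, `ast.literal_eval` will raise `SyntaxError` or `ValueError`, and a regex is the fallback.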
Right OK...

What i don't understand is that when I try to use regex on the docs variable, I am getting this:

TypeError: expected string or bytes-like object, got 'list'

I can't see anything in the code snippet you linked about the documents variable and its structure?
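For context, that `TypeError` is what `re` raises when it is handed the whole list rather than one string. A small sketch of the difference, using a hypothetical `Doc` class as a stand-in for the loader's document objects (only a `.text` attribute is assumed, and the sample texts are invented):

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loader document; only .text is assumed."""
    text: str


documents = [
    Doc(text="{'Source': 'Book A', 'Quotes': 'First quote.'}"),
    Doc(text="{'Source': 'Book B', 'Quotes': 'Second quote.'}"),
]

pattern = r"'Quotes': '(.*?)'"

# Passing the list itself fails: re functions expect a str or bytes.
try:
    re.findall(pattern, documents)
    list_worked = True
except TypeError:
    list_worked = False

# Looping gives one string per document, which re can search.
quotes = []
for doc in documents:
    quotes.extend(re.findall(pattern, doc.text))

print(list_worked)
print(quotes)
```

So `documents` is a list of objects, and the string to search lives on each item's `.text`.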
Right! Woop woop! I've done this like this

import re

pattern = r"'Quotes': '(.*?)'"

# Iterate over the documents
for document in documents:
    # Extract the text from the document
    text = document.text
    # Find all matches of the pattern in the text
    matches = re.findall(pattern, text)
    # Print the extracted quotes
    for quote in matches:
        print(quote)
        print("\n")


Now I need to make every quote a node... I'm a bit confused how I should be treating doc id here.. Wdyt?
Nice progress with the regex! 😎

Doc id in the node can be blank I think (it should get auto generated right?)
If the quotes are all shorter than your chunk size limit, you can just pass the documents into the index and it'll all happen automatically too
Thanks... what do you mean re being able to pass documents into index?

Where I've got to now is that I've done the below to now assign all the quotes to nodes. When I print the nodes, I get this:

Node(text='Outsiders are not merely free but compelled to make things that are cheap and lightweight. And both are good bets for growth: cheap things spread faster, and lightweight things evolve faster.\n', doc_id='2d4e9bd2-a5d3-43b5-9d57-91d9edbc1c6c', embedding=None, doc_hash='52501fee8b2820128f569d0d6b4d6d14411e94a3aba77ffb53c2ab1b7bae905b', extra_info=None, node_info=None, relationships={})

This is what I wanted because now I just have the text, rather than all the other bits of info the airtable loader was returning.

But now when I try to query the nodes, nothing happens. It just keeps loading.

Is this because I am not setting the node relationships in my code?

pattern = r"'Quotes': '(.*?)'"

# Collect nodes from every document in one list
nodes = []
for document in documents:
    text = document.text
    matches = re.findall(pattern, text)
    for quote in matches:
        # One node per extracted quote
        nodes.append(Node(text=quote))

print(nodes)
Oh hold on actually I think there was just a problem with OpenAI...

But do I need to set the relationships?
In my case there actually isn't any relationship between the quotes. They are all random. So I'm quite fine with the query not taking relationships into account.
No need to set the relationships 👌

And when I said pass the documents into the index, you could just do index = GPTVectorStoreIndex.from_documents(documents)

If you create the nodes, then it will look like index = GPTVectorStoreIndex(nodes)
Thanks. But the issue with doing documents is that I was getting all of the columns from airtable which was far more than just the quotes themselves?
Right, but you can change the text in the documents using that regex loop you had
Or you can create new documents too (which is easier than nodes tbh)
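The two options above (rewriting each document's text in place, or building fresh documents) might look roughly like this. It is a sketch only: `Doc` is a hypothetical stand-in so the snippet runs without the library, and the sample texts are invented. With real loader output, the resulting list would be what gets passed to GPTVectorStoreIndex.from_documents.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for the library's document class."""
    text: str


pattern = r"'Quotes': '(.*?)'"

documents = [
    Doc(text="{'Source': 'Book A', 'Quotes': 'First quote.', 'People (Raw)': ['A']}"),
    Doc(text="{'Source': 'Book B', 'People (Raw)': ['B']}"),  # no Quotes field
]

# Option: build new documents that hold only the quote text.
quote_docs = []
for doc in documents:
    match = re.search(pattern, doc.text)
    if match:  # skip records without a quote
        quote_docs.append(Doc(text=match.group(1).strip()))

# Option: overwrite the text in place, leaving unmatched documents untouched.
for doc in documents:
    match = re.search(pattern, doc.text)
    if match:
        doc.text = match.group(1).strip()

print([d.text for d in quote_docs])
```

Building new documents has the side benefit of dropping records with no quote, whereas the in-place version keeps them with their original text.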
Got it thank you. I have no tech background so sometimes require double explanations!

Ty for all your help! Mega kind & not taken for granted
Haha no worries! Happy to help 👌🫡