Modify documents

At a glance

The community member wants to extract only the quotes from documents loaded with the Airtable loader. The document text is not valid JSON, so json.loads fails, and the community member turns to a regex instead. After some trial and error, the community member extracts the quotes and creates a node for each one.

The community members discuss several approaches, including passing the documents directly into an index, and whether relationships need to be set between the nodes. The consensus is that because the quotes are independent, no relationships are needed, and the community member can simply pass the nodes into the index.

There is no explicitly marked answer, but the community members provide helpful guidance and suggestions to the original poster.

tthomoliverz

hey team - would love some help here...

the airtable loader is giving me documents like this.

', 'Areas of Improvement': ['Making and changing plans'], 'Source': 'Elon Musk by Ashlee Vance\n\n', 'Quotes': 'Musk also trained employees to make the right trade-offs between spending money and productivity… ‘He would say that everything we did was a function of our burn rate and that we were burning through a hundred thousand dollars per day… Sometimes he wouldn’t let you buy a part for two thousand dollars because he expected you to find it cheaper or invent something cheaper. Other times, he wouldn’t flinch at renting a plane for ninety thousand dollars to get something to Kwaj because it saved an entire workday, so it was worth it. He would place this urgency that he expected the revenue in ten years to be ten million dollars a day and that every day we were slower to achieve our goals was a day of missing out on that money.’\n', 'People (Raw)': ['Elon Musk']}},

I want to get each node to be just the quote. Anyone got any idea how to do that in python? I am trying to do it but am being told documents is not subscriptable..

22 comments

LLogan M

If you have a list of documents, you can do something like this

import json

for doc in documents:
  text = json.loads(doc.text)
  doc.text = text['Quotes']

Not 100% sure that will work, but something like that maybe? If the json.loads doesn't work you might have to do some raw string parsing instead, using split(), or regex

LLogan M

The document structure is detailed here in the code

https://github.com/jerryjliu/llama_index/blob/975b34621a5fa66e337996a24a671f929fd5cdc2/llama_index/schema.py#L25

tthomoliverz

Hmm, I get this when trying:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)

LLogan M

Yeaaaa so it's not valid json then 🥲

tthomoliverz

So what is it ..?

LLogan M

Just a string haha

LLogan M

So you'll need to use either a bunch of splits (kinda risky) or a regex (probably safer) to extract the section you want from the string

tthomoliverz

Gosh OK. So the documents variable is just a string?

LLogan M

document.text is just a string yea 😅

if it was formatted as proper json, the loads() thing would work to convert the string, but it seems like it's not proper json format
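
As an aside on the point above: the text looks like the printed form of Python dicts (single-quoted keys), which is why json.loads rejects it. A minimal sketch of one alternative to splits or regex, assuming the whole text is a valid Python literal, is ast.literal_eval from the standard library. The sample string and its 'id' / 'fields' layout below are hypothetical, not the real loader output.

```python
import ast
import json

# Hypothetical sample shaped like the loader output quoted in the thread:
# a Python-style repr of a list of records, not JSON.
raw_text = (
    "[{'id': 'rec1', 'fields': {'Source': 'Book A', "
    "'Quotes': 'Cheap things spread faster.\\n'}}, "
    "{'id': 'rec2', 'fields': {'Source': 'Book B', "
    "'Quotes': \"It's worth it.\\n\"}}]"
)

# json.loads rejects single-quoted keys, which matches the error in the thread.
try:
    json.loads(raw_text)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval only evaluates Python literals, so no arbitrary code runs.
records = ast.literal_eval(raw_text)

# Keep just the quote text from each record that has one.
quotes = [
    record["fields"]["Quotes"].strip()
    for record in records
    if "Quotes" in record["fields"]
]

print(parsed_as_json)
print(quotes)
```

If the real text is truncated or otherwise not a complete literal, literal_eval raises an error and a regex is the fallback. It does cope with quotes that contain apostrophes, which a pattern anchored on single quotes would miss.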

tthomoliverz

Right OK...

What i don't understand is that when I try to use regex on the docs variable, I am getting this:

TypeError: expected string or bytes-like object, got 'list'

i can't see in the code snippet you linked the info on the documents variable & its structure?
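
For context on that TypeError: the loader returns a list of documents, and re.findall only accepts a single string. A small sketch, using plain strings as a stand-in for each document's text (the sample strings are made up for illustration):

```python
import re

pattern = r"'Quotes': '(.*?)'"

# Stand-in for the loader output: one text string per document (hypothetical data).
texts = [
    "{'Source': 'Book A', 'Quotes': 'First quote.'}",
    "{'Source': 'Book B', 'Quotes': 'Second quote.'}",
]

# Passing the whole list to re.findall raises TypeError, as in the thread.
try:
    re.findall(pattern, texts)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Applying the pattern to each string in turn works.
quotes = [quote for text in texts for quote in re.findall(pattern, text)]

print(raised_type_error)
print(quotes)
```

With real documents, the string to search would be each document's text attribute rather than the document object itself.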

tthomoliverz

Right! Woop woop! I've done this like this

import re

pattern = r"'Quotes': '(.*?)'"

# Iterate over the documents
for document in documents:
    # Extract the text from the document
    text = document.text
    
    # Find all matches of the pattern in the text
    matches = re.findall(pattern, text)
    
    # Print the extracted quotes
    for quote in matches:
        print(quote)
        print("\n")

Now I need to make every quote a node... I'm a bit confused how I should be treating doc id here.. Wdyt?

LLogan M

Nice progress with the regex! 😎

Doc id in the node can be blank I think (it should get auto generated right?)

LLogan M

If the quotes are all shorter than your chunk size limit, you can just pass the documents into the index and it'll all happen automatically too

tthomoliverz

Thanks... what do you mean re being able to pass documents into index?

Where I've got to now is that I've done the below to now assign all the quotes to nodes. When I print the nodes, I get this:

Node(text='Outsiders are not merely free but compelled to make things that are cheap and lightweight. And both are good bets for growth: cheap things spread faster, and lightweight things evolve faster.\n', doc_id='2d4e9bd2-a5d3-43b5-9d57-91d9edbc1c6c', embedding=None, doc_hash='52501fee8b2820128f569d0d6b4d6d14411e94a3aba77ffb53c2ab1b7bae905b', extra_info=None, node_info=None, relationships={})

This is what I wanted because now I just have the text, rather than all the other bits of info the airtable loader was returning.

But now when I try to query the nodes, nothing happens. It just keeps loading.

Is this because I am not setting the node relationships in my code?

import re

pattern = r"'Quotes': '(.*?)'"

# Build one node per quote, collecting matches from every document
# (not just the last one). Node is assumed to be imported already.
nodes = []
for document in documents:
    for quote in re.findall(pattern, document.text):
        nodes.append(Node(text=quote))

print(nodes)

tthomoliverz

Oh hold on actually I think there was just a problem with OpenAI...

But do I need to set the relationships?

tthomoliverz

In my case there actually isn't any relationship between each quote. They are all random. So I'm quite fine for the query not to take relationships into account.

LLogan M

No need to set the relationships 👌

And when I said pass the documents into the index, you could just do index = GPTVectorStoreIndex.from_documents(documents)

If you create the nodes, then it will look like index = GPTVectorStoreIndex(nodes)

tthomoliverz

Thanks. But the issue with doing documents is that I was getting all of the columns from airtable which was far more than just the quotes themselves?

LLogan M

Right, but you can change the text in the documents using that regex loop you had

LLogan M

Or you can create new documents too (which is easier than nodes tbh)
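
A rough sketch of that "new documents" route, assuming the same regex as earlier in the thread. The helper name extract_quotes and the sample text are hypothetical; the llama_index calls are left as comments so the snippet runs on its own.

```python
import re

QUOTE_PATTERN = re.compile(r"'Quotes': '(.*?)'")


def extract_quotes(texts):
    """Return every quote found across a list of raw loader text strings."""
    quotes = []
    for text in texts:
        quotes.extend(QUOTE_PATTERN.findall(text))
    return quotes


# Hypothetical stand-in for [doc.text for doc in documents].
sample_texts = [
    "{'Source': 'Book A', 'Quotes': 'First quote.'}, "
    "{'Source': 'Book B', 'Quotes': 'Second quote.'}",
]

quotes = extract_quotes(sample_texts)
print(quotes)

# With llama_index installed, each quote could then become its own document:
#   new_documents = [Document(text=quote) for quote in quotes]
#   index = GPTVectorStoreIndex.from_documents(new_documents)
```

This keeps only the quote text, so the other Airtable columns never reach the index.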

tthomoliverz

Got it thank you. I have no tech background so sometimes require double explanations!

Ty for all your help! Mega kind & not taken for granted

LLogan M

Haha no worries! Happy to help 👌🫡
