This is absolutely brilliant. Thank you very much for this very informative video.
@sakilansari9511 9 months ago
This is a great video, and thanks for providing such fantastic content. Can we use an open-source LLM like Llama 2 for question answering over tabular/CSV data with the above approach?
@falven 9 months ago
So clever! Thank you so much for sharing.
@austinmw89 9 months ago
Would love to see a similar example with SQLite instead of DataFrames, and with batching/async optimizations.
@Miionu 9 months ago
Great video! I really love your way of presenting, and I am looking forward to the next one :) I have a few questions though:
1. Would it be possible to avoid embedding the user query twice, to optimize time to response? It seems the embedding is computed twice, once for the table index and once for the row index.
2. What would be a good strategy if the initial user query requires two different SQL queries to be answered, or if the SQL is a bit more complex?
3. What if the datasets have millions of rows? It looks like you are building the row index over all rows, which might take a lot of time and memory, and might slow down the time to answer in the long run, no?
@THEGlab 2 months ago
I want to know this as well.
@chrismaley6676 8 months ago
Wonderful video, thank you. Would the text-to-Pandas technique be a good approach when you only have access to stored procedures?
@zhang73 7 months ago
This is great, thanks for sharing, Jerry. Quick question: as I understand it, for the DataFrame you are inputting it into the API, and for SQL you are giving it some example rows so it can write the code, right? Just wondering if there is an option to "mask" the tabular data, or use dummy values, before it goes into the cloud. Tables and DBs often contain company-sensitive data, and you can't send any of it to the cloud. Obviously we can "scramble" it ourselves first, but a built-in option to anonymize the data before sending it to OpenAI would make enterprise app design easier. I don't think GPT needs real data to write the code. I could be wrong, though.
@MutumaIan 1 month ago
Did you get a solution for this?
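A minimal sketch of the "scramble it ourselves first" idea from the question above: replace sensitive values with stable pseudonyms before any rows leave your environment. The column names and masking scheme are illustrative, not something from the video.

```python
import hashlib

def mask(value: str, prefix: str = "CUST") -> str:
    # Deterministic pseudonym: the same input always maps to the same token,
    # so joins and group-bys still line up after masking
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{digest}"

row = {"customer": "Acme Corp", "revenue": 125000}
safe_row = {**row, "customer": mask(row["customer"])}
print(safe_row)  # only the masked version is ever sent to the LLM
```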
@JoyAwad-tn9ge 7 months ago
Thank you for the clarity! But I have a question: should the data be preprocessed?
@GowriKS-yl6lc 6 months ago
Thank you for a very insightful tutorial! I have a scenario where some tables have multiple date or datetime columns. When a date-related question is asked, the correct date column is not picked up. Is there a way to steer the column-selection step toward the right column?
@chrisronaldo9554 5 months ago
Will this method work for tables with interlinked data, with primary and foreign keys, where the final result is obtained by joining two or more tables? Asking from the POV of financial data containing user info, employee info, and product info, all interlinked, where retrieval requires a fairly complex SQL query. Thanks in advance.
@kingzao9046 5 months ago
For the last part, if you already retrieved example rows from the vector DB, why look in the SQL database again? Or is it that the vector DB may not always return the correct row?
@charlene1581 5 months ago
Might be a dumb question, but: is the built-in SQLTableRetrieverQueryEngine basically a query pipeline that's already prepared and assembled? Is there a way to customize some of the components in the built-in pipeline, or do I need to build a custom query pipeline myself? I'm dealing with a very large database with too many tables that are too large… trying to find a better way to do table retrieval to improve the generated SQL query.
@jonashinrichs9825 9 months ago
How can I use this with a custom LLM from Hugging Face?
@chidy7570 3 months ago
For the text-to-Pandas part: is there a way to prevent it from answering generic, unrelated questions like "What is the meaning of life?"?
@kingzao9046 5 months ago
Also, does the Pandas pipeline have the same lack of robustness with names like B.I.G.?
@santhoshvaraprasaduppuluri6587 4 months ago
Can we implement this in a Databricks model-serving chatbot?
@mdsohrabakhtaremam9555 3 months ago
Can we upload multiple CSV files when going with the text-to-Pandas approach?
@KetakiGadgil_15 7 months ago
1. How can I load more than one DataFrame so that they interact with each other through a common column?
2. And how can I customize the instructions for a given DataFrame?
@sharankumar31 4 months ago
I have a small number of rows and a single CSV file. Is it fine to go with the Pandas method itself, or is SQL better? What do you recommend?
@NA-so5by 9 months ago
Is there a way I can pass a CSV or XML file as input to the LLM and get back a new column with a sentiment score for one of the CSV's columns? I'm looking for a way to pass the whole CSV file and make a single LLM call, to avoid the cost of one call per comment.
@kelvinadungosi1579 6 months ago
This is awesome. How do I integrate it with a chat engine?
@brijeshsingh07 8 months ago
My question is on row indexing: you are indexing every row into a vector database, which is a very costly operation. What if the table contains billions of records? That would create a lot of storage overhead. Do you think this is the only option, or is there a better way to do it?
@MichelHabib 7 months ago
It is important to optimize your pipeline for your use case. It is good to generalize, but I think getting 90% of the answers is a great win. If you have tables with billions of records that all need to be queryable, that is a serious problem requiring appropriate hardware and architecture; having billions of rows means your business operates at a scale far beyond a few dollars of embedding budget. Now, on to the solutions. You could do a number of things (see the sketch after this list):
1. Filter only the data you expect to query: tables, rows, columns.
2. Summarize and aggregate the data where possible.
3. Index only what you need. You can index primary/foreign keys, which you already need to do to query billions of records in a SQL/NoSQL DB.
4. When indexing key values, get the unique values first, then index those.
5. Use your own embedding model instead of farming it out to external APIs; it's faster and cheaper.
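A rough sketch of points 3-5 from the list above, assuming a local sentence-transformers model; the file and column names are illustrative:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Local embedding model: no per-token API cost, and the data stays in-house
model = SentenceTransformer("all-MiniLM-L6-v2")

df = pd.read_csv("transactions.csv")  # stand-in for a real table

# Index only what you need: the distinct values of a key/filter column,
# not every one of the billions of rows
unique_values = df["customer_name"].dropna().unique().tolist()
embeddings = model.encode(unique_values, batch_size=256)
print(embeddings.shape)  # (n_unique_values, embedding_dim)
```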
@austinmw89 9 months ago
Not sure I understand the text-to-SQL prompt. Why does it ask the LLM to generate the SQLQuery, SQLResult, and Answer all at the same time? Shouldn't it generate the query first, run it to get a result, then generate the answer given the query result?
@oguzhanakkurt3947 8 months ago
When the code runs, it actually does it the way you said. For example, I tried it on my local machine with my own dataset:

> Running module input with input:
query: What is the service with the highest number of customers in the first month of 2024?

> Running module table_retriever with input:
input: What is the service with the highest number of customers in the first month of 2024?

> Running module table_output_parser with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
table_schema_objs: [SQLTableSchema(table_name='monthly_customer_service_data', context_str='The table provides information about different services including Google, Apple, Spotify, etc. for the month of October 2021. I...

> Running module text2sql_prompt with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
schema: Table 'monthly_customer_service_data' has columns: Date (VARCHAR), Service (VARCHAR), Customer_Quantity_Monthly (INTEGER), Turnover_per_Customer_Monthly (INTEGER), Category (VARCHAR), and foreign keys...

> Running module text2sql_llm with input:
messages: Given an input question, first create a syntactically correct sqlite query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return...

> Running module sql_output_parser with input:
response: assistant: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
SQLResult: ('TV+', 1205392)
Answer: The service with the highest number of custom...

> Running module sql_retriever with input:
input: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'

> Running module response_synthesis_prompt with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
sql_query: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
context_str: [NodeWithScore(node=TextNode(id_='6c280c68-69bb-495b-9a6b-911cf4a42849', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[('Lifebox...

> Running module response_synthesis_llm with input:
messages: Given an input question, synthesize a response from the query results.
Query: What is the service with the highest number of customers in the first month of 2024?
SQL: SELECT Service, MAX(Customer_Qua...
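The trace above shows the LLM emitting SQLQuery, SQLResult, and Answer in a single completion, with the sql_output_parser step keeping only the SQL before it runs against the real database. A minimal sketch of that kind of parsing step (illustrative, not LlamaIndex's actual implementation):

```python
def parse_sql_from_response(response: str) -> str:
    """Keep only the SQL from a SQLQuery/SQLResult/Answer-style completion."""
    sql = response.strip()
    if sql.startswith("assistant:"):
        sql = sql[len("assistant:"):]          # drop the chat-role prefix
    if "SQLQuery:" in sql:
        sql = sql.split("SQLQuery:", 1)[1]     # drop any preamble
    if "SQLResult:" in sql:
        sql = sql.split("SQLResult:", 1)[0]    # drop the guessed result/answer
    return sql.strip()

raw = (
    "assistant: SELECT Service, MAX(Customer_Quantity_Monthly) "
    "FROM monthly_customer_service_data WHERE Date = '2024-01-01'\n"
    "SQLResult: ('TV+', 1205392)\n"
    "Answer: The service with the highest number of customers is TV+."
)
print(parse_sql_from_response(raw))  # just the SELECT, ready to execute
```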
@deveshmishra5245 7 months ago
Where can I find part 1 of this tutorial?
@ragsAI 9 months ago
Nice!!
@straylight7116 7 months ago
Will the same strategies work with a Claude 3 backend?
@straylight7116 7 months ago
It worked.
@wanyingqian5065 7 months ago
The Colab file doesn't work. In the data-download part, the CSV dataset link returns "404: Not Found".
@renaudgg 9 months ago
You are indexing all tables, but what if 1-2 of my 4 tables have 2 million rows each? It takes so long... any tips here?
@LlamaIndex 9 months ago
For vector indexing, you can try indexing batches/chunks of rows instead of each row individually. The goal here is really just to surface some coarse few-shot examples that help the text-to-SQL prompt.
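A rough sketch of the chunked-row idea, assuming llama-index 0.9.x (the imports moved in v0.10); the chunk size and file name are illustrative:

```python
import pandas as pd
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

df = pd.read_csv("my_table.csv")
CHUNK_SIZE = 100  # rows per embedded node, instead of one node per row

nodes = [
    TextNode(text=df.iloc[start:start + CHUNK_SIZE].to_string(index=False))
    for start in range(0, len(df), CHUNK_SIZE)
]

index = VectorStoreIndex(nodes)
# Retrieve a couple of coarse few-shot row examples for the text-to-SQL prompt
retriever = index.as_retriever(similarity_top_k=2)
```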
@renaudgg 9 months ago
Yes, I agree and understand, but is my problem the fact that I'm trying to index a SQL table? You had your data in tons of CSV files. Is that the trick, should I save my SQL table to CSVs first?
@lloganm1234 9 months ago
@renaudgg You can also increase the batch size on your embedding model, for example embed_model = OpenAIEmbedding(.., embed_batch_size=1000)
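Spelled out a little more, still assuming llama-index 0.9.x; the node contents are placeholders:

```python
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.schema import TextNode

# Send up to 1000 texts per embedding request instead of the small default,
# which cuts API round-trips when indexing a large number of rows
embed_model = OpenAIEmbedding(embed_batch_size=1000)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

nodes = [TextNode(text=f"row {i}") for i in range(10_000)]  # stand-in for real row nodes
index = VectorStoreIndex(nodes, service_context=service_context)
```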
@B0tch0 9 months ago
Avoid using OpenAI and run your model "locally", although this video/code doesn't touch on any of that.
@ayushgoel7772 9 months ago
While running ObjectIndex, it throws an openai.NotFoundError.
@LlamaIndex 9 months ago
What version are you on? This was made before our v0.10 release today, so make sure you `pip install llama-index==0.9.45.post1` (Colab: colab.research.google.com/drive/1fRkgSn2PSlXSMgLk32beldVnLMLtI1Pc?usp=sharing)
@ayushgoel7772 9 months ago
@LlamaIndex I am still getting the same error, even after using the version you specified.