This is absolutely brilliant. Thank you very much for this very informative video.
@sakilansari9511 9 months ago
This is a great video, and thanks for providing such fantastic content. Can we use an open-source LLM like Llama 2 for question answering over tabular/CSV data with the above approach?
@falven 9 months ago
So clever! Thank you so much for sharing.
@austinmw89 9 months ago
Would love to see a similar example with SQLite instead of DataFrames, and with batching/async optimizations.
@Miionu 9 months ago
Great video! I really love your way of presenting, and I am looking forward to the next one :) I have a few questions though:
1. Would it be possible to avoid embedding the user query twice, to optimize time to response? It seems the embedding is computed twice, once for the table index and once for the row index.
2. What would be a good strategy if the initial user query requires two different SQL queries to be answered, or if the SQL is a bit more complex?
3. What if the datasets have millions of rows? It looks like you are building the row index over all rows, which might take a lot of time and memory, and might slow down the time to answer in the long run, no?
@THEGlab 2 months ago
I want to know this as well.
@chrismaley6676 8 months ago
Wonderful video, thank you. Would the text-to-Pandas technique be a good approach when you only have access to stored procedures?
@zhang73 7 months ago
This is great, thanks for sharing, Jerry. Quick question: as I understand it, for the DataFrame you are inputting it into the API, and for SQL you are giving it some example rows so it can write the code, right? Just wondering if there is an option to "mask" the tabular data, or use dummy values, before it goes into the cloud. Tables and DBs often contain company-sensitive data, and you can't send any of it to the cloud. Obviously we can "scramble" it ourselves first, but a built-in option to anonymize the data before sending it to OpenAI would make enterprise app design easier. I don't think GPT needs real data to write the code. I could be wrong, though.
@MutumaIan 1 month ago
Did you get a solution for this?
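A minimal sketch of the "scramble it ourselves first" idea from the question above: replace sensitive values with stable pseudonyms before any rows leave your environment. The column names and masking scheme are illustrative, not something from the video.

```python
import hashlib

def mask(value: str, prefix: str = "CUST") -> str:
    # Deterministic pseudonym: the same input always maps to the same token,
    # so joins and group-bys still line up after masking
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{digest}"

row = {"customer": "Acme Corp", "revenue": 125000}
safe_row = {**row, "customer": mask(row["customer"])}
print(safe_row)  # only the masked version is ever sent to the LLM
```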
@JoyAwad-tn9ge 7 months ago
Thank you for the clarity! But I have a question: should the data be preprocessed?
@GowriKS-yl6lc 6 months ago
Thank you for a very insightful tutorial! I have a scenario where some tables have multiple date or datetime columns. When a date-related question is asked, the correct date column is not picked up. Is there a way to steer the column-selection step toward the right column?
@chrisronaldo9554 5 months ago
Will this method work for tables with interlinked data, with primary and foreign keys, where the final result is obtained by joining two or more tables? Asking from the POV of financial data containing user info, employee info, and product info, all interlinked, where retrieval requires a fairly complex SQL query. Thanks in advance.
@kingzao9046 5 months ago
For the last part, if you already retrieved example rows from the vector DB, why look in the SQL database again? Or is it that the vector DB may not always return the correct row?
@charlene1581 5 months ago
Might be a dumb question, but: is the built-in SQLTableRetrieverQueryEngine basically a query pipeline that's already prepared and assembled? Is there a way to customize some of the components in the built-in pipeline, or do I need to build a custom query pipeline myself? I'm dealing with a very large database with too many tables that are too large… trying to find a better way to do table retrieval to improve the generated SQL query.
@jonashinrichs9825 9 months ago
How can I use this with a custom LLM from Hugging Face?
@chidy7570 3 months ago
For the text-to-Pandas part: is there a way to prevent it from answering generic, unrelated questions like "What is the meaning of life?"?
@kingzao9046 5 months ago
Also, does the Pandas pipeline have the same lack of robustness with names like B.I.G.?
@santhoshvaraprasaduppuluri6587 4 months ago
Can we implement this in a Databricks model-serving chatbot?
@mdsohrabakhtaremam9555 3 months ago
Can we upload multiple CSV files when going with the text-to-Pandas approach?
@KetakiGadgil_15 7 months ago
1. How can I load more than one DataFrame so that they interact with each other through a common column?
2. And how can I customize the instructions for a given DataFrame?
@sharankumar31 4 months ago
I have a small number of rows and a single CSV file. Is it fine to go with the Pandas method itself, or is SQL better? What do you recommend?
@NA-so5by 9 months ago
Is there a way I can pass a CSV or XML file as input to the LLM and get back a new column with a sentiment score for one of the CSV's columns? I'm looking for a way to pass the whole CSV file and make a single LLM call, to avoid the cost of one call per comment.
@kelvinadungosi1579 6 months ago
This is awesome. How do I integrate it with a chat engine?
@brijeshsingh07 8 months ago
My question is on row indexing: you are indexing every row into a vector database, which is a very costly operation. What if the table contains billions of records? That would create a lot of storage overhead. Do you think this is the only option, or is there a better way to do it?
@MichelHabib 7 months ago
It is important to optimize your pipeline for your use case. It is good to generalize, but I think getting 90% of the answers is a great win. If you have tables with billions of records that all need to be queryable, that is a serious problem requiring appropriate hardware and architecture; having billions of rows means your business operates at a scale far beyond a few dollars of embedding budget. Now, on to the solutions. You could do a number of things (see the sketch after this list):
1. Filter only the data you expect to query: tables, rows, columns.
2. Summarize and aggregate the data where possible.
3. Index only what you need. You can index primary/foreign keys, which you already need to do to query billions of records in a SQL/NoSQL DB.
4. When indexing key values, get the unique values first, then index those.
5. Use your own embedding model instead of farming it out to external APIs; it's faster and cheaper.
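A rough sketch of points 3-5 from the list above, assuming a local sentence-transformers model; the file and column names are illustrative:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Local embedding model: no per-token API cost, and the data stays in-house
model = SentenceTransformer("all-MiniLM-L6-v2")

df = pd.read_csv("transactions.csv")  # stand-in for a real table

# Index only what you need: the distinct values of a key/filter column,
# not every one of the billions of rows
unique_values = df["customer_name"].dropna().unique().tolist()
embeddings = model.encode(unique_values, batch_size=256)
print(embeddings.shape)  # (n_unique_values, embedding_dim)
```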
@austinmw89 9 months ago
Not sure I understand the text-to-SQL prompt. Why does it ask the LLM to generate the SQLQuery, SQLResult, and Answer all at the same time? Shouldn't it generate the query first, run it to get a result, then generate the answer given the query result?
@oguzhanakkurt3947 8 months ago
When the code runs, it actually does it the way you said. For example, I tried it on my local machine with my own dataset:

> Running module input with input:
query: What is the service with the highest number of customers in the first month of 2024?

> Running module table_retriever with input:
input: What is the service with the highest number of customers in the first month of 2024?

> Running module table_output_parser with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
table_schema_objs: [SQLTableSchema(table_name='monthly_customer_service_data', context_str='The table provides information about different services including Google, Apple, Spotify, etc. for the month of October 2021. I...

> Running module text2sql_prompt with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
schema: Table 'monthly_customer_service_data' has columns: Date (VARCHAR), Service (VARCHAR), Customer_Quantity_Monthly (INTEGER), Turnover_per_Customer_Monthly (INTEGER), Category (VARCHAR), and foreign keys...

> Running module text2sql_llm with input:
messages: Given an input question, first create a syntactically correct sqlite query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return...

> Running module sql_output_parser with input:
response: assistant: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
SQLResult: ('TV+', 1205392)
Answer: The service with the highest number of custom...

> Running module sql_retriever with input:
input: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'

> Running module response_synthesis_prompt with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
sql_query: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
context_str: [NodeWithScore(node=TextNode(id_='6c280c68-69bb-495b-9a6b-911cf4a42849', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[('Lifebox...

> Running module response_synthesis_llm with input:
messages: Given an input question, synthesize a response from the query results.
Query: What is the service with the highest number of customers in the first month of 2024?
SQL: SELECT Service, MAX(Customer_Qua...
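The trace above shows the LLM emitting SQLQuery, SQLResult, and Answer in a single completion, with the sql_output_parser step keeping only the SQL before it runs against the real database. A minimal sketch of that kind of parsing step (illustrative, not LlamaIndex's actual implementation):

```python
def parse_sql_from_response(response: str) -> str:
    """Keep only the SQL from a SQLQuery/SQLResult/Answer-style completion."""
    sql = response.strip()
    if sql.startswith("assistant:"):
        sql = sql[len("assistant:"):]          # drop the chat-role prefix
    if "SQLQuery:" in sql:
        sql = sql.split("SQLQuery:", 1)[1]     # drop any preamble
    if "SQLResult:" in sql:
        sql = sql.split("SQLResult:", 1)[0]    # drop the guessed result/answer
    return sql.strip()

raw = (
    "assistant: SELECT Service, MAX(Customer_Quantity_Monthly) "
    "FROM monthly_customer_service_data WHERE Date = '2024-01-01'\n"
    "SQLResult: ('TV+', 1205392)\n"
    "Answer: The service with the highest number of customers is TV+."
)
print(parse_sql_from_response(raw))  # just the SELECT, ready to execute
```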
@deveshmishra5245 7 months ago
Where can I find part 1 of this tutorial?
@ragsAI 9 months ago
Nice!!
@straylight7116 7 months ago
Will the same strategies work with a Claude 3 backend?
@straylight7116 7 months ago
It worked.
@wanyingqian5065 7 months ago
The Colab file doesn't work. In the data-download part, the CSV dataset link returns "404: Not Found".
@renaudgg 9 months ago
You are indexing all tables, but what if 1-2 of my 4 tables have 2 million rows each? It takes so long... any tips here?
@LlamaIndex 9 months ago
For vector indexing, you can try indexing batches/chunks of rows instead of each row individually. The goal here is really just to surface some coarse few-shot examples that help the text-to-SQL prompt.
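A rough sketch of the chunked-row idea, assuming llama-index 0.9.x (the imports moved in v0.10); the chunk size and file name are illustrative:

```python
import pandas as pd
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

df = pd.read_csv("my_table.csv")
CHUNK_SIZE = 100  # rows per embedded node, instead of one node per row

nodes = [
    TextNode(text=df.iloc[start:start + CHUNK_SIZE].to_string(index=False))
    for start in range(0, len(df), CHUNK_SIZE)
]

index = VectorStoreIndex(nodes)
# Retrieve a couple of coarse few-shot row examples for the text-to-SQL prompt
retriever = index.as_retriever(similarity_top_k=2)
```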
@renaudgg 9 months ago
Yes, I agree and understand, but is my problem the fact that I'm trying to index a SQL table? You had your data in tons of CSV files. Is that the trick, should I save my SQL table to CSVs first?
@lloganm1234 9 months ago
@renaudgg You can also increase the batch size on your embedding model, for example embed_model = OpenAIEmbedding(.., embed_batch_size=1000)
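Spelled out a little more, still assuming llama-index 0.9.x; the node contents are placeholders:

```python
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.schema import TextNode

# Send up to 1000 texts per embedding request instead of the small default,
# which cuts API round-trips when indexing a large number of rows
embed_model = OpenAIEmbedding(embed_batch_size=1000)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

nodes = [TextNode(text=f"row {i}") for i in range(10_000)]  # stand-in for real row nodes
index = VectorStoreIndex(nodes, service_context=service_context)
```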
@B0tch0 9 months ago
Avoid using OpenAI and run your model "locally", although this video/code doesn't touch on any of that.
@ayushgoel7772 9 months ago
While running ObjectIndex, it throws an openai.NotFoundError.
@LlamaIndex 9 months ago
What version are you on? This was made before our v0.10 release today, so make sure you `pip install llama-index==0.9.45.post1` (Colab: colab.research.google.com/drive/1fRkgSn2PSlXSMgLk32beldVnLMLtI1Pc?usp=sharing)
@ayushgoel7772 9 months ago
@LlamaIndex I am still getting the same error, even after using the version you specified.