Summertime Sadness ft. PandasAI: Deep-Dive
Last Updated on July 17, 2023 by Editorial Team
Author(s): Soumyadip Mal
Originally published on Towards AI.
PandasAI: a potential game-changer for Data Engineers or just a gimmick? Let's find out!
Spent the entire day with this nifty little library, dismantled it bit by bit, and here are my 2 cents on it.
I'm not particularly good at prose, so I'll just approach my first "tech blog" the way I know: without any frills. Also, it's well past 3 am at the time of writing this, so you know. First things first:
1) What kind of LLMs are fuelling it? OpenAI-based models, Hugging Face LLMs (Falcon), StarCoder, and the PaLM suite of models from Google. I have only tried OpenAI so far. For OpenAI, the defaults are:
"temperature": 0,
"max_tokens": 512,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0.6,
"model": "gpt-3.5-turbo"
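If you want to set any of these explicitly, a minimal sketch looks like the following. I'm assuming here that the OpenAI wrapper simply forwards these keyword arguments to the chat completion call, so treat the parameter names as illustrative rather than gospel:

from pandasai.llm.openai import OpenAI

# Sketch: the documented defaults spelled out explicitly (assumes the wrapper
# accepts them as keyword arguments and passes them through to the API call)
llm = OpenAI(
    api_token="YOUR_OPENAI_API_KEY",
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=512,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0.6,
)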
2) Is it free? The base usage remains free, of course, but the conversational flavor with the OpenAI LLM requires an API key and hence comes at a cost: the cost of GPT-3.5, to be precise.
3) Under the hood: at a very abstract level, it can work either with a single dataframe or with a single list of dataframes. Taking the example of a single dataframe, it takes into consideration only the head with 5 rows. Internally, it forms a hardcoded prompt and sends it through the chat completion endpoint to get a Python-code representation of the input query/question. It then runs that Python code on the entirety of the original dataframe to get the desired result. The underlying assumption is that you don't really need to see the entire dataset to form a general query; the column names are quite enough. This also saves on token usage. As simple as that. A rough sketch of this flow follows.
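Here's my mental model of that loop, written out as a sketch. The prompt wording, the exec-based execution, and the "result" variable are simplifications of what the library actually does, and the call uses the pre-1.0 openai.ChatCompletion interface that was current at the time; treat it as an illustration, not PandasAI's source:

import openai
import pandas as pd

def build_prompt(df: pd.DataFrame, question: str) -> str:
    # Only df.head(5) goes into the prompt, never the full table.
    return (
        f"There is a dataframe with {len(df)} rows and {len(df.columns)} columns.\n"
        f"This is the result of print(df.head(5)):\n{df.head(5)}\n\n"
        f"Write Python code using pandas that answers: {question}"
    )

def ask(df: pd.DataFrame, question: str):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(df, question)}],
        temperature=0,
    )
    code = response["choices"][0]["message"]["content"]
    # The generated snippet is executed against the FULL dataframe,
    # so the LLM never needs to see more than the head.
    scope = {"df": df, "pd": pd}
    exec(code, scope)  # PandasAI cleans/sanitises the code before doing this
    return scope.get("result")  # assumes the code assigns its answer to `result`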
4) Data privacy? There's an "enforce_privacy" flag in the init call that can be set to True, so that the df.head() doesn't show up in the prompt. And since it never uploads your entire table anyway (at most the head, or only the column names with privacy enforced), there shouldn't be much concern around data privacy.
5) Is the output conversational? Yes, there's another init argument called "conversational" that defaults to False but can be set to True to give a more conversational type of answer.
6) Middleware? Those coming from Django will be familiar with this; it's the sort of thing we used to do with monkey-patching in years past. PandasAI also supports passing in a list of middleware classes as an input parameter, as sketched below.
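A hypothetical middleware that just logs the generated code before it runs might look like this. The base-class import path, the run(code) hook, and the middlewares keyword are my reading of the current version's API, so double-check against the docs:

from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
from pandasai.middlewares.base import Middleware

class LogCodeMiddleware(Middleware):
    """Hypothetical middleware: inspect or patch the generated code before it runs."""

    def run(self, code: str) -> str:
        print("LLM-generated pandas code:\n", code)
        return code  # return it unchanged (or rewrite it, Django-style)

llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")
pandas_ai = PandasAI(llm, middlewares=[LogCodeMiddleware()])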
Diving In:
Working with a single dataframe
With conversational=True (set the token in line 15 as your API key)
With conversational=False (set the token in line 15 as your API key)
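For anyone who can't see the screenshots, a minimal single-dataframe run looks roughly like this. The example data is made up and the exact figures in the original notebook differ:

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Made-up stand-in for the countries/GDP/happiness dataframe in the screenshots
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France"],
    "gdp": [19294482071552, 2891615567872, 2411255037952],
    "happiness_index": [6.94, 7.16, 6.66],
})

llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")  # the token referred to above

# conversational=False returns the bare answer; True wraps it in a sentence
pandas_ai = PandasAI(llm, conversational=False)
print(pandas_ai.run(df, prompt="Which is the happiest country?"))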
Prompts formed under the hood for the default non-conversational flavor with single df, with enforce_privacy set to False:
Prompts formed under the hood for the default non-conversational flavor with single df, with enforce_privacy set to True:
As you can see, only the column names are present in the prompt with privacy turned on; the actual data is discarded.
Working with a list of dataframes
Notice the additional dataframe in line 14, which is being passed in a list along with the original df in line 28. I have intentionally removed the happiness index column from the new df and modified the GDP of the United States. Also, the query now says "USA" instead of "United States". Herein lies the magic of LLMs: it was still able to form the query.
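Reconstructed as a sketch (again, hypothetical data; the point is simply that a list of dataframes goes in where a single df went before):

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

countries_df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France"],
    "gdp": [19294482071552, 2891615567872, 2411255037952],
    "happiness_index": [6.94, 7.16, 6.66],
})

# Second dataframe: no happiness_index column, tweaked GDP, "USA" instead of "United States"
gdp_df = pd.DataFrame({
    "country": ["USA", "United Kingdom", "France"],
    "gdp": [21000000000000, 2891615567872, 2411255037952],
})

llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")
pandas_ai = PandasAI(llm, conversational=False)

# Both frames are handed over together; the LLM has to work out which one to use
print(pandas_ai.run([countries_df, gdp_df], prompt="What is the GDP of USA?"))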
Prompts formed with enforce_privacy set to True:
Prompts formed with enforce_privacy set to False:
The privacy option failed for multiple dfs due to a small bug in the code (I have sent a pull request; let's see).
Optimisation Proposed:
Leaving aside the multiple-df case for now: for a single df with privacy enabled, prompts are a little shorter (owing to the lack of example data from the head) and thus cheaper, but for the sake of better prompting practice, let's keep it set to False for now.
pandas_ai = PandasAI(llm, verbose=True, conversational=False, enforce_privacy=False, enable_cache=False)
Input df:
Input query: Which is the happiest country?
Input request body: (clearly, it has way too many newline characters and other special characters, which add to the token count and, essentially, the cost incurred for each API call)
Response
So, 376 total tokens for a df with 10 rows and 3 columns. GPT-3.5-Turbo is priced at $0.002 per 1K tokens, so for 10,000 such hits we'd burn through around 3,760k tokens, which would cost us about 7.52 USD.
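The back-of-the-envelope arithmetic, spelled out:

tokens_per_call = 376
price_per_1k = 0.002      # USD per 1K tokens, gpt-3.5-turbo pricing at the time
calls = 10_000

total_tokens = tokens_per_call * calls            # 3,760,000 tokens
cost_usd = total_tokens / 1000 * price_per_1k     # 7.52 USD
print(total_tokens, cost_usd)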
However, if we slightly tweak the input prompt to hold the df in comma-separated (CSV) form instead of a string representation of head(5), here's what happens:
Input request body: (notice the comma-separated fields)
So much more compact.
Response:
So, a total of 340 tokens. That's roughly a 9.5% cost reduction overall!
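If you want to see for yourself where the saving comes from, here's a quick sketch on a made-up dataframe. tiktoken is only used here to count tokens and isn't part of PandasAI:

import pandas as pd
import tiktoken

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France"],
    "gdp": [19294482071552, 2891615567872, 2411255037952],
    "happiness_index": [6.94, 7.16, 6.66],
})

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

head_as_str = str(df.head(5))                  # what the original prompt embeds
head_as_csv = df.head(5).to_csv(index=False)   # the proposed, more compact form

# The CSV form drops the alignment whitespace, which is where the prompt shrinks
print(len(enc.encode(head_as_str)), len(enc.encode(head_as_csv)))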
For now, I have just made the change for the single-dataframe input. I will put in a PR for the multi-df list too.
Well, that was it, folks. I will try to spend some time with Scikit-LLM next. I quite liked the idea here: just send the df skeleton to the LLM, get the pandas-equivalent code, and then run it on the actual dataset for the result. Text to SQL and vice versa have been around for a while; this just takes it up a notch: Text to Pandas. This also opens up a lot of other avenues: SQL query optimization, performance tuning of code, etc. Now, pandas query optimization could very well be a thing too! Other Python libs, I imagine, will follow this trend too.
Thanks for reading!
Your feedback and questions are highly appreciated. You can find me on LinkedIn and Instagram or connect with me via Twitter @soumyadip_mal
Published via Towards AI