Unlocking Context-Aware Insights in Tabular Data with LLMs and LangChain

Ameer Hakme
13 min read · Apr 20, 2023


Build a Basic App for Interacting with Numerical Tabular Data through Natural Language Processing using LangChain agents and LLMs

Image generated by Stable Diffusion 2.1 base

In our data-centric world, gaining valuable insights from the massive amounts of information available is essential. Tabular data, a common format for organizing and presenting data across various fields, can be difficult to analyze, particularly for non-technical individuals. While Large Language Models (LLMs) have transformed natural language processing (NLP), they often lack context when working with custom data.

LangChain tackles these challenges by linking LLMs with custom data sources, allowing for more informed, context-aware responses. In this blog post, we’ll explore how LangChain and its integrated agents offer context-aware insights for tabular data using OpenAI’s text-davinci-003 language model and LangChain’s Pandas DataFrame Agent. We’ll focus on AWS Cost and Usage Report data, showcasing how to gain insights into cloud expenses via natural language queries. Keep in mind that other LLMs can be used as well: Amazon SageMaker JumpStart, for example, offers a machine learning (ML) hub with foundation models from popular providers such as Amazon, Cohere, AI21, and Stability AI that you can deploy with just a few clicks.

By integrating LLMs with LangChain, organizations can efficiently examine and analyze AWS Cost and Usage Reports (CUR). This approach enables users to pose natural language questions and obtain insights into AWS spending patterns. With LLMs and LangChain, organizations can streamline AWS cost monitoring and make well-informed decisions about resource allocation and usage optimization.

OpenAI API

The OpenAI API is a powerful tool for natural language processing tasks, allowing users to generate text, answer questions, and perform other language-related tasks. To access the API, sign up for OpenAI’s API service on their website and generate an API key. The key grants access to a range of language models, including GPT-4 and other models tailored to specific use cases.

LangChain

LangChain is a framework that improves LLMs for diverse applications. Although LLMs handle various tasks effectively, they may struggle in specialized domains like finance or engineering, where deep knowledge is essential. LangChain overcomes this limitation with a flexible set of tools, enabling developers to create tailored solutions that combine LLM strengths with domain expertise. This results in more accurate, nuanced responses for specialized tasks, making it a powerful and effective language-based tool.

What is a LangChain Agent?

Some applications may require more than a predetermined series of calls to large language models or other tools; instead, they need a flexible chain that depends on the user’s input. In these chains, an “agent” has access to a variety of tools. Based on the user input, the agent can decide which of these tools, if any, should be used. LangChain has created purpose-built agents for particular use cases, such as the Pandas DataFrame Agent, which uses a Python REPL tool to execute commands provided by the LLM.

What is LangChain Python REPL tool?

REPL stands for Read-Eval-Print Loop, an interactive programming environment that allows users to execute single lines of code and receive immediate feedback. In some cases, for complex calculations, it can be more efficient to have an LLM generate the code to compute the answer rather than produce the answer directly. To facilitate this, LangChain offers a simple Python REPL that executes commands, allowing users to obtain the results they need with ease.
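Conceptually, such a tool boils down to evaluating a string of Python against a namespace that contains the dataframe. Below is a minimal sketch of the idea (LangChain’s actual tool is AST-based and more careful than a raw `eval`); note how errors come back as text, which is exactly what an agent would observe:

```python
import pandas as pd

def python_repl_tool(command: str, df: pd.DataFrame):
    """Evaluate one Python expression with `df` in scope and return the
    result; errors are returned as text rather than raised."""
    try:
        return eval(command, {"df": df, "pd": pd})
    except Exception as exc:
        return repr(exc)

df = pd.DataFrame({"cost": [1.5, 2.5, 3.0]})
print(python_repl_tool("df['cost'].sum()", df))  # 7.0
print(python_repl_tool("df['missing'].sum()", df))  # a KeyError, as text
```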

How Do LangChain Agents Work with LLMs Behind the Scenes?

When a LangChain Agent receives input in the form of a question, it initiates an “Agent Executor” chain. Agent Executors manage the agent’s tools, determining which ones to call and in what sequence. In this case, the agent is the Pandas DataFrame Agent, and the tool is the Python REPL.

The Agent Executor chain comprises Thoughts, Actions, and Observations. Thoughts are driven by the LLM, which then provides an Action for the agent to access the Python REPL tool and execute a specific command. After the agent executes the command using the tool, an Observation is passed back to the LLM for evaluation. The LLM generates another Thought, followed by an Action, until it predicts that the agent and the tool have produced the correct answer.

Agent Executors provide flexibility in managing agents. For instance, they can limit the number of steps an agent takes, ensuring that agents do not go out of control and perform excessive actions. This capability is valuable in maintaining efficiency and preventing potential issues in the process.
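The Thought/Action/Observation loop and its iteration cap can be sketched in a few lines of Python. Here `scripted_llm` is a hypothetical stand-in for the real model, returning either an action for the tool or a final answer, and the `max_steps` guard plays the role of the executor’s step limit:

```python
import pandas as pd

df = pd.DataFrame({"Amazon SageMaker": [100.0, 90.0]})

# Hypothetical stand-in for the LLM: given the observations so far, it
# returns either ("action", <python command>) or ("final", <answer>).
def scripted_llm(observations):
    if not observations:
        return ("action", "df['Amazon SageMaker'].sum()")
    return ("final", f"The total cost is {observations[-1]}.")

def run_agent(llm, df, max_steps=15):
    observations = []
    for _ in range(max_steps):  # iteration cap: the agent cannot loop forever
        kind, payload = llm(observations)
        if kind == "final":
            return payload
        observations.append(eval(payload, {"df": df}))  # the "tool" call
    return "Agent stopped: iteration limit reached."

print(run_agent(scripted_llm, df))  # The total cost is 190.0.
```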

Flow Chart shows how LangChain agents interact with LLMs

Let’s start experimenting, shall we?

First, create a Conda environment with Python 3.9 and activate it:

conda create -n langchain_llm_venv python=3.9
conda activate langchain_llm_venv

Install the required dependencies:

pip install langchain
pip install pandas
pip install openai

Now, import the necessary libraries:

from langchain.llms import OpenAI
from langchain.agents import create_pandas_dataframe_agent
import pandas as pd

Define a function to call the OpenAI model while loading the data from the CSV file into a dataframe:

def run(api_key: str, prompt: str, local_file_path: str) -> str:
    df = pd.read_csv(local_file_path)

    agent = create_pandas_dataframe_agent(
        OpenAI(temperature=0, openai_api_key=api_key, model_name="text-davinci-003"),
        df,
        verbose=True,
    )

    response = agent.run(prompt)
    print(response)

    return response

The cost_report.csv file contains dummy data about AWS costs for all services used (columns) over a specific timeframe (rows), with each row representing the cost per day per service. You can manually download the cost report from the AWS Console by following the steps here.

Snapshot of the AWS Cost and Usage Report data stored in a cost_report.csv
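If you don’t have a report export handy, a tiny stand-in file in the same shape (dates as rows, services as columns) is enough to follow along. The services and numbers below are made up for illustration:

```python
import pandas as pd

# A tiny stand-in for cost_report.csv: one row per day, one column per
# service, costs in USD. All values are made up.
df = pd.DataFrame({
    "Date": ["2023-03-01", "2023-03-02", "2023-03-03"],
    "Amazon SageMaker": [210.5, 190.05, 175.2],
    "AWS CloudTrail": [1.1, 1.1, 1.2],
    "EC2 - Other": [45.0, 47.5, 44.8],
})
df["Total"] = df.drop(columns="Date").sum(axis=1)
df.to_csv("cost_report.csv", index=False)
print(pd.read_csv("cost_report.csv").columns.tolist())
# ['Date', 'Amazon SageMaker', 'AWS CloudTrail', 'EC2 - Other', 'Total']
```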

We can now ask the LLM about the cost of one of the services. Set the prompt to “what is the total cost of Sagemaker?”

# Call the run function
api_key = "<OpenAI API Key>"  # Replace this with your actual API key
prompt = "what is the total cost of Sagemaker?"  # Replace this with the prompt you want to use
local_file_path = "cost_report.csv"  # Replace this with the path to your local CSV file

response = run(api_key, prompt, local_file_path)

You can see from the output below how the agent has gone through iterations (chains) of “thoughts,” “observations,” and “actions” with the help of an LLM, in this case using the OpenAI text-davinci-003 model. The agent returns with a finished chain, or answer, stating that the total cost of Sagemaker is 17219.9127291594.

> Entering new AgentExecutor chain...
Thought: I need to find the column with the cost of Sagemaker
Action: python_repl_ast
Action Input: df.columns
Observation: Index(['Date', 'AWS Amplify', 'AWS CloudShell', 'AWS CloudTrail', 'AWS Config',
'AWS Firewall Manager',
.......
'Amazon SageMaker',
'CloudWatch Events', 'EC2 - Other', 'Total'],
dtype='object')
Thought:I need to find the total cost of Sagemaker
Action: python_repl_ast
Action Input: df['Amazon SageMaker'].sum()
Observation: 17219.9127291594
Thought:I now know the final answer
Final Answer: The total cost of Sagemaker is 17219.9127291594.

> Finished chain.
The total cost of Sagemaker is 17219.9127291594.

The Pandas DataFrame Agent is primarily optimized to query numeric tabular data with a predefined prompt template. As a result, it may not return accurate answers if the question is unclear, such as when the user mistypes or mismatches the name of a service listed in the report (e.g., “what is the cost of sagmaker on 03-2-2023?”). In this case, the agent returned a 0 value even though Amazon SageMaker had an associated cost for that particular date/row.

> Entering new AgentExecutor chain...
Thought: I need to find the cost of sagemaker on 03-2-2023
Action: python_repl_ast
Action Input: df.loc[df['Date'] == '2023-03-02', 'AmazonSageMaker']
Observation: 'AmazonSageMaker'
Thought: I now know the final answer
Final Answer: 0.0

> Finished chain.
0.0
(langchain_llm_venv) ahhakme@a483e7793a6b Desktop % python3 agent.py


> Entering new AgentExecutor chain...
Thought: I need to find the cost of sagmaker on 03-2-2023
Action: python_repl_ast
Action Input: df.loc[df['Date'] == '2023-03-02', 'Sagemaker']
Observation: 'Sagemaker'
Thought: I need to find the cost of sagmaker on 03-2-2023
Action: python_repl_ast
Action Input: df.loc[df['Date'] == '2023-03-02', 'Sagemaker'].values[0]
Observation: 'Sagemaker'
Thought: I now know the final answer
Final Answer: The cost of Sagemaker on 03-2-2023 is 0.0.

> Finished chain.
The cost of Sagemaker on 03-2-2023 is 0.0.
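One lightweight mitigation, applied outside the agent itself, is to resolve the user’s service name to the closest real column before building the prompt. Below is a sketch using the standard library’s difflib (`resolve_column` is our own helper, not part of LangChain, and the dataframe is a made-up fragment of the report):

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    "Date": ["2023-03-02"],
    "Amazon SageMaker": [190.05],
    "AWS CloudTrail": [1.1],
})

def resolve_column(name, df):
    """Map a possibly misspelled service name to the closest column name,
    matching case-insensitively; returns None if nothing is close enough."""
    lowered = {c.lower(): c for c in df.columns}
    matches = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=0.4)
    return lowered[matches[0]] if matches else None

col = resolve_column("sagmaker", df)
print(col)  # Amazon SageMaker
print(df.loc[df["Date"] == "2023-03-02", col].iloc[0])  # 190.05
```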

Below is the signature of LangChain’s predefined Pandas DataFrame Agent. It’s important to note that the prefix only provides instructions to the LLM about working with a pandas dataframe, without any additional context about the data itself.

def create_pandas_dataframe_agent(
    llm: langchain.llms.base.BaseLLM,
    df: Any,
    callback_manager: Optional[langchain.callbacks.base.BaseCallbackManager] = None,
    prefix: str = '\nYou are working with a pandas dataframe in Python. The name of the dataframe is `df`.\nYou should use the tools below to answer the question posed of you:',
    suffix: str = '\nThis is the result of `print(df.head())`:\n{df}\n\nBegin!\nQuestion: {input}\n{agent_scratchpad}',
    input_variables: Optional[List[str]] = None,
    verbose: bool = False,
    return_intermediate_steps: bool = False,
    max_iterations: Optional[int] = 15,
    max_execution_time: Optional[float] = None,
    early_stopping_method: str = 'force',
    **kwargs: Any,
) -> AgentExecutor:

To make the agent more optimized for our use case, we can modify the prompt template by informing the agent about its role as an AWS services expert and hinting at the nature of the dataframe, which, in this case, represents the cost of AWS services. We can achieve this by passing a modified prefix and suffix to the agent.

PREFIX = '''
You are working with a pandas dataframe in Python that contains AWS Cost Explorer data. The name of the dataframe is `df`.

The data includes various AWS services and their usage costs on different dates. You are an expert in AWS and know all service names.

Keep in mind that column names may be different from the input, i.e. the column name is "Amazon SageMaker" but the input could be "Sagemaker" or "Amazon SageMaker", etc.

You should use the tools below to answer the question posed of you:

python_repl_ast: A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.
'''

SUFFIX = '''
Use the following format:

Question: the input question you must answer

Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

This is the result of `print(df.head())`:
{df}

Begin!

Question: {input}
{agent_scratchpad}
'''

Now modify the run function by passing the prefix and suffix to the agent:

def run(api_key: str, prompt: str, local_file_path: str) -> str:
    df = pd.read_csv(local_file_path)

    agent = create_pandas_dataframe_agent(
        OpenAI(temperature=0, openai_api_key=api_key, model_name="text-davinci-003"),
        df,
        verbose=True,
        prefix=PREFIX,
        suffix=SUFFIX,
    )

    response = agent.run(prompt)
    print(response)

    return response

Let’s now ask the agent the same question: “what is the cost of sagmaker on 03-2-2023?”. This time, the agent returned an accurate answer regarding the Amazon SageMaker cost on 2023-03-02, which is 190.0521831555.


> Entering new AgentExecutor chain...
Thought: I need to find the cost of sagemaker on 03-2-2023
Action: python_repl_ast
Action Input: df[df['Date'] == '2023-03-02']['Amazon SageMaker'].values[0]
Observation: 190.0521831555
Thought:I now know the final answer
Final Answer: 190.0521831555

> Finished chain.
190.0521831555

Let’s take it one step further and ask the agent about the cost of a group of AWS services. For example, “what is the total cost of AWS security-related services?” Notice how the agent is able to perform the sum of multiple services under the AWS security-related services category, with the LLM providing the agent an input that includes all AWS security services.

> Entering new AgentExecutor chain...
Thought: I need to find the sum of all the columns related to security services
Action: python_repl_ast
Action Input: df[['AWS CloudTrail', 'AWS Config', 'AWS Config Rules', 'AWS Firewall Manager', 'AWS Inspector', 'AWS Key Management Service', 'AWS Organizations', 'AWS Shield', 'AWS WAF']].sum().sum()
Observation: "['AWS Config Rules', 'AWS Inspector', 'AWS Organizations', 'AWS Shield'] not in index"
Thought:I need to check the column names and make sure I am using the correct names
Action: python_repl_ast
Action Input: df.columns
Observation: Index(['Date', 'AWS Amplify', 'AWS CloudShell', 'AWS CloudTrail', 'AWS Config',
'AWS Firewall Manager', 'AWS Global Accelerator', 'AWS Glue',
.......
'CloudWatch Events', 'EC2 - Other', 'Total'],
dtype='object')
Thought:I need to use the correct column names
Action: python_repl_ast
Action Input: df[['AWS CloudTrail', 'AWS Config', 'AWS Firewall Manager', 'AWS Inspector', 'AWS Key Management Service', 'AWS Organizations', 'AWS Shield', 'AWS WAF']].sum().sum()
Observation: "['AWS Inspector', 'AWS Organizations', 'AWS Shield'] not in index"
Thought:I need to check the column names again and make sure I am using the correct names
Action: python_repl_ast
Action Input: df.columns
Observation: Index(['Date', 'AWS Amplify', 'AWS CloudShell', 'AWS CloudTrail', 'AWS Config',
'AWS Firewall Manager', 'AWS Global Accelerator', 'AWS Glue',
.......
'CloudWatch Events', 'EC2 - Other', 'Total'],
dtype='object')
Thought:I need to use the correct column names
Action: python_repl_ast
Action Input: df[['AWS CloudTrail', 'AWS Config', 'AWS Firewall Manager', 'AWS IoT Device Defender', 'AWS Key Management Service', 'AWS Security Hub', 'AWS WAF']].sum().sum()
Observation: 4593.5568078292
Thought:I now know the final answer
Final Answer: The cost of AWS Security related services is 4593.5568078292.

> Finished chain.
The cost of AWS Security related services is 4593.5568078292.
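The retries above happen because the LLM guesses column names that don’t exist in the report. When composing such a multi-column query yourself, intersecting the guessed list with `df.columns` avoids the failed lookups entirely; a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2023-03-01", "2023-03-02"],
    "AWS CloudTrail": [1.0, 1.5],
    "AWS Config": [2.0, 2.5],
    "AWS WAF": [3.0, 3.5],
})

# Keep only the guessed service names that actually exist as columns.
guessed = ["AWS CloudTrail", "AWS Config", "AWS Shield", "AWS WAF"]
present = [c for c in guessed if c in df.columns]
total = df[present].sum().sum()
print(present)  # ['AWS CloudTrail', 'AWS Config', 'AWS WAF']
print(total)    # 13.5
```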

Putting it all together


from langchain.llms import OpenAI
from langchain.agents import create_pandas_dataframe_agent
import pandas as pd



PREFIX = '''
You are working with a pandas dataframe in Python that contains AWS Cost Explorer data. The name of the dataframe is `df`.

The data includes various AWS services and their usage costs on different dates. You are an expert in AWS and know all service names.

Keep in mind that column names may be different from the input, i.e. the column name is "Amazon SageMaker" but the input could be "Sagemaker" or "Amazon SageMaker", etc.

You should use the tools below to answer the question posed of you:

python_repl_ast: A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.
'''

SUFFIX = '''
Use the following format:

Question: the input question you must answer

Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

This is the result of `print(df.head())`:
{df}

Begin!

Question: {input}
{agent_scratchpad}
'''

def run(api_key: str, prompt: str, local_file_path: str) -> str:
    df = pd.read_csv(local_file_path)

    agent = create_pandas_dataframe_agent(
        OpenAI(temperature=0.3, openai_api_key=api_key, model_name="text-davinci-003"),
        df,
        verbose=True,
        prefix=PREFIX,
        suffix=SUFFIX,
    )
    # agent = create_pandas_dataframe_agent(OpenAI(temperature=0, openai_api_key=api_key, model_name="text-davinci-003"), df, verbose=True)

    response = agent.run(prompt)
    print(response)

    return response

# Call the run function
api_key = "<OpenAI API Key>"  # Replace this with your actual API key
prompt = "what is the cost of AWS Security related services?"  # Replace this with the prompt you want to use
local_file_path = "cost_report_dummy.csv"  # Replace this with the path to your local CSV file

response = run(api_key, prompt, local_file_path)

Let’s bundle this together in a Docker container

Now that we’ve covered how LangChain can enhance LLMs to provide context-aware insights for tabular data, it’s time to put this into action. By bundling the code in a Docker container, we can deploy and run it to read your AWS Cost and Usage Reports (CUR) file from an S3 bucket, making it accessible through AWS Elastic Container Service (ECS). You can then set up a local Streamlit app that connects to the Docker container, letting you interact with the agent through a user-friendly interface.

Follow these steps to set up your environment and deploy the Docker container:

1. Clone the example repository:

https://github.com/ameerhakme/langchain-aws-cur-example

2. Install the dependencies:

This creates a Conda env named `langchain-aws-service-openai` and activates it:

conda deactivate
conda env create -f environment.yml # only needed once
conda activate langchain-aws-service-openai

3. Build and push the Docker image to your AWS ECR repository:

  • Retrieve an authentication token and authenticate your Docker client to your registry using the AWS CLI (assuming you already have a repository created); check this link for details on how to create an ECR repo:
aws ecr get-login-password --region <AWS_region> | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<AWS_region>.amazonaws.com
  • Build your Docker image using the following command:
docker build -t llm-cur .
  • After the build completes, tag your image so you can push it to this repository:
docker tag llm-cur:latest <account_id>.dkr.ecr.<AWS_region>.amazonaws.com/llm-cur:latest
  • Run the following command to push this image to your newly created AWS repository:
docker push <account_id>.dkr.ecr.<AWS_region>.amazonaws.com/llm-cur:latest

4. Deploy the Docker container to AWS ECS:

For more details, please visit the AWS documentation or follow the steps in the GitHub repo.

5. Run the local Streamlit UI:

  • Move to the webapp directory
cd webapp
  • Update the ECS task public IP in api.py: Open the api.py file in a text editor and replace <your-api-endpoint> with the public IP address of your ECS task.
  • Start the web application
streamlit run app.py
Demo to show the app in action

Considerations

While LangChain, LLMs, and agents provide powerful tools for extracting insights from tabular data, there are some limitations to be aware of:

  1. Limited context: LLMs and agents may not always be able to understand the full context of a dataset or specific domain knowledge. Providing more context via customized prefixes and suffixes can help but may not always be sufficient.
  2. Ambiguity in queries: If the user’s query is ambiguous or contains typos, the agent may return incorrect or irrelevant results. It’s important to ask clear and well-formed questions to get the best results.
  3. Scalability: When working with large datasets or complex queries, the performance of the agent may degrade. The agent may take longer to process the query, or may struggle to find the best approach for answering the question.
  4. Dependence on quality of data: The accuracy and usefulness of the insights generated by LLMs and agents depend on the quality of the input data. If the data is incomplete, inconsistent, or contains errors, the generated insights may be less reliable.
  5. Model limitations: LLMs are continually being improved, but they are not perfect. They may generate plausible-sounding but incorrect answers or may not fully understand certain nuances in the data or questions.
  6. Customization effort: To optimize the agent for a specific use case, users may need to invest time in customizing the prefix, suffix, and other parameters. This may require a deeper understanding of the problem domain and the agent’s inner workings.
  7. Model choice: In this example, we utilized OpenAI’s text-davinci-003 language model, but it can be substituted with any other LLM that suits your requirements; it’s worth experimenting with alternatives.

Conclusion

Despite these limitations, the combination of LangChain and LLMs offers a promising approach to extracting valuable insights from tabular data, and continued development and refinement of these tools will likely address many of these challenges.

References and credits

  1. LangChain GitHub Repo: https://github.com/hwchase17/langchain
  2. Streamlit: https://streamlit.io/
  3. LangChain Docs: https://langchain.readthedocs.io/en/latest/index.html
  4. LangChain Querying Tabular Data: https://python.langchain.com/en/latest/use_cases/tabular.html
  5. LangChain Pandas DataFrame Agent: https://python.langchain.com/en/latest/modules/agents/toolkits/examples/pandas.html
  6. Getting started with LangChain — A powerful tool for working with Large Language Models: https://medium.com/@avra42/getting-started-with-langchain-a-powerful-tool-for-working-with-large-language-models-286419ba0842

Thank you for reading!

Written by Ameer Hakme
Sr. Solutions Architect at Amazon Web Services