Mastering AI-Powered Web Search: Integrating Google Search with LangChain for Intelligent Query Handling

G M NITHIN SAI
4 min read · Jun 15, 2024


In today’s information-rich world, harnessing the vast resources of the internet is essential. Imagine a system that seamlessly connects to Google, retrieves relevant information, and uses an advanced language model to deliver precise answers all automatically. With the LangChain framework, this is possible. In this tutorial, we’ll show you how to integrate Google Search with LangChain, transforming simple queries into comprehensive, intelligent responses. Let’s get started!

To access Google search content programmatically, you’ll need two crucial pieces: a Search Engine ID and a Google Search API Key. These elements enable your application to interact with Google’s search infrastructure securely and efficiently.

Steps to get a Search Engine ID and Google Search API key

  1. Sign up for a Google account if you don’t already have one. If you have never created a Google APIs Console project, create one in the Google Cloud Console: https://console.cloud.google.com/.
  2. Enable the Custom Search API for your project.

3. To create an API key:

  • Navigate to the APIs & Services → Credentials panel in Cloud Console.
  • Select Create credentials, then select API key from the drop-down menu.
  • The API key created dialog box displays your newly created key. You now have an API key.

4. Set up a Custom Search Engine so you can search the entire web:

  • Create a custom search engine here: https://programmablesearchengine.google.com/.
  • In the What to search section, pick the Search the entire web option. After the search engine is created, click on it to find the Search engine ID.

You now have the API key and Search Engine ID. To use them in code, install the Google API client package and set the keys as environment variables:

pip install google-api-python-client

import os

os.environ["GOOGLE_CSE_ID"] = "<your-google-cse-id>"
os.environ["GOOGLE_API_KEY"] = "<your-google-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
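A missing or empty key only surfaces later as a cryptic API error, so it helps to confirm all three variables are set before running the pipeline. A small sketch of such a check (`missing_vars` is a hypothetical helper, not part of the original code):

```python
import os

REQUIRED_VARS = ("GOOGLE_CSE_ID", "GOOGLE_API_KEY", "OPENAI_API_KEY")

def missing_vars(required=REQUIRED_VARS):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

# Fail early with a clear message instead of a confusing API error later.
if missing_vars():
    print("Missing environment variables:", missing_vars())
```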

Step-by-Step Guide

  1. Initialize the Language Model (LLM):

We start by setting up our language model using OpenAI’s GPT-3.5 Turbo.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

2. Configure the Prompt Template:

We’ll use a predefined prompt from the LangChain hub to structure our agent’s responses.

from langchain import hub

prompt = hub.pull("hwchase17/openai-tools-agent")

3. Utility Functions:

These functions handle text cleaning, web searches, and processing of search results. GoogleSearchAPIWrapper reads the API key and search engine ID from the environment variables and runs the search; UnstructuredURLLoader scrapes the content from the URLs extracted from the search results.

import re
from langchain_community.utilities import GoogleSearchAPIWrapper
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def clean_text(raw_text):
    # Collapse newlines, drop non-ASCII characters, and normalize whitespace
    cleaned_text = re.sub(r'\n+', ' ', raw_text)
    cleaned_text = ''.join(char for char in cleaned_text if ord(char) < 128)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    return cleaned_text.strip()

search = GoogleSearchAPIWrapper(k=6)

def top5_results(query):
    return search.results(query, 6)

def get_urls_of_results(results):
    return [result['link'] for result in results]

def get_page_contents_from_url(urls):
    loader = UnstructuredURLLoader(urls=urls)
    data = loader.load()
    return [each_data.page_content for each_data in data]
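For reference, `search.results` returns a list of dicts, each carrying at least `title`, `link`, and `snippet` keys; `link` is the field used above. A mocked example of that shape (the URLs are placeholders, and the helper is reproduced so the snippet runs on its own):

```python
# Mocked shape of GoogleSearchAPIWrapper.results() output (URLs are placeholders)
mock_results = [
    {"title": "T20 World Cup 2024", "link": "https://example.org/t20", "snippet": "..."},
    {"title": "ICC tournament news", "link": "https://example.org/icc", "snippet": "..."},
]

def get_urls_of_results(results):
    # Same helper as above, repeated so this snippet is standalone
    return [result['link'] for result in results]

urls = get_urls_of_results(mock_results)
# urls == ["https://example.org/t20", "https://example.org/icc"]
```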

4. Document Preparation:

Clean and split the content into manageable chunks for processing.

def get_documents(page_contents):
    cleaned_text = " ".join(clean_text(content) for content in page_contents)
    documents = [Document(page_content=cleaned_text)]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)
    return text_splitter.split_documents(documents)
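RecursiveCharacterTextSplitter is smarter than a fixed window (it prefers to break on paragraph and sentence boundaries), but the chunk_size/chunk_overlap mechanic can be sketched with a plain character window, assuming the same 500/80 settings:

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=80):
    """Simplified splitter: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one, so consecutive chunks share 80 chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("a" * 1200)
# 1200 chars with a 420-char step -> 3 chunks of lengths 500, 500, 360
```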

5. Retriever Tool:

Set up the retriever tool to fetch relevant information from the documents. We use OpenAIEmbeddings to embed the documents, with FAISS as the vector store.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool

openai_embeddings = OpenAIEmbeddings()

def retriever_tool(documents):
    retriever = FAISS.from_documents(documents, openai_embeddings).as_retriever(search_kwargs={"k": 7})
    tool = create_retriever_tool(
        retriever,
        "fetch_data_from_documents",
        "Retrieve the most relevant information from the provided documents.",
    )
    return [tool]
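Conceptually, the retriever embeds the query and returns the k documents whose vectors are most similar to it. A toy version of that ranking, using hand-made 2-D vectors as stand-ins for real OpenAI embeddings (which have far more dimensions):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy "embeddings": in the real pipeline these come from OpenAIEmbeddings.
doc_vectors = {
    "doc_about_cricket": [0.9, 0.1],
    "doc_about_cooking": [0.1, 0.9],
    "doc_about_sports": [0.8, 0.3],
}
query_vector = [1.0, 0.0]  # pretend this embeds a cricket-related query

def top_k(query, docs, k=2):
    """Return the names of the k documents most similar to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)[:k]

best = top_k(query_vector, doc_vectors)
# best == ["doc_about_cricket", "doc_about_sports"]
```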

6. Agent Executor:

Create an agent executor to handle user queries using the configured LLM and tools.

from langchain.agents import AgentExecutor, create_openai_tools_agent

def agent_executor(tools):
    agent = create_openai_tools_agent(llm, tools, prompt)
    executor = AgentExecutor.from_agent_and_tools(
        agent=agent,
        tools=tools,
        return_intermediate_steps=True,
        handle_parsing_errors=True,
    )
    return executor

7. Query Execution:

Execute a query to fetch and process information. The overall flow comes together here:

query = "what are the teams qualified for t20 world cup"

results = top5_results(query)
urls = get_urls_of_results(results)
page_contents = get_page_contents_from_url(urls)
documents = get_documents(page_contents)
tools = retriever_tool(documents)
executor = agent_executor(tools)

result = executor.invoke({"input": query})
print(result['output'])
Output: 'The teams that have qualified for the T20 World Cup Super 8 stage are India, Australia, Afghanistan, South Africa, USA, and West Indies.'

Tip: Customizing the Prompt

To ensure the agent provides descriptive answers, update the prompt template:

prompt.messages[0].prompt.template = (
    "You are a helpful AI assistant. Your task is to generate "
    "descriptive answers for the given query from the provided documents."
)
Output: 'The teams that have qualified for the T20 World Cup Super 8 stage are India, Australia, Afghanistan, South Africa, USA, and West Indies. These teams secured their places in the Super 8 by finishing in the top two positions in their respective groups during the tournament.'

Links:

GitHub Code Link

Conclusion

By following these steps, you’ve created a powerful AI agent capable of querying and processing information from the web using LangChain and OpenAI. This setup can be tailored further to fit various applications, making it a versatile tool for automated information retrieval and processing.
