
Deploy a Streaming RAG Chatbot with Docker, FastAPI, and a Streamlit UI


This post is a simple, practical guide to building your own smart chatbot that can answer questions using your private documents. It shows how to connect your data in an AWS Bedrock Knowledge Base to a FastAPI app written in Python, so the chatbot can search your documents and explain its answers clearly, like a well-informed helper. You'll learn how to run it either on your own machine or inside a Docker container, making it easy to use and share anywhere. In short, it's like teaching your computer to "read your files" and talk back with real answers, step by step.


This comprehensive runbook will guide you through building a complete, end-to-end RAG application. The final solution will include:

  • A streaming FastAPI backend that provides in-text citations.
  • An interactive Streamlit chatbot interface for users.
  • Integration with AWS Bedrock Knowledge Base for efficient document retrieval.
  • Instructions for running the backend with Docker or locally.

Project Structure

Your final project directory (rag_chatbot_api/) will contain the following files:

rag_chatbot_api/
├── main.py               # The FastAPI backend application
├── chatbot_ui.py         # The Streamlit frontend application
├── environment.yml       # Conda dependencies for both apps
├── litellm_config.yaml   # Configuration for the LLM proxy
├── Dockerfile            # Instructions to containerize the FastAPI backend
└── .env                  # Environment variables for configuration

Step 1: Prerequisites - Your Bedrock Knowledge Base

This is the foundation. You must have a Bedrock Knowledge Base (KB) set up and synced with your documents in S3.

Your Action Item: Navigate to the AWS Console -> Bedrock -> Knowledge bases. Create your KB if you haven’t already. Once it’s ready, find and copy the Knowledge base ID (e.g., ABC123XYZ).
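
If you prefer the command line, you can also look up the ID with the AWS CLI (this assumes your AWS credentials and default region are already configured):

# List knowledge bases and their IDs in the current account/region
aws bedrock-agent list-knowledge-bases \
  --query 'knowledgeBaseSummaries[].[name, knowledgeBaseId]' --output table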


Step 2: Set up the LiteLLM Proxy

To set up the LiteLLM proxy, run the following commands in a shell:

# Get the code
curl -O https://raw.githubusercontent.com/BerriAI/litellm/main/docker-compose.yml
curl -O https://raw.githubusercontent.com/BerriAI/litellm/main/prometheus.yml

# Add the master key - you can change this after setup
echo 'LITELLM_MASTER_KEY="sk-1234"' > .env

# Add the LiteLLM salt key - you cannot change this after adding a model.
# It is used to encrypt / decrypt your LLM API key credentials.
# We recommend using a password generator (e.g. https://1password.com/password-generator/)
# to create a random value for the salt key.
echo 'LITELLM_SALT_KEY="sk-1234"' >> .env

source .env

# Start
docker compose up

For further details, refer to the LiteLLM docs.
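
The backend in Step 3 requests the model by the alias claude-v2-instruct, so litellm_config.yaml needs to map that alias to a model the proxy can call. Here is a minimal sketch; the Bedrock model ID and region are assumptions, so substitute whichever model your account has access to, and mount the file into the proxy container as described in the LiteLLM docs:

# litellm_config.yaml (sketch)
model_list:
  - model_name: claude-v2-instruct          # alias used by the FastAPI backend
    litellm_params:
      model: bedrock/anthropic.claude-v2    # assumed Bedrock model ID - replace with one you can access
      aws_region_name: us-east-1            # assumed region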


Step 3: Prepare the FastAPI Application & Dependencies

Here we set up the project files for both the backend and the new Streamlit frontend.

  1. Create Project Directory:

    mkdir rag_chatbot_api
    cd rag_chatbot_api
    
  2. Define Conda Environment (environment.yml): This file includes dependencies for both the FastAPI backend and the Streamlit UI.

    name: rag-env
    channels:
      - conda-forge
    dependencies:
      - python=3.9
      - fastapi
      - uvicorn-standard
      - pydantic
      - boto3
      - openai
      - streamlit  # Added for the UI
      - requests   # Added for the UI to call the backend
    
  3. Write the Backend Code (main.py): This is the streaming heart of our application: it retrieves relevant chunks from the Bedrock Knowledge Base, builds a citation-aware prompt, and streams the LLM's answer back to the client.

    # main.py
    import os
    import boto3
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    from openai import AsyncOpenAI
    from fastapi.responses import StreamingResponse
    
    # --- Configuration & Clients ---
    AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
    BEDROCK_KB_ID = os.getenv("BEDROCK_KB_ID")
    
    # Client for Bedrock Knowledge Base
    bedrock_agent_client = boto3.client("bedrock-agent-runtime", region_name=AWS_REGION)
    
    # Client for LiteLLM Proxy (OpenAI SDK compatible)
    litellm_client = AsyncOpenAI(
        base_url=os.getenv("LITELLM_PROXY_URL"),
        api_key=os.getenv("LITELLM_API_KEY"),
    )
    
    app = FastAPI(title="Streaming Bedrock KB RAG API")
    
    class QueryRequest(BaseModel):
        query: str
        model: str = "claude-v2-instruct" # The alias from our LiteLLM config
        numberOfResults: int = 5
    
    def create_prompt(query: str, retrieval_results: list) -> str:
        source_map = {}
        context_for_llm = ""
        citation_counter = 1
    
        for result in retrieval_results:
            uri = result.get('location', {}).get('s3Location', {}).get('uri', 'Unknown Source')
            if uri not in source_map:
                source_map[uri] = citation_counter
                citation_counter += 1
    
            citation_num = source_map[uri]
            text_chunk = result['content']['text']
            context_for_llm += f"Source [{citation_num}]: \"{text_chunk}\"\n---\n"
    
        citation_list = "\n".join([f"[{num}] {uri}" for uri, num in sorted(source_map.items(), key=lambda item: item[1])])
    
        return (
            "You are a helpful assistant. Your task is to answer the user's question based *only* on the provided sources. "
            "Follow these instructions exactly:\n"
            "1. Synthesize a comprehensive answer using information from the sources.\n"
            "2. For every piece of information you use, you MUST cite the source by appending the corresponding source number in brackets, like `[1]`, `[2]`, etc.\n"
            "3. After you have finished the answer, list all the sources you used under a `Sources:` heading. Use the exact mapping provided below.\n"
            "4. Do not include any information that is not from the provided sources.\n\n"
            f"---SOURCES---\n{context_for_llm}\n"
            f"---SOURCE MAPPING---\n{citation_list}\n\n"
            f"---USER QUESTION---\n{query}\n\n"
            "ANSWER:"
        )
    
    @app.post("/document-chat-stream")
    async def document_chat_stream(request: QueryRequest):
        if not BEDROCK_KB_ID:
            raise HTTPException(status_code=500, detail="BEDROCK_KB_ID not configured.")
    
        try:
            response = bedrock_agent_client.retrieve(
                knowledgeBaseId=BEDROCK_KB_ID,
                retrievalQuery={'text': request.query},
                retrievalConfiguration={'vectorSearchConfiguration': {'numberOfResults': request.numberOfResults}}
            )
            retrieval_results = response.get('retrievalResults', [])
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error retrieving from Bedrock KB: {e}")
    
        prompt = create_prompt(request.query, retrieval_results)
    
        async def stream_generator():
            try:
                stream = await litellm_client.chat.completions.create(
                    model=request.model,
                    messages=[{"role": "user", "content": prompt}],
                    stream=True
                )
                async for chunk in stream:
                    content = chunk.choices[0].delta.content
                    if content:
                        yield content
            except Exception as e:
                yield f"\n\nERROR: Could not get response from LLM. Details: {str(e)}"
    
        return StreamingResponse(stream_generator(), media_type="text/plain")
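
To see what the citation prompt actually looks like, you can call create_prompt directly with a fabricated retrieval result; the S3 URI, chunk text, and question below are made up purely for illustration:

    # Hypothetical retrieval result shaped like the Bedrock retrieve() response
    sample_results = [{
        "location": {"s3Location": {"uri": "s3://my-bucket/onboarding.pdf"}},
        "content": {"text": "New hires must complete security training within 30 days."},
    }]

    # Prints a prompt with a Source [1] block, a ---SOURCE MAPPING--- section
    # ([1] s3://my-bucket/onboarding.pdf), and the user question at the end.
    print(create_prompt("When is security training due?", sample_results))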
    

Step 4: Create the Streamlit Chatbot Client

Now, let’s create the user-facing application.

Create the Frontend Code (chatbot_ui.py): This script uses Streamlit’s chat components and the requests library to communicate with our FastAPI backend.

# chatbot_ui.py
import streamlit as st
import requests
import json

# --- Page Configuration ---
st.set_page_config(
    page_title="RAG Chatbot",
    page_icon="🤖",
    layout="wide"
)
st.title("RAG Chatbot with Citations 🤖")

# --- Constants ---
FASTAPI_URL = "http://localhost:8000/document-chat-stream"

# --- Session State Initialization ---
# Ensures that the message history is preserved across reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# --- Display Chat History ---
# Renders the existing chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# --- Main Chat Interaction Logic ---
if prompt := st.chat_input("Ask me anything about your documents..."):
    # Add user message to session state and display it
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Display assistant response in a streaming fashion
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""

        try:
            # Prepare the data for the POST request
            data = {"query": prompt}
            # Use requests to stream the response from the FastAPI endpoint
            with requests.post(FASTAPI_URL, json=data, stream=True) as r:
                r.raise_for_status()  # Raise an exception for bad status codes
                for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
                    full_response += chunk
                    message_placeholder.markdown(full_response + "▌") # "▌" gives a typing cursor effect
            message_placeholder.markdown(full_response)
        except requests.exceptions.RequestException as e:
            st.error(f"Could not connect to the backend: {e}")
            full_response = "Sorry, I couldn't connect to the processing service."
        except Exception as e:
            st.error(f"An unexpected error occurred: {e}")
            full_response = "An unexpected error occurred."
            
    # Add the final, complete assistant response to the session state
    st.session_state.messages.append({"role": "assistant", "content": full_response})

Step 5: Run the Backend (Choose One Option)

You need to run the FastAPI backend so the Streamlit UI can talk to it. Open a second terminal for this (keep the LiteLLM proxy from Step 2 running in the first).

Option A: Run with Docker (Recommended for Deployment) 🐳

  1. Create the Dockerfile: This containerizes the FastAPI backend using the same Conda environment.

    FROM continuumio/miniconda3
    WORKDIR /app
    COPY environment.yml .
    RUN conda env create -f environment.yml
    COPY main.py .
    EXPOSE 8000
    CMD ["conda", "run", "-n", "rag-env", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
    
  2. Create the .env file for Docker:

    # .env file
    AWS_REGION=us-east-1
    AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
    BEDROCK_KB_ID=YOUR_KNOWLEDGE_BASE_ID
    LITELLM_PROXY_URL=http://host.docker.internal:4000
    LITELLM_API_KEY=sk-1234
    
  3. Build and Run the Container (Terminal 2):

    docker build -t rag-chatbot-api .
    docker run --env-file .env -p 8000:8000 rag-chatbot-api
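
Note: the .env above points LITELLM_PROXY_URL at host.docker.internal, which resolves automatically on Docker Desktop (macOS/Windows). On plain Linux it usually does not; one workaround is to map it to the host gateway when starting the container:

    docker run --env-file .env --add-host=host.docker.internal:host-gateway -p 8000:8000 rag-chatbot-api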
    

Option B: Run Locally (For Development) 💻

  1. Create & Activate Conda Environment (Terminal 2):

    conda env create -f environment.yml
    conda activate rag-env
    
  2. Set Local Environment Variables (Terminal 2):

    • PowerShell (Windows):
      $env:LITELLM_PROXY_URL="http://localhost:4000"
      # ... set other AWS and Bedrock variables ...
      
    • Bash (macOS/Linux):
      export LITELLM_PROXY_URL="http://localhost:4000"
      # ... set other AWS and Bedrock variables ...
      
  3. Run the FastAPI Server (Terminal 2):

    uvicorn main:app --reload --host 0.0.0.0 --port 8000
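
Whichever option you chose, you can sanity-check the streaming endpoint with curl before launching the UI. The question below is just a placeholder; -N disables buffering so you see tokens as they arrive:

    curl -N -X POST http://localhost:8000/document-chat-stream \
      -H "Content-Type: application/json" \
      -d '{"query": "What does the onboarding guide say about security training?"}'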
    

Step 6: Run the Streamlit UI

Finally, let’s launch the user interface.

  1. Open a Third Terminal (Terminal 3).

  2. Activate the Conda Environment:

    conda activate rag-env
    
  3. Run the Streamlit App:

    streamlit run chatbot_ui.py
    
  4. Open Your Browser: Streamlit will provide a URL, typically http://localhost:8501. Open this link to start chatting with your documents!