
Deploy a Streaming RAG Chatbot with Docker, FastAPI, and a Streamlit UI


This post is a simple, practical guide to building your own smart chatbot that can answer questions using your private documents. It shows how to connect your data in an AWS Bedrock Knowledge Base to a FastAPI app written in Python, so the chatbot can search your documents and explain its answers clearly, like a well-informed helper. You'll learn how to run it either on your own machine or inside a Docker container, making it easy to use and share anywhere. In short, it's like teaching your computer to "read your files" and talk back with real answers, step by step.


This comprehensive runbook will guide you through building a complete, end-to-end RAG application. The final solution will include:

  • A streaming FastAPI backend that provides in-text citations.
  • An interactive Streamlit chatbot interface for users.
  • Integration with AWS Bedrock Knowledge Base for efficient document retrieval.
  • Instructions for running the backend with Docker or locally.

Project Structure

Your final project directory (rag_chatbot_api/) will contain the following files:

rag_chatbot_api/
├── main.py               # The FastAPI backend application
├── chatbot_ui.py         # The Streamlit frontend application
├── environment.yml       # Conda dependencies for both apps
├── litellm_config.yaml   # Configuration for the LLM proxy
├── Dockerfile            # Instructions to containerize the FastAPI backend
└── .env                  # Environment variables for configuration

Step 1: Prerequisites - Your Bedrock Knowledge Base

This is the foundation. You must have a Bedrock Knowledge Base (KB) set up and synced with your documents in S3.

Your Action Item: Navigate to the AWS Console -> Bedrock -> Knowledge bases. Create your KB if you haven’t already. Once it’s ready, find and copy the Knowledge base ID (e.g., ABC123XYZ).
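
If you prefer the command line, you can also look up the ID with the AWS CLI (this assumes your AWS credentials and default region are already configured):

# List knowledge bases and their IDs in the current account/region
aws bedrock-agent list-knowledge-bases \
  --query 'knowledgeBaseSummaries[].[name, knowledgeBaseId]' --output table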


Step 2: Set up the LiteLLM Proxy

To set up the LiteLLM proxy, run the following commands in a shell:

# Get the code
curl -O https://raw.githubusercontent.com/BerriAI/litellm/main/docker-compose.yml
curl -O https://raw.githubusercontent.com/BerriAI/litellm/main/prometheus.yml

# Add the master key - you can change this after setup
echo 'LITELLM_MASTER_KEY="sk-1234"' > .env

# Add the LiteLLM salt key - you cannot change this after adding a model.
# It is used to encrypt / decrypt your LLM API key credentials.
# We recommend using a password generator (e.g. https://1password.com/password-generator/)
# to create a random value for the salt key.
echo 'LITELLM_SALT_KEY="sk-1234"' >> .env

source .env

# Start
docker compose up

For further details, refer to the LiteLLM docs.
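
The backend in Step 3 requests the model by the alias claude-v2-instruct, so litellm_config.yaml needs to map that alias to a model the proxy can call. Here is a minimal sketch; the Bedrock model ID and region are assumptions, so substitute whichever model your account has access to, and mount the file into the proxy container as described in the LiteLLM docs:

# litellm_config.yaml (sketch)
model_list:
  - model_name: claude-v2-instruct          # alias used by the FastAPI backend
    litellm_params:
      model: bedrock/anthropic.claude-v2    # assumed Bedrock model ID - replace with one you can access
      aws_region_name: us-east-1            # assumed region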


Step 3: Prepare the FastAPI Application & Dependencies

Here we set up the project files for both the backend and the new Streamlit frontend.

  1. Create Project Directory:

    mkdir rag_chatbot_api
    cd rag_chatbot_api
    
  2. Define Conda Environment (environment.yml): This file includes dependencies for both the FastAPI backend and the Streamlit UI.

    name: rag-env
    channels:
      - conda-forge
    dependencies:
      - python=3.9
      - fastapi
      - uvicorn-standard
      - pydantic
      - boto3
      - openai
      - streamlit  # Added for the UI
      - requests   # Added for the UI to call the backend
    
  3. Write the Backend Code (main.py): This is the streaming heart of our application: it retrieves relevant chunks from the Bedrock Knowledge Base, builds a citation-aware prompt, and streams the LLM's answer back to the client.

    # main.py
    import os
    import boto3
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    from openai import AsyncOpenAI
    from fastapi.responses import StreamingResponse
    
    # --- Configuration & Clients ---
    AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
    BEDROCK_KB_ID = os.getenv("BEDROCK_KB_ID")
    
    # Client for Bedrock Knowledge Base
    bedrock_agent_client = boto3.client("bedrock-agent-runtime", region_name=AWS_REGION)
    
    # Client for LiteLLM Proxy (OpenAI SDK compatible)
    litellm_client = AsyncOpenAI(
        base_url=os.getenv("LITELLM_PROXY_URL"),
        api_key=os.getenv("LITELLM_API_KEY"),
    )
    
    app = FastAPI(title="Streaming Bedrock KB RAG API")
    
    class QueryRequest(BaseModel):
        query: str
        model: str = "claude-v2-instruct" # The alias from our LiteLLM config
        numberOfResults: int = 5
    
    def create_prompt(query: str, retrieval_results: list) -> str:
        source_map = {}
        context_for_llm = ""
        citation_counter = 1
    
        for result in retrieval_results:
            uri = result.get('location', {}).get('s3Location', {}).get('uri', 'Unknown Source')
            if uri not in source_map:
                source_map[uri] = citation_counter
                citation_counter += 1
    
            citation_num = source_map[uri]
            text_chunk = result['content']['text']
            context_for_llm += f"Source [{citation_num}]: \"{text_chunk}\"\n---\n"
    
        citation_list = "\n".join([f"[{num}] {uri}" for uri, num in sorted(source_map.items(), key=lambda item: item[1])])
    
        return (
            "You are a helpful assistant. Your task is to answer the user's question based *only* on the provided sources. "
            "Follow these instructions exactly:\n"
            "1. Synthesize a comprehensive answer using information from the sources.\n"
            "2. For every piece of information you use, you MUST cite the source by appending the corresponding source number in brackets, like `[1]`, `[2]`, etc.\n"
            "3. After you have finished the answer, list all the sources you used under a `Sources:` heading. Use the exact mapping provided below.\n"
            "4. Do not include any information that is not from the provided sources.\n\n"
            f"---SOURCES---\n{context_for_llm}\n"
            f"---SOURCE MAPPING---\n{citation_list}\n\n"
            f"---USER QUESTION---\n{query}\n\n"
            "ANSWER:"
        )
    
    @app.post("/document-chat-stream")
    async def document_chat_stream(request: QueryRequest):
        if not BEDROCK_KB_ID:
            raise HTTPException(status_code=500, detail="BEDROCK_KB_ID not configured.")
    
        try:
            response = bedrock_agent_client.retrieve(
                knowledgeBaseId=BEDROCK_KB_ID,
                retrievalQuery={'text': request.query},
                retrievalConfiguration={'vectorSearchConfiguration': {'numberOfResults': request.numberOfResults}}
            )
            retrieval_results = response.get('retrievalResults', [])
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error retrieving from Bedrock KB: {e}")
    
        prompt = create_prompt(request.query, retrieval_results)
    
        async def stream_generator():
            try:
                stream = await litellm_client.chat.completions.create(
                    model=request.model,
                    messages=[{"role": "user", "content": prompt}],
                    stream=True
                )
                async for chunk in stream:
                    content = chunk.choices[0].delta.content
                    if content:
                        yield content
            except Exception as e:
                yield f"\n\nERROR: Could not get response from LLM. Details: {str(e)}"
    
        return StreamingResponse(stream_generator(), media_type="text/plain")
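
To see what the citation prompt actually looks like, you can call create_prompt directly with a fabricated retrieval result; the S3 URI, chunk text, and question below are made up purely for illustration:

    # Hypothetical retrieval result shaped like the Bedrock retrieve() response
    sample_results = [{
        "location": {"s3Location": {"uri": "s3://my-bucket/onboarding.pdf"}},
        "content": {"text": "New hires must complete security training within 30 days."},
    }]

    # Prints a prompt with a Source [1] block, a ---SOURCE MAPPING--- section
    # ([1] s3://my-bucket/onboarding.pdf), and the user question at the end.
    print(create_prompt("When is security training due?", sample_results))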
    

Step 4: Create the Streamlit Chatbot Client

Now, let’s create the user-facing application.

Create the Frontend Code (chatbot_ui.py): This script uses Streamlit’s chat components and the requests library to communicate with our FastAPI backend.

# chatbot_ui.py
import streamlit as st
import requests
import json

# --- Page Configuration ---
st.set_page_config(
    page_title="RAG Chatbot",
    page_icon="🤖",
    layout="wide"
)
st.title("RAG Chatbot with Citations 🤖")

# --- Constants ---
FASTAPI_URL = "http://localhost:8000/document-chat-stream"

# --- Session State Initialization ---
# Ensures that the message history is preserved across reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# --- Display Chat History ---
# Renders the existing chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# --- Main Chat Interaction Logic ---
if prompt := st.chat_input("Ask me anything about your documents..."):
    # Add user message to session state and display it
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Display assistant response in a streaming fashion
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""

        try:
            # Prepare the data for the POST request
            data = {"query": prompt}
            # Use requests to stream the response from the FastAPI endpoint
            with requests.post(FASTAPI_URL, json=data, stream=True) as r:
                r.raise_for_status()  # Raise an exception for bad status codes
                for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
                    full_response += chunk
                    message_placeholder.markdown(full_response + "▌") # "▌" gives a typing cursor effect
            message_placeholder.markdown(full_response)
        except requests.exceptions.RequestException as e:
            st.error(f"Could not connect to the backend: {e}")
            full_response = "Sorry, I couldn't connect to the processing service."
        except Exception as e:
            st.error(f"An unexpected error occurred: {e}")
            full_response = "An unexpected error occurred."
            
    # Add the final, complete assistant response to the session state
    st.session_state.messages.append({"role": "assistant", "content": full_response})

Step 5: Run the Backend (Choose One Option)

You need to run the FastAPI backend so the Streamlit UI can talk to it. Open a second terminal for this (keep the LiteLLM proxy from Step 2 running in the first).

Option A: Run with Docker (Recommended for Deployment) 🐳

  1. Create the Dockerfile: This containerizes the FastAPI backend using the same Conda environment.

    FROM continuumio/miniconda3
    WORKDIR /app
    COPY environment.yml .
    RUN conda env create -f environment.yml
    COPY main.py .
    EXPOSE 8000
    CMD ["conda", "run", "-n", "rag-env", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
    
  2. Create the .env file for Docker:

    # .env file
    AWS_REGION=us-east-1
    AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
    BEDROCK_KB_ID=YOUR_KNOWLEDGE_BASE_ID
    LITELLM_PROXY_URL=http://host.docker.internal:4000
    LITELLM_API_KEY=sk-1234
    
  3. Build and Run the Container (Terminal 2):

    docker build -t rag-chatbot-api .
    docker run --env-file .env -p 8000:8000 rag-chatbot-api
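
Note: the .env above points LITELLM_PROXY_URL at host.docker.internal, which resolves automatically on Docker Desktop (macOS/Windows). On plain Linux it usually does not; one workaround is to map it to the host gateway when starting the container:

    docker run --env-file .env --add-host=host.docker.internal:host-gateway -p 8000:8000 rag-chatbot-api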
    

Option B: Run Locally (For Development) 💻

  1. Create & Activate Conda Environment (Terminal 2):

    conda env create -f environment.yml
    conda activate rag-env
    
  2. Set Local Environment Variables (Terminal 2):

    • PowerShell (Windows):
      $env:LITELLM_PROXY_URL="http://localhost:4000"
      # ... set other AWS and Bedrock variables ...
      
    • Bash (macOS/Linux):
      export LITELLM_PROXY_URL="http://localhost:4000"
      # ... set other AWS and Bedrock variables ...
      
  3. Run the FastAPI Server (Terminal 2):

    uvicorn main:app --reload --host 0.0.0.0 --port 8000
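
Whichever option you chose, you can sanity-check the streaming endpoint with curl before launching the UI. The question below is just a placeholder; -N disables buffering so you see tokens as they arrive:

    curl -N -X POST http://localhost:8000/document-chat-stream \
      -H "Content-Type: application/json" \
      -d '{"query": "What does the onboarding guide say about security training?"}'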
    

Step 6: Run the Streamlit UI

Finally, let’s launch the user interface.

  1. Open a Third Terminal (Terminal 3).

  2. Activate the Conda Environment:

    conda activate rag-env
    
  3. Run the Streamlit App:

    streamlit run chatbot_ui.py
    
  4. Open Your Browser: Streamlit will provide a URL, typically http://localhost:8501. Open this link to start chatting with your documents!