Open-Webui Mixture of Agents part 2
YouTube Video: https://youtu.be/KxT7lHaPDJ4
Introduction
The Mixture-of-Agents (MoA) methodology has demonstrated state-of-the-art performance using open-source models, as detailed in my previous blog. We have created two pipelines (Groq and Ollama) for Open-WebUI. These pipelines serve as versatile, UI-agnostic OpenAI-compatible plugin frameworks.
In this blog, we will demonstrate how MoA can be integrated into Open WebUI, an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. Open WebUI supports various LLM runners, including Ollama and OpenAI-compatible APIs. Our project aims to incorporate MoA into this robust platform, bringing it to state-of-the-art standards.
GitHub repository for Pipelines: https://github.com/open-webui/pipelines
GitHub repository for Open WebUI: https://github.com/open-webui/open-webui
Understanding the Mixture-of-Agents (MoA) Methodology
Definition and Concept
The Mixture-of-Agents (MoA) methodology is an innovative approach that leverages the strengths of multiple LLMs to enhance overall performance. Unlike traditional single-model approaches, MoA constructs a layered architecture where each layer consists of multiple LLM agents. Each agent processes the outputs from the previous layer’s agents as auxiliary information, iteratively refining the responses. This collaborative framework allows MoA to achieve state-of-the-art performance by combining the diverse capabilities of different LLMs.
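To make the layered flow concrete, here is a minimal, purely illustrative Python sketch; the function and agent names are invented for illustration and are not part of our pipelines:

```python
# Purely illustrative sketch of the layered MoA flow; names are invented, not pipeline APIs.
def moa_answer(prompt, layers, aggregator):
    """Each layer's agents see the previous layer's outputs as auxiliary references."""
    references = []
    for layer in layers:                   # e.g. two layers of three LLM "agents" (callables)
        references = [agent(prompt, references) for agent in layer]
    return aggregator(prompt, references)  # final agent synthesizes the last layer's outputs
```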
How MoA Differs from Traditional Approaches
Traditional LLM approaches typically involve a single model trained on extensive data to handle various tasks. While effective, these models face limitations in scalability and specialization. Scaling up a single model is costly and time-consuming, often requiring retraining on massive datasets. In contrast, MoA capitalizes on the inherent strengths of multiple LLMs, distributing tasks among specialized agents and iteratively refining their outputs. This not only improves performance but also offers a cost-effective and scalable solution.
The Collaborativeness of LLMs
Concept of Collaborativeness in LLMs
A key insight driving the MoA methodology is the concept of collaborativeness among LLMs. This refers to the phenomenon where LLMs generate better responses when they can reference outputs from other models. This collaborativeness is evident even when the auxiliary responses are of lower quality than what an individual LLM could produce independently. By leveraging this phenomenon, MoA enhances the overall response quality through iterative refinement and synthesis.
Benefits of Collaborative LLM Responses
The collaborativeness of LLMs offers several benefits. Firstly, it allows for the integration of diverse perspectives and strengths, leading to more robust and comprehensive responses. Secondly, it mitigates the limitations of individual models, as the collective expertise can cover a broader range of tasks and scenarios. Lastly, this collaborative approach improves the adaptability and flexibility of LLMs, enabling them to handle complex and varied inputs more effectively.
Install our MoA pipelines for Groq
Create a folder called moa:
mkdir moa
cd moa
Step 1: Install open-webui
Clone the repository and navigate into it:
git clone https://github.com/open-webui/open-webui.git
cd open-webui
Create and activate a virtual environment:
# Windows
py -m venv .venv
.venv\Scripts\activate
# Linux
python3 -m venv venv
source venv/bin/activate
Install the required packages:
pip install -r requirements.txt
Start open-webui:
open-webui serve
Wait until it starts and go to http://localhost:8080.
Step 2: Install Pipelines
Clone the pipelines repository and navigate into it:
git clone https://github.com/open-webui/pipelines.git
cd pipelines
Install the required packages:
pip install -r requirements.txt
Create an .env file in the pipelines folder. Use the template provided (the full listing is shown in the configuration section below) and replace the placeholders with your actual keys from Groq.
Run the appropriate script for your operating system:
# Windows
start.bat
# Linux
sh start.sh
Step 3: Configure and Use the MoA Groq Pipeline
Go to your open-webui server at http://localhost:8080.
Install from the GitHub URL:
Navigate to http://localhost:8080/admin/settings/ and use the GitHub URL to pull the MoA Groq code into the pipeline.
Go to the chat and start using it! Make sure you select “MoA Groq” from the dropdown box.
Our MoA Groq code accomplishes the goal of sending a prompt to 3 reference models, aggregating their responses, and then synthesizing a robust response using an aggregator agent.
Please see the YouTube video for the installation steps.
Key Components to Verify:
- Sending the prompt to 3 reference models.
- Aggregating the responses from the reference models.
- Using an aggregator agent to synthesize the responses into a robust response.
Below are a few notes on how the MoA pipeline works. The main configuration is found in your .env file.
GROQ_API_BASE_1=https://api.groq.com/openai/v1
GROQ_API_KEY_1="<use your own token from groq>"
GROQ_API_BASE_2=https://api.groq.com/openai/v1
GROQ_API_KEY_2="<use your own token from groq>"
GROQ_API_BASE_3=https://api.groq.com/openai/v1
GROQ_API_KEY_3="<use your own token from groq>"
# We are using API_BASE_ for the Aggregator
GROQ_API_BASE_4=https://api.groq.com/openai/v1
GROQ_API_KEY_4="<use your own token from groq>"
GROQ_API_KEY="<use your own token from groq>"
GROQ_DEFAULT_MAX_TOKENS=4096
GROQ_DEFAULT_TEMPERATURE=0.9
GROQ_DEFAULT_ROUNDS=1
GROQ_LAYERS=1
GROQ_AGENTS_PER_LAYER=3
GROQ_MULTITURN=True
GROQ_MODEL_AGGREGATE='llama3-70b-8192'
GROQ_MODEL_AGGREGATE_API_BASE=${GROQ_API_BASE_1}
GROQ_MODEL_AGGREGATE_API_KEY=${GROQ_API_KEY_1}
GROQ_MODEL_REFERENCE_1='llama3-8b-8192'
GROQ_MODEL_REFERENCE_1_API_BASE=${GROQ_API_BASE_2}
GROQ_MODEL_REFERENCE_1_API_KEY=${GROQ_API_KEY_2}
GROQ_MODEL_REFERENCE_2='gemma-7b-it'
GROQ_MODEL_REFERENCE_2_API_BASE=${GROQ_API_BASE_3}
GROQ_MODEL_REFERENCE_2_API_KEY=${GROQ_API_KEY_3}
GROQ_MODEL_REFERENCE_3='mixtral-8x7b-32768'
GROQ_MODEL_REFERENCE_3_API_BASE=${GROQ_API_BASE_4}
GROQ_MODEL_REFERENCE_3_API_KEY=${GROQ_API_KEY_4}
Note that the same .env file serves both pipelines.
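For orientation, here is a rough sketch of how a pipeline might load these values with python-dotenv and os.getenv. The variable names follow the .env template above, but the dict keys and structure are assumptions; the actual pipeline code may organize this differently.

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env in the pipelines folder; ${VAR} references are expanded

GROQ_DEFAULT_MAX_TOKENS = int(os.getenv("GROQ_DEFAULT_MAX_TOKENS", "4096"))
GROQ_DEFAULT_TEMPERATURE = float(os.getenv("GROQ_DEFAULT_TEMPERATURE", "0.9"))
GROQ_DEFAULT_ROUNDS = int(os.getenv("GROQ_DEFAULT_ROUNDS", "1"))
GROQ_MULTITURN = os.getenv("GROQ_MULTITURN", "True").lower() == "true"

# One dict per reference model; the aggregator model would be built the same way from
# GROQ_MODEL_AGGREGATE / _API_BASE / _API_KEY. The dict keys here are illustrative.
reference_models = [
    {
        "name": os.getenv(f"GROQ_MODEL_REFERENCE_{i}"),
        "api_base": os.getenv(f"GROQ_MODEL_REFERENCE_{i}_API_BASE"),
        "api_key": os.getenv(f"GROQ_MODEL_REFERENCE_{i}_API_KEY"),
    }
    for i in range(1, int(os.getenv("GROQ_AGENTS_PER_LAYER", "3")) + 1)
]
```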
Code Analysis:
1. Sending the Prompt to 3 Reference Models
The prompt is sent to the 3 reference models in the process_layer function:
async def process_layer(self, data, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS):
    logger.info(f"Processing layer with {len(self.reference_models)} agents")
    responses = []
    for i in range(len(self.reference_models)):
        # Pick the next reference model in round-robin order.
        model_info = self.reference_models[self.current_model_index]
        self.rotate_agents()
        logger.info(f"Agent {i+1}: Using model {model_info['name']}")
        # Send this agent's instruction to its model.
        response = await self.process_fn(
            {"instruction": data["instruction"][i]},
            model_info=model_info,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        responses.append(response["output"])
    return responses
- process_layer iterates through the reference_models and sends the prompt to each one using the process_fn function.
- The responses from each reference model are collected in the responses list.
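Two helpers used here, rotate_agents and process_fn, are not shown in this excerpt. Conceptually they amount to something like the following sketch, which is an assumption based on how process_layer calls them, not the pipeline's exact code:

```python
def rotate_agents(self):
    # Advance the round-robin pointer so each call picks the next reference model.
    self.current_model_index = (self.current_model_index + 1) % len(self.reference_models)

async def process_fn(self, item, model_info, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS):
    # Generate one reference answer for this agent and wrap it the way process_layer expects.
    output = await self.generate_with_references(
        model_info=model_info,
        messages=item["instruction"],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return {"output": output}
```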
2. Aggregating the Responses from Reference Models
The responses are aggregated in the aggregate_responses function:
def aggregate_responses(self, responses: List[str]) -> str:
    aggregated_response = "\n".join(responses)
    return aggregated_response
- This function takes the list of responses and joins them into a single aggregated response.
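For example, three reference answers simply become one newline-separated block:

```python
responses = [
    "Answer from llama3-8b-8192 ...",
    "Answer from gemma-7b-it ...",
    "Answer from mixtral-8x7b-32768 ...",
]
print("\n".join(responses))  # same result as pipeline.aggregate_responses(responses)
```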
3. Using an Aggregator Agent to Synthesize the Responses
The aggregated responses are used to generate a robust response using the aggregator agent in the call_aggregator_model and generate_with_references functions:
async def call_aggregator_model(self, aggregated_responses, messages):
    aggregated_message = [{"role": "user", "content": aggregated_responses}]
    final_response = await self.generate_together(self.model_aggregate, aggregated_message)
    return final_response
- This function sends the aggregated responses to the aggregator model and returns the synthesized response.
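generate_together itself is not listed in this excerpt. Since Groq exposes an OpenAI-compatible endpoint, it boils down to an async chat-completion call, roughly like the sketch below; the use of the openai client and the model_info keys api_base and api_key are assumptions for illustration.

```python
from openai import AsyncOpenAI  # any OpenAI-compatible client works against Groq's endpoint

async def generate_together(self, model_info, messages, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS):
    # Call the OpenAI-compatible chat completions endpoint for the given model.
    client = AsyncOpenAI(base_url=model_info["api_base"], api_key=model_info["api_key"])
    completion = await client.chat.completions.create(
        model=model_info["name"],
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content
```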
async def generate_with_references(self, model_info, messages, references=[], max_tokens=GROQ_DEFAULT_MAX_TOKENS, temperature=GROQ_DEFAULT_TEMPERATURE):
    if len(references) > 0:
        messages = self.inject_references_to_messages(messages, references)
    logger.info(f"Generating with references for model {model_info['name']}")
    return await self.generate_together(model_info, messages=messages, temperature=temperature, max_tokens=max_tokens)
- The generate_with_references function ensures that the references are included in the messages before generating the final response.
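inject_references_to_messages is likewise not shown here. Following the MoA paper's aggregate-and-synthesize idea, a plausible implementation prepends a system message that carries the numbered reference answers; the prompt wording below is illustrative, not the pipeline's exact text.

```python
def inject_references_to_messages(self, messages, references):
    # Prepend a system message carrying the numbered reference answers.
    system_prompt = (
        "You have been provided with responses from several models to the user's query. "
        "Synthesize them into a single, accurate, and comprehensive answer.\n\n"
        + "\n".join(f"{i + 1}. {ref}" for i, ref in enumerate(references))
    )
    return [{"role": "system", "content": system_prompt}] + list(messages)
```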
Full Workflow Verification in the run_pipeline Function
The run_pipeline function orchestrates the full workflow:
async def run_pipeline(self, user_message, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS, rounds=GROQ_DEFAULT_ROUNDS, multi_turn=GROQ_MULTITURN):
    # One instruction list per reference model.
    data = {
        "instruction": [[] for _ in range(len(self.reference_models))],
        "model_info": self.reference_models,
    }
    if multi_turn:
        for i in range(len(self.reference_models)):
            data["instruction"][i].append({"role": "user", "content": user_message})
    else:
        data["instruction"] = [[{"role": "user", "content": user_message}]] * len(self.reference_models)
    self.messages.append({"role": "user", "content": user_message})

    # Query the reference models for the configured number of rounds.
    for i_round in range(rounds):
        logger.info(f"Starting round {i_round + 1} of processing.")
        responses = await self.process_layer(data, temperature, max_tokens)
        logger.info(f"Responses after Round {i_round + 1}:")
        for i, response in enumerate(responses):
            logger.info(f"Model {self.reference_models[i]['name']}: {response[:50]}...")

    # Aggregate the reference answers and query the aggregator model.
    logger.info("Aggregating results & querying the aggregate model...")
    aggregated_responses = self.aggregate_responses(responses)
    output = await self.generate_with_references(
        model_info=self.model_aggregate,
        temperature=temperature,
        max_tokens=max_tokens,
        messages=self.messages,
        references=responses,
    )
    logger.info(f"Final answer from {self.model_aggregate['name']}")
    logger.info("Output received from generate_with_references:")
    logger.info(output)

    if multi_turn:
        for i in range(len(self.reference_models)):
            data["instruction"][i].append({"role": "assistant", "content": output})
    self.messages.append({"role": "assistant", "content": output})
    return output
- The run_pipeline function sends the user message to the reference models.
- Aggregates their responses.
- Uses the aggregator model to synthesize the final response.
- The synthesized response is returned and appended to the conversation history.
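Putting it all together, calling the pipeline from a standalone script would look roughly like this; the MoAGroqPipeline class name is illustrative, so substitute the class actually exported by the pipeline file:

```python
import asyncio

async def main():
    # Illustrative class name; the .env above supplies the models and keys.
    pipeline = MoAGroqPipeline()
    answer = await pipeline.run_pipeline("Explain the Mixture-of-Agents methodology in two sentences.")
    print(answer)

asyncio.run(main())
```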
Conclusion
You can customize the number of reference models, but it works best when you send the prompt to 3 reference models, aggregate their responses, and use an aggregator agent to synthesize a robust response.