Open-Webui Mixture of Agents part 2
YouTube Video: https://youtu.be/KxT7lHaPDJ4
Introduction
The Mixture-of-Agents (MoA) methodology has demonstrated state-of-the-art performance using open-source models, as detailed in my previous blog. We have created two pipelines (Groq and Ollama) for Open-WebUI. These pipelines serve as versatile, UI-agnostic OpenAI-compatible plugin frameworks.
In this blog, we will demonstrate how MoA can be integrated into Open WebUI, an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. Open WebUI supports various LLM runners, including Ollama and OpenAI-compatible APIs. Our project aims to incorporate MoA into this robust platform, bringing it to state-of-the-art standards.
GitHub repository for Pipelines: https://github.com/open-webui/pipelines
GitHub repository for Open WebUI: https://github.com/open-webui/open-webui
Understanding the Mixture-of-Agents (MoA) Methodology
Definition and Concept
The Mixture-of-Agents (MoA) methodology is an innovative approach that leverages the strengths of multiple LLMs to enhance overall performance. Unlike traditional single-model approaches, MoA constructs a layered architecture where each layer consists of multiple LLM agents. Each agent processes the outputs from the previous layer’s agents as auxiliary information, iteratively refining the responses. This collaborative framework allows MoA to achieve state-of-the-art performance by combining the diverse capabilities of different LLMs.
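To make the layered flow concrete, here is a minimal, purely illustrative Python sketch; the function and agent names are invented for illustration and are not part of our pipelines:

```python
# Purely illustrative sketch of the layered MoA flow; names are invented, not pipeline APIs.
def moa_answer(prompt, layers, aggregator):
    """Each layer's agents see the previous layer's outputs as auxiliary references."""
    references = []
    for layer in layers:                   # e.g. two layers of three LLM "agents" (callables)
        references = [agent(prompt, references) for agent in layer]
    return aggregator(prompt, references)  # final agent synthesizes the last layer's outputs
```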
How MoA Differs from Traditional Approaches
Traditional LLM approaches typically involve a single model trained on extensive data to handle various tasks. While effective, these models face limitations in scalability and specialization. Scaling up a single model is costly and time-consuming, often requiring retraining on massive datasets. In contrast, MoA capitalizes on the inherent strengths of multiple LLMs, distributing tasks among specialized agents and iteratively refining their outputs. This not only improves performance but also offers a cost-effective and scalable solution.
The Collaborativeness of LLMs
Concept of Collaborativeness in LLMs
A key insight driving the MoA methodology is the concept of collaborativeness among LLMs. This refers to the phenomenon where LLMs generate better responses when they can reference outputs from other models. This collaborativeness is evident even when the auxiliary responses are of lower quality than what an individual LLM could produce independently. By leveraging this phenomenon, MoA enhances the overall response quality through iterative refinement and synthesis.
Benefits of Collaborative LLM Responses
The collaborativeness of LLMs offers several benefits. Firstly, it allows for the integration of diverse perspectives and strengths, leading to more robust and comprehensive responses. Secondly, it mitigates the limitations of individual models, as the collective expertise can cover a broader range of tasks and scenarios. Lastly, this collaborative approach improves the adaptability and flexibility of LLMs, enabling them to handle complex and varied inputs more effectively.
Install our MoA pipelines for Groq
Create a folder called moa:
mkdir moa
cd moa
Step 1: Install open-webui
Clone the repository and navigate into it:
git clone https://github.com/open-webui/open-webui.git
cd open-webui
Create and activate a virtual environment:
# Windows
py -m venv .venv
.venv\Scripts\activate
# Linux
python3 -m venv venv
source venv/bin/activate
Install the required packages:
pip install -r requirements.txt
Start open-webui:
open-webui serve
Wait until it starts and go to http://localhost:8080.
Step 2: Install Pipelines
Clone the pipelines repository and navigate into it:
git clone https://github.com/open-webui/pipelines.git
cd pipelines
Install the required packages:
pip install -r requirements.txt
Create an .env file in the pipelines folder. Use the template provided (the full listing is shown in the configuration section below) and replace the placeholders with your actual keys from Groq.
Run the appropriate script for your operating system:
# Windows
start.bat
# Linux
sh start.sh
Step 3: Configure and Use the MoA Groq Pipeline
Go to your open-webui server at http://localhost:8080.
Install from the GitHub URL:
Navigate to http://localhost:8080/admin/settings/ and use the GitHub URL to pull the MoA Groq code into the pipeline.
Go to the chat and start using it! Make sure you select “MoA Groq” from the dropdown box.
Our MoA Groq code accomplishes the goal of sending a prompt to 3 reference models, aggregating their responses, and then synthesizing a robust response using an aggregator agent.
Please see the YouTube video for the installation steps.
Key Components to Verify:
- Sending the prompt to 3 reference models.
- Aggregating the responses from the reference models.
- Using an aggregator agent to synthesize the responses into a robust response.
Below are a few notes on how the MoA pipeline works. The main configuration is found in your .env file.
GROQ_API_BASE_1=https://api.groq.com/openai/v1
GROQ_API_KEY_1="<use your own token from groq>"
GROQ_API_BASE_2=https://api.groq.com/openai/v1
GROQ_API_KEY_2="<use your own token from groq>"
GROQ_API_BASE_3=https://api.groq.com/openai/v1
GROQ_API_KEY_3="<use your own token from groq>"
# We are using API_BASE_ for the Aggregator
GROQ_API_BASE_4=https://api.groq.com/openai/v1
GROQ_API_KEY_4="<use your own token from groq>"
GROQ_API_KEY="<use your own token from groq>"
GROQ_DEFAULT_MAX_TOKENS=4096
GROQ_DEFAULT_TEMPERATURE=0.9
GROQ_DEFAULT_ROUNDS=1
GROQ_LAYERS=1
GROQ_AGENTS_PER_LAYER=3
GROQ_MULTITURN=True
GROQ_MODEL_AGGREGATE='llama3-70b-8192'
GROQ_MODEL_AGGREGATE_API_BASE=${GROQ_API_BASE_1}
GROQ_MODEL_AGGREGATE_API_KEY=${GROQ_API_KEY_1}
GROQ_MODEL_REFERENCE_1='llama3-8b-8192'
GROQ_MODEL_REFERENCE_1_API_BASE=${GROQ_API_BASE_2}
GROQ_MODEL_REFERENCE_1_API_KEY=${GROQ_API_KEY_2}
GROQ_MODEL_REFERENCE_2='gemma-7b-it'
GROQ_MODEL_REFERENCE_2_API_BASE=${GROQ_API_BASE_3}
GROQ_MODEL_REFERENCE_2_API_KEY=${GROQ_API_KEY_3}
GROQ_MODEL_REFERENCE_3='mixtral-8x7b-32768'
GROQ_MODEL_REFERENCE_3_API_BASE=${GROQ_API_BASE_4}
GROQ_MODEL_REFERENCE_3_API_KEY=${GROQ_API_KEY_4}
Note that the same .env file serves both pipelines.
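For orientation, here is a rough sketch of how a pipeline might load these values with python-dotenv and os.getenv. The variable names follow the .env template above, but the dict keys and structure are assumptions; the actual pipeline code may organize this differently.

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env in the pipelines folder; ${VAR} references are expanded

GROQ_DEFAULT_MAX_TOKENS = int(os.getenv("GROQ_DEFAULT_MAX_TOKENS", "4096"))
GROQ_DEFAULT_TEMPERATURE = float(os.getenv("GROQ_DEFAULT_TEMPERATURE", "0.9"))
GROQ_DEFAULT_ROUNDS = int(os.getenv("GROQ_DEFAULT_ROUNDS", "1"))
GROQ_MULTITURN = os.getenv("GROQ_MULTITURN", "True").lower() == "true"

# One dict per reference model; the aggregator model would be built the same way from
# GROQ_MODEL_AGGREGATE / _API_BASE / _API_KEY. The dict keys here are illustrative.
reference_models = [
    {
        "name": os.getenv(f"GROQ_MODEL_REFERENCE_{i}"),
        "api_base": os.getenv(f"GROQ_MODEL_REFERENCE_{i}_API_BASE"),
        "api_key": os.getenv(f"GROQ_MODEL_REFERENCE_{i}_API_KEY"),
    }
    for i in range(1, int(os.getenv("GROQ_AGENTS_PER_LAYER", "3")) + 1)
]
```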
Code Analysis:
1. Sending the Prompt to 3 Reference Models
The prompt is sent to the 3 reference models in the process_layer function:
async def process_layer(self, data, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS):
    logger.info(f"Processing layer with {len(self.reference_models)} agents")
    responses = []
    for i in range(len(self.reference_models)):
        # Pick the next reference model in round-robin order.
        model_info = self.reference_models[self.current_model_index]
        self.rotate_agents()
        logger.info(f"Agent {i+1}: Using model {model_info['name']}")
        # Send this agent's instruction to its model.
        response = await self.process_fn(
            {"instruction": data["instruction"][i]},
            model_info=model_info,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        responses.append(response["output"])
    return responses
- process_layer iterates through the reference_models and sends the prompt to each one using the process_fn function.
- The responses from each reference model are collected in the responses list.
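Two helpers used here, rotate_agents and process_fn, are not shown in this excerpt. Conceptually they amount to something like the following sketch, which is an assumption based on how process_layer calls them, not the pipeline's exact code:

```python
def rotate_agents(self):
    # Advance the round-robin pointer so each call picks the next reference model.
    self.current_model_index = (self.current_model_index + 1) % len(self.reference_models)

async def process_fn(self, item, model_info, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS):
    # Generate one reference answer for this agent and wrap it the way process_layer expects.
    output = await self.generate_with_references(
        model_info=model_info,
        messages=item["instruction"],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return {"output": output}
```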
2. Aggregating the Responses from Reference Models
The responses are aggregated in the aggregate_responses function:
def aggregate_responses(self, responses: List[str]) -> str:
    aggregated_response = "\n".join(responses)
    return aggregated_response
- This function takes the list of responses and joins them into a single aggregated response.
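For example, three reference answers simply become one newline-separated block:

```python
responses = [
    "Answer from llama3-8b-8192 ...",
    "Answer from gemma-7b-it ...",
    "Answer from mixtral-8x7b-32768 ...",
]
print("\n".join(responses))  # same result as pipeline.aggregate_responses(responses)
```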
3. Using an Aggregator Agent to Synthesize the Responses
The aggregated responses are used to generate a robust response using the aggregator agent in the call_aggregator_model and generate_with_references functions:
async def call_aggregator_model(self, aggregated_responses, messages):
    aggregated_message = [{"role": "user", "content": aggregated_responses}]
    final_response = await self.generate_together(self.model_aggregate, aggregated_message)
    return final_response
- This function sends the aggregated responses to the aggregator model and returns the synthesized response.
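generate_together itself is not listed in this excerpt. Since Groq exposes an OpenAI-compatible endpoint, it boils down to an async chat-completion call, roughly like the sketch below; the use of the openai client and the model_info keys api_base and api_key are assumptions for illustration.

```python
from openai import AsyncOpenAI  # any OpenAI-compatible client works against Groq's endpoint

async def generate_together(self, model_info, messages, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS):
    # Call the OpenAI-compatible chat completions endpoint for the given model.
    client = AsyncOpenAI(base_url=model_info["api_base"], api_key=model_info["api_key"])
    completion = await client.chat.completions.create(
        model=model_info["name"],
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content
```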
async def generate_with_references(self, model_info, messages, references=[], max_tokens=GROQ_DEFAULT_MAX_TOKENS, temperature=GROQ_DEFAULT_TEMPERATURE):
    if len(references) > 0:
        messages = self.inject_references_to_messages(messages, references)
    logger.info(f"Generating with references for model {model_info['name']}")
    return await self.generate_together(model_info, messages=messages, temperature=temperature, max_tokens=max_tokens)
- The generate_with_references function ensures that the references are included in the messages before generating the final response.
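inject_references_to_messages is likewise not shown here. Following the MoA paper's aggregate-and-synthesize idea, a plausible implementation prepends a system message that carries the numbered reference answers; the prompt wording below is illustrative, not the pipeline's exact text.

```python
def inject_references_to_messages(self, messages, references):
    # Prepend a system message carrying the numbered reference answers.
    system_prompt = (
        "You have been provided with responses from several models to the user's query. "
        "Synthesize them into a single, accurate, and comprehensive answer.\n\n"
        + "\n".join(f"{i + 1}. {ref}" for i, ref in enumerate(references))
    )
    return [{"role": "system", "content": system_prompt}] + list(messages)
```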
Full Workflow Verification in the run_pipeline Function
The run_pipeline function orchestrates the full workflow:
async def run_pipeline(self, user_message, temperature=GROQ_DEFAULT_TEMPERATURE, max_tokens=GROQ_DEFAULT_MAX_TOKENS, rounds=GROQ_DEFAULT_ROUNDS, multi_turn=GROQ_MULTITURN):
    # One instruction list per reference model.
    data = {
        "instruction": [[] for _ in range(len(self.reference_models))],
        "model_info": self.reference_models,
    }
    if multi_turn:
        for i in range(len(self.reference_models)):
            data["instruction"][i].append({"role": "user", "content": user_message})
    else:
        data["instruction"] = [[{"role": "user", "content": user_message}]] * len(self.reference_models)
    self.messages.append({"role": "user", "content": user_message})

    # Query the reference models for the configured number of rounds.
    for i_round in range(rounds):
        logger.info(f"Starting round {i_round + 1} of processing.")
        responses = await self.process_layer(data, temperature, max_tokens)
        logger.info(f"Responses after Round {i_round + 1}:")
        for i, response in enumerate(responses):
            logger.info(f"Model {self.reference_models[i]['name']}: {response[:50]}...")

    # Aggregate the reference answers and query the aggregator model.
    logger.info("Aggregating results & querying the aggregate model...")
    aggregated_responses = self.aggregate_responses(responses)
    output = await self.generate_with_references(
        model_info=self.model_aggregate,
        temperature=temperature,
        max_tokens=max_tokens,
        messages=self.messages,
        references=responses,
    )
    logger.info(f"Final answer from {self.model_aggregate['name']}")
    logger.info("Output received from generate_with_references:")
    logger.info(output)

    if multi_turn:
        for i in range(len(self.reference_models)):
            data["instruction"][i].append({"role": "assistant", "content": output})
    self.messages.append({"role": "assistant", "content": output})
    return output
- The run_pipeline function sends the user message to the reference models.
- Aggregates their responses.
- Uses the aggregator model to synthesize the final response.
- The synthesized response is returned and appended to the conversation history.
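Putting it all together, calling the pipeline from a standalone script would look roughly like this; the MoAGroqPipeline class name is illustrative, so substitute the class actually exported by the pipeline file:

```python
import asyncio

async def main():
    # Illustrative class name; the .env above supplies the models and keys.
    pipeline = MoAGroqPipeline()
    answer = await pipeline.run_pipeline("Explain the Mixture-of-Agents methodology in two sentences.")
    print(answer)

asyncio.run(main())
```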
Conclusion
You can customize the number of reference models, but it works best when you send the prompt to 3 reference models, aggregate their responses, and use an aggregator agent to synthesize a robust response.