Web Navigation: Transferring LLM Skills To Smaller Models

Learning to Navigate: Transferring Web Interaction Capabilities from LLMs to SLMs

Navigating the web efficiently is a crucial skill, and this project explores how to transfer the web interaction capabilities of large language models (LLMs) to smaller language models (SLMs). Think of it as a brilliant professor teaching a student. The research is part of the Agents4Gov ecosystem, which emphasizes modularity, security, and local execution.

General Description

The core idea is to investigate how web navigation agents, powered by compact LLMs (≤12B parameters), can learn to mimic the behavior of teacher agents (400–600B parameters). We'll be using two powerful techniques: In-Context Learning (ICL) and supervised fine-tuning. This approach allows the smaller models to learn from the demonstrations and experiences of their larger counterparts, ultimately creating more efficient and deployable web agents.

The research will be conducted within the Agents4Gov framework. This ensures that our agents are not only effective but also secure, modular, and capable of running locally.


Work Plan and Subtasks

Let's break down the project into manageable steps. Each of these subtasks will be converted into individual issues for better tracking and management:

1. Literature Review and State of the Art

  • [SOTA] Review of LLM-based Navigation Agents:

    The objective is a thorough review of the current state of the art in LLM-based navigation agents, covering browser/OS agents and multimodal agents. This means identifying and analyzing recent publications, blog posts, and open-source projects that use large language models to control web browsers and operating systems, with particular attention to their architectures, training methodologies, and evaluation metrics. The review should examine how these agents handle common web tasks (form filling, information retrieval, task automation) and document their limitations, such as scalability, robustness, and the ability to generalize to new tasks, so that we can identify gaps in the existing research. The findings will be synthesized into a concise, well-organized document with a complete bibliography, serving as a shared reference for the team and informing the design of our own agent.

    • Deliverables:
      • Literature review document (tools/browseragent/docs/lit_review.md)
      • Comparative table (tools/browseragent/docs/tables/sota_agents.md)
      • Code: tools/browseragent/scripts/lit_review/build_sota_table.py
  • [Benchmarks] Mapping MiniWoB++, WebArena, and BrowserGym:

    Our goal is to map the benchmarks available for evaluating browser agents — MiniWoB++, WebArena, and BrowserGym — and determine which aspects of agent performance each is suited to measure. For each benchmark we will document task complexity, the types of interactions required, the evaluation metrics provided, scalability to agents of different sizes and architectures, and any limitations or biases that could skew results. The mapping should also record the availability of datasets, evaluation scripts, and other supporting resources, and it should be revisited as new benchmarks appear. The outcome is a comparative table that guides the choice of evaluation metrics for the rest of the project (a sketch of the spec-collection script follows the deliverables below).

    • Deliverables:
      • Comparative table (tools/browseragent/docs/tables/benchmarks.md)
      • Code: tools/browseragent/scripts/benchmarks/collect_bench_specs.py
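
As a sketch of what `collect_bench_specs.py` might look like, the snippet below assembles benchmark characteristics into the comparative markdown table. The spec fields and the example values are illustrative placeholders to be replaced with the actual findings of the review.

```python
"""Sketch of collect_bench_specs.py: emit a comparative benchmark table.

The spec fields and values below are illustrative placeholders, not final
assessments; the real script should pull them from the review notes.
"""
from pathlib import Path

BENCHMARKS = [
    {"name": "MiniWoB++", "tasks": "100+ synthetic widget tasks",
     "observation": "DOM / screenshot", "metric": "success rate"},
    {"name": "WebArena", "tasks": "realistic multi-site tasks",
     "observation": "accessibility tree / HTML", "metric": "functional correctness"},
    {"name": "BrowserGym", "tasks": "unified wrapper over several suites",
     "observation": "DOM + screenshot", "metric": "per-suite reward"},
]

def to_markdown(rows: list[dict]) -> str:
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

if __name__ == "__main__":
    out = Path("tools/browseragent/docs/tables/benchmarks.md")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(to_markdown(BENCHMARKS) + "\n", encoding="utf-8")
    print(f"Wrote {out}")
```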

2. Implementation of the Teacher Agent

  • [Professor] Integrating browser-use to Agents4Gov:

    Our aim is to integrate the browser-use tool into the Agents4Gov framework so that agents can drive a web browser — filling forms, navigating sites, extracting information — inside a controlled, secure environment. The work involves adapting browser-use to the Agents4Gov architecture, exposing an API that other agents can call, and adding a security layer that prevents unauthorized access to the browser. The integration must be tested across browsers, operating systems, and network configurations, and documented (APIs, configuration options, security features) so that other researchers and developers can configure, deploy, and extend the module (a minimal wrapper sketch follows the deliverables below).

    • Deliverables:
      • Integrated module (tools/browseragent/agents4gov_integrations/browser_use/)
      • Code: tools/browseragent/agents4gov_integrations/browser_use/setup_browser_use.py
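
A minimal sketch of such a wrapper is shown below. The `BrowserUseTool` class and its domain allow-list are hypothetical, and while the `Agent(task=..., llm=...)` / `await agent.run()` calls follow browser-use's published quick-start, the exact API may differ between library versions.

```python
"""Sketch of an Agents4Gov wrapper around browser-use.

The wrapper class and allow-list policy are hypothetical; the browser-use
calls follow its quick-start (Agent(task=..., llm=...), await agent.run()),
but version differences may require adjustments.
"""
import asyncio

from browser_use import Agent                      # pip install browser-use
from langchain_openai import ChatOpenAI            # any supported LLM backend


class BrowserUseTool:
    """Exposes a single run_task() entry point to other Agents4Gov agents."""

    def __init__(self, model: str = "gpt-4o", allowed_domains: list[str] | None = None):
        self.llm = ChatOpenAI(model=model)
        # Simple security layer: refuse tasks that do not target an allowed domain.
        self.allowed_domains = allowed_domains or []

    def _check_task(self, task: str) -> None:
        if self.allowed_domains and not any(d in task for d in self.allowed_domains):
            raise PermissionError("Task does not target an allowed domain")

    async def run_task(self, task: str) -> str:
        self._check_task(task)
        agent = Agent(task=task, llm=self.llm)
        history = await agent.run()            # the agent's full trajectory
        return history.final_result()          # assumption: helper on the history object


if __name__ == "__main__":
    tool = BrowserUseTool(allowed_domains=["example.org"])
    print(asyncio.run(tool.run_task("Open example.org and return the page title")))
```
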
  • [Professor] Running MiniWoB++ with LLM 400–600B:

    We'll execute tasks from the MiniWoB++ benchmark with a large teacher model (400–600B parameters) and log every episode, producing a rich set of demonstrations for the student models to learn from. This requires setting up the MiniWoB++ environment, configuring the LLM with the instructions and context for each task, and capturing the agent's observations, actions, and rewards at every step in a structured, easily processed format; the data schema, logging frequency, and storage requirements all need to be settled up front. Logs must be anonymized and stored securely, and they must be validated — both by manual inspection against the agent's actual behavior and by automated consistency checks — before they are used for training (a minimal step-logger sketch follows the deliverables below).

    • Deliverables:
      • JSON/Markdown logs (tools/browseragent/data/teacher_logs/)
      • Code: tools/browseragent/benchmarks/miniwob/run_professor_minwob.py
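
The snippet below sketches one possible step logger. The JSONL schema, field names, and example action strings are illustrative, not the final log format.

```python
"""Sketch of the teacher-run logging mechanism (illustrative field names).

Each MiniWoB++ episode is appended to a JSONL file, one record per step,
so the logs can later be streamed when building the demonstration dataset.
"""
import json
import time
from pathlib import Path

LOG_DIR = Path("tools/browseragent/data/teacher_logs")


class EpisodeLogger:
    def __init__(self, task_name: str, model_name: str):
        LOG_DIR.mkdir(parents=True, exist_ok=True)
        self.path = LOG_DIR / f"{task_name}_{int(time.time())}.jsonl"
        self.task_name = task_name
        self.model_name = model_name
        self.step = 0

    def log_step(self, observation: str, action: str, reward: float, done: bool) -> None:
        record = {
            "task": self.task_name,
            "model": self.model_name,
            "step": self.step,
            "observation": observation,   # e.g. serialized DOM or accessibility tree
            "action": action,             # raw action string produced by the teacher
            "reward": reward,
            "done": done,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        self.step += 1


if __name__ == "__main__":
    logger = EpisodeLogger(task_name="click-button", model_name="teacher-400b")
    logger.log_step("<button id=subbtn>Submit</button>", "CLICK(subbtn)", 1.0, True)
    print(f"Wrote {logger.path}")
```
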
  • [Professor] Converting Logs into a Demonstration Dataset:

    This step converts the raw teacher logs into a normalized dataset suitable for training the student agent. We extract observations, actions, and rewards from each episode and organize them into a consistent, well-documented format that is representative of the tasks the student will be expected to perform. Key decisions include the data format (JSONL, CSV, or Parquet), how to process data that does not fit in memory (streaming or distributed processing), and how to handle missing or incomplete episodes — either discarding them or imputing missing values, depending on the characteristics of the data and the goals of training (a minimal conversion sketch follows the deliverables below).

    • Deliverables:
      • Normalized dataset (tools/browseragent/data/teacher_dataset/*.jsonl)
      • Code: tools/browseragent/data_prep/teacher_logs/convert_logs_to_dataset.py
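
A minimal sketch of the conversion script, assuming the illustrative log schema from the previous step, might look like this:

```python
"""Sketch of convert_logs_to_dataset.py (illustrative schema).

Reads the per-step teacher logs, drops incomplete episodes, and writes
normalized (prompt, completion) records ready for ICL retrieval or SFT.
"""
import json
from pathlib import Path

LOGS = Path("tools/browseragent/data/teacher_logs")
OUT = Path("tools/browseragent/data/teacher_dataset/demos.jsonl")


def episode_records(log_file: Path) -> list[dict]:
    steps = [json.loads(line) for line in log_file.read_text(encoding="utf-8").splitlines()]
    if not steps or not steps[-1]["done"]:
        return []  # discard incomplete episodes rather than imputing missing steps
    return [
        {
            "task": s["task"],
            "prompt": f"Task: {s['task']}\nObservation: {s['observation']}\nAction:",
            "completion": " " + s["action"],
            "reward": s["reward"],
        }
        for s in steps
    ]


if __name__ == "__main__":
    OUT.parent.mkdir(parents=True, exist_ok=True)
    with OUT.open("w", encoding="utf-8") as f:
        for log_file in sorted(LOGS.glob("*.jsonl")):
            for record in episode_records(log_file):
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
    print(f"Wrote {OUT}")
```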

3. Initial Evaluation with Compact Models

  • [Baseline SLM] Running MiniWoB++ without Fine-Tuning (≤12B):

    We'll run the same MiniWoB++ tasks with compact models (≤12B parameters) and no fine-tuning. This zero-shot baseline shows how the smaller models perform out of the box: each model is configured to interact with the environment, run on a predefined task set, and scored per task in a standardized format. The baseline lets us quantify the gains from in-context learning and SFT later on, and the per-task breakdown highlights the models' inherent strengths, weaknesses, and ability to generalize, which in turn guides where fine-tuning effort is most needed (a small results-aggregation sketch follows the deliverables below).

    • Deliverables:
      • Comparative report (tools/browseragent/reports/slm_baseline.md)
      • Code: tools/browseragent/benchmarks/miniwob/run_slm_zero_shot.py
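
A small sketch of the per-task aggregation is shown below; the results file and its fields are assumptions about how the zero-shot runs will be recorded.

```python
"""Sketch of per-task success-rate aggregation for the zero-shot baseline.

Assumes each run produced a JSONL file of {"task", "success"} records;
the path and field names are illustrative.
"""
import json
from collections import defaultdict
from pathlib import Path

RESULTS = Path("tools/browseragent/data/slm_zero_shot/results.jsonl")


def success_rates(path: Path) -> dict[str, float]:
    per_task: dict[str, list[bool]] = defaultdict(list)
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        per_task[record["task"]].append(bool(record["success"]))
    return {task: sum(runs) / len(runs) for task, runs in sorted(per_task.items())}


if __name__ == "__main__":
    for task, rate in success_rates(RESULTS).items():
        print(f"{task:30s} {rate:6.1%}")
```
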
  • [Baseline SLM] Analyzing Reasoning and Consistency:

    We'll compare the reasoning chains produced by the SLMs with those of the teacher agent to locate gaps in reasoning and consistency. This means analyzing the intermediate steps each model takes on the MiniWoB++ tasks, identifying where they diverge, and investigating the causes — for example by inspecting output probabilities or, where model internals are available, attention weights and hidden states. We will also measure how consistent each model's decisions are across tasks; inconsistent decisions suggest the model has not learned the underlying structure of the task. The resulting analysis tells us where the student needs to improve and informs both the ICL and the fine-tuning work (a small chain-comparison sketch follows the deliverables below).

    • Deliverables:
      • Analytical report (tools/browseragent/reports/rationale_gap.md)
      • Code: tools/browseragent/analysis/error_analysis/rationale_gap_report.py
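
As one possible starting point, the sketch below aligns a teacher and a student action sequence and reports step-level agreement and the first divergence point; the action strings are illustrative.

```python
"""Sketch of a reasoning-chain comparison (illustrative data layout).

Aligns the teacher's and the student's action sequences for a task and
reports step-level agreement plus the first step at which they diverge.
"""
from dataclasses import dataclass


@dataclass
class Comparison:
    agreement: float          # fraction of steps with identical actions
    first_divergence: int     # index of the first mismatching step, -1 if none


def compare_chains(teacher: list[str], student: list[str]) -> Comparison:
    n = min(len(teacher), len(student))
    matches = [teacher[i] == student[i] for i in range(n)]
    first_div = next((i for i, ok in enumerate(matches) if not ok), -1)
    if first_div == -1 and len(teacher) != len(student):
        first_div = n  # one chain stopped early
    agreement = sum(matches) / max(len(teacher), len(student))
    return Comparison(agreement=agreement, first_divergence=first_div)


if __name__ == "__main__":
    teacher = ["CLICK(tab-2)", "TYPE(query, 'weather')", "CLICK(search)"]
    student = ["CLICK(tab-2)", "CLICK(search)"]
    print(compare_chains(teacher, student))
```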

4. Development of the Student Agent

  • [Student] In-Context Learning (ICL) Pipeline:

    We'll implement an in-context learning pipeline that teaches the student with demonstrations from the teacher, improving its performance on the MiniWoB++ tasks. The pipeline has four stages: select demonstrations relevant to the current task (for example via similarity-based retrieval), format them into a prompt, run the SLM on that prompt, and decode the model's output into an executable action (rule-based parsing or a learned extractor). The selection and formatting stages matter most — giving the SLM the right examples in a clear prompt is what lets it pick up the structure of a task and generalize to new situations (a minimal retrieval-and-prompt sketch follows the deliverables below).

    • Deliverables:
      • Inference scripts (tools/browseragent/scripts/icl/)
      • Code: tools/browseragent/training/icl/run_icl_eval.py
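
The sketch below shows the retrieval and prompt-assembly stages under the assumption that demonstrations are stored as prompt/completion records; TF-IDF retrieval is just one simple choice of similarity measure.

```python
"""Sketch of the ICL pipeline's retrieval and prompt-assembly stages.

Uses TF-IDF cosine similarity to pick the k teacher demonstrations closest
to the current observation; the demonstration schema is illustrative.
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_demonstrations(query: str, demos: list[dict], k: int = 3) -> list[dict]:
    """Return the k demos whose prompts are most similar to the query."""
    corpus = [d["prompt"] for d in demos] + [query]
    matrix = TfidfVectorizer().fit_transform(corpus)
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(zip(scores, demos), key=lambda x: x[0], reverse=True)
    return [demo for _, demo in ranked[:k]]


def build_prompt(query: str, demos: list[dict]) -> str:
    """Concatenate the selected demonstrations ahead of the new observation."""
    shots = "\n\n".join(d["prompt"] + d["completion"] for d in demos)
    return f"{shots}\n\n{query}"


if __name__ == "__main__":
    demos = [
        {"prompt": "Task: click-button\nObservation: <button id=ok>OK</button>\nAction:",
         "completion": " CLICK(ok)"},
        {"prompt": "Task: enter-text\nObservation: <input id=name>\nAction:",
         "completion": " TYPE(name, 'Alice')"},
    ]
    query = "Task: click-button\nObservation: <button id=go>Go</button>\nAction:"
    print(build_prompt(query, select_demonstrations(query, demos, k=1)))
```
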
  • [Student] Preparing the Dataset for SFT:

    We'll prepare a subset of the teacher demonstrations for supervised fine-tuning (SFT). The subset must be representative of the target tasks and diverse enough to cover a wide range of scenarios; the data is then cleaned (normalization, deduplication, validation) and converted into the format expected by the SFT trainer, typically prompt/completion pairs with task metadata. The guiding principle is to give the SLM clear, consistent examples of the desired behavior (a minimal dataset-preparation sketch follows the deliverables below).

    • Deliverables:
      • SFT dataset (tools/browseragent/data/sft/)
      • Code: tools/browseragent/training/sft/prepare_sft_dataset.py
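
A minimal sketch of the preparation script, assuming the demonstration schema used above, could look like the following; the filtering rules and split ratio are placeholders.

```python
"""Sketch of prepare_sft_dataset.py (illustrative filters and split).

Deduplicates the teacher demonstrations, keeps only successful steps, and
writes train/validation splits in prompt/completion format.
"""
import json
import random
from pathlib import Path

SRC = Path("tools/browseragent/data/teacher_dataset/demos.jsonl")
OUT_DIR = Path("tools/browseragent/data/sft")


def load_clean(path: Path) -> list[dict]:
    seen, records = set(), []
    for line in path.read_text(encoding="utf-8").splitlines():
        r = json.loads(line)
        key = (r["prompt"], r["completion"])
        if r.get("reward", 0) <= 0 or key in seen:   # drop failures and duplicates
            continue
        seen.add(key)
        records.append({"prompt": r["prompt"], "completion": r["completion"]})
    return records


if __name__ == "__main__":
    records = load_clean(SRC)
    random.Random(42).shuffle(records)
    split = int(0.9 * len(records))
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for name, subset in [("train.jsonl", records[:split]), ("val.jsonl", records[split:])]:
        with (OUT_DIR / name).open("w", encoding="utf-8") as f:
            f.writelines(json.dumps(r, ensure_ascii=False) + "\n" for r in subset)
    print(f"{split} train / {len(records) - split} val examples")
```
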
  • [Student] SFT Training of the SLM (≤12B):

    We'll fine-tune the compact model (≤12B parameters) on the prepared dataset so that it learns to predict the teacher's actions from the same observations. Training follows the usual supervised loop — forward pass, loss against the teacher's action, backward pass — with standard optimizations such as learning-rate scheduling and regularization (e.g., dropout), and with a held-out validation set monitored to catch overfitting. The aim is a checkpoint that approaches the teacher's performance on MiniWoB++; in practice that depends on a sufficiently large, diverse dataset and careful tuning of the training hyperparameters (a minimal SFT configuration sketch follows the deliverables below).

    • Deliverables:
      • Final checkpoint (tools/browseragent/checkpoints/slm_sft/)
      • Code: tools/browseragent/training/sft/run_sft.py
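
One way to set this up is with Hugging Face TRL and a LoRA adapter, as sketched below. The base model, hyperparameters, and paths are placeholders, and the exact SFTTrainer/SFTConfig arguments vary across trl versions — treat this as a starting point, not a pinned recipe.

```python
"""Sketch of run_sft.py using Hugging Face TRL with a LoRA adapter.

Model name, hyperparameters, and dataset paths are placeholders; exact
SFTTrainer/SFTConfig arguments vary across trl versions.
"""
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

DATA_DIR = "tools/browseragent/data/sft"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"   # any ≤12B checkpoint

dataset = load_dataset(
    "json",
    data_files={"train": f"{DATA_DIR}/train.jsonl", "validation": f"{DATA_DIR}/val.jsonl"},
)

trainer = SFTTrainer(
    model=MODEL,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=SFTConfig(
        output_dir="tools/browseragent/checkpoints/slm_sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        eval_strategy="epoch",
        logging_steps=10,
    ),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)

trainer.train()
trainer.save_model("tools/browseragent/checkpoints/slm_sft")
```
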
  • [Student] Integrating the SFT Agent to Agents4Gov:

    The fine-tuned SLM will be integrated as the official "Web Agent" in Agents4Gov so that other agents in the ecosystem can leverage its capabilities. This involves wrapping the model in an agent class that implements the framework's required methods, defining a clear interface for its inputs and outputs, and registering the agent with the Agents4Gov registry so that other agents can discover and use it. The agent must be well documented, and it must be sandboxed so that it cannot be used to compromise the security of the rest of the system (a hypothetical wrapper sketch follows the deliverables below).

    • Deliverables:
      • Integrated module (tools/browseragent/agents/web_agent/)
      • Code: tools/browseragent/agents/web_agent/register_student_agent.py
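
The sketch below illustrates the shape of the wrapper and registration step. The Agents4Gov base types and registry shown here are hypothetical stand-ins; only the overall pattern (load the checkpoint, expose a single `act()` method, register under a name) is the point.

```python
"""Sketch of register_student_agent.py.

The AgentSpec dataclass and registry are hypothetical stand-ins for the
Agents4Gov interface; substitute the framework's real base classes.
"""
from dataclasses import dataclass


@dataclass
class AgentSpec:                      # hypothetical registry entry
    name: str
    description: str
    entry_point: object


class WebAgent:
    """Wraps the fine-tuned SLM checkpoint behind a single act() method."""

    def __init__(self, checkpoint: str = "tools/browseragent/checkpoints/slm_sft"):
        self.checkpoint = checkpoint
        self.model = None             # loaded lazily to keep framework start-up fast

    def _load(self):
        from transformers import pipeline
        self.model = pipeline("text-generation", model=self.checkpoint)

    def act(self, observation: str) -> str:
        if self.model is None:
            self._load()
        prompt = f"Observation: {observation}\nAction:"
        out = self.model(prompt, max_new_tokens=32)[0]["generated_text"]
        return out[len(prompt):].strip()


def register(registry: dict[str, AgentSpec]) -> None:
    registry["web_agent"] = AgentSpec(
        name="web_agent",
        description="Navigates web pages and fills forms on behalf of other agents.",
        entry_point=WebAgent,
    )
```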

5. Evaluation, Reports, and Publication

  • [Eval] Professor vs Student Comparison (Performance and Cost):

    We'll compare the Student and Professor agents on both performance and operational cost. Performance is measured on MiniWoB++ with metrics such as task completion rate, accuracy, and latency; cost is measured on representative hardware platforms using memory usage, CPU usage, and energy consumption. The results feed a report that quantifies the trade-off between model size and capability and discusses the practical implications of deploying the smaller model, which is valuable for anyone building or deploying web navigation agents (a small comparison-table sketch follows the deliverables below).

    • Deliverables:
      • Graphs and tables (tools/browseragent/reports/eval_prof_vs_student.md)
      • Code: tools/browseragent/eval/compare_professor_student.py
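
A small sketch of the comparison script is shown below; the metrics files and field names are assumptions about how each evaluation run will store its aggregate results.

```python
"""Sketch of compare_professor_student.py (illustrative metrics files).

Merges each agent's aggregate metrics into a markdown comparison table
for the report.
"""
import json
from pathlib import Path

FILES = {
    "Professor (400-600B)": Path("tools/browseragent/reports/professor_metrics.json"),
    "Student (<=12B, SFT)": Path("tools/browseragent/reports/student_metrics.json"),
}
METRICS = ["success_rate", "mean_latency_s", "peak_memory_gb", "energy_wh_per_episode"]


def build_table() -> str:
    rows = {name: json.loads(path.read_text(encoding="utf-8")) for name, path in FILES.items()}
    lines = ["| Agent | " + " | ".join(METRICS) + " |",
             "|---" * (len(METRICS) + 1) + "|"]
    for name, metrics in rows.items():
        values = " | ".join(str(metrics.get(m, "n/a")) for m in METRICS)
        lines.append(f"| {name} | {values} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_table())
```
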
  • [Writing] Generating Tables and Figures for the Article:

    We'll automate the export of tables and figures so that results are presented clearly and consistently in the final publication. Scripts will generate publication-ready artifacts directly from the experimental data, formatted to the target venue's style, so that every re-run of an experiment refreshes the corresponding table or figure without manual editing — saving time and reducing the risk of errors. The scripts should be documented, tested, and flexible enough to accommodate new result types as the project evolves (a minimal export sketch follows the deliverables below).

    • Deliverables:
      • Artifacts (tools/browseragent/paper/artifacts/)
      • Code: tools/browseragent/writing/paper/export_tables_figures.py
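
As a sketch, the script below regenerates one LaTeX table and one figure from an assumed per-task results CSV; the file layout and column names are placeholders.

```python
"""Sketch of export_tables_figures.py (illustrative data layout).

Rebuilds one figure and one LaTeX table from the evaluation results so the
paper artifacts stay in sync with the experiments.
"""
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

RESULTS = Path("tools/browseragent/reports/per_task_results.csv")   # columns: task,agent,success_rate
OUT = Path("tools/browseragent/paper/artifacts")


def export_all() -> None:
    OUT.mkdir(parents=True, exist_ok=True)
    df = pd.read_csv(RESULTS)

    # LaTeX table: one row per task, one column per agent.
    pivot = df.pivot(index="task", columns="agent", values="success_rate")
    pivot.to_latex(OUT / "success_rates.tex", float_format="%.2f")

    # Bar chart of the same pivot.
    ax = pivot.plot.bar(figsize=(8, 4), ylabel="Success rate")
    ax.figure.tight_layout()
    ax.figure.savefig(OUT / "success_rates.pdf")
    plt.close(ax.figure)


if __name__ == "__main__":
    export_all()
```
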
  • [Docs] README and Diagrams of Agent Modules:

    We'll document the pipeline and architecture with an updated README and diagrams. The README gives a high-level overview of the project, its goals and methodology, and instructions for setting up the environment and running the code; the diagrams show the agent modules, how they interact, and how data flows through the system. The documentation should be clear to both technical and non-technical readers, kept up to date as the project evolves, and treated as part of the regular development workflow, since it is what makes the work reproducible, facilitates collaboration, and lets others adopt and extend the agent.

    • Deliverables:
      • Updated README.md (tools/browseragent/README.md)
      • Diagrams (tools/browseragent/docs/diagrams/)
      • Code: tools/browseragent/docs/milestones/build_readme_and_diagrams.py

6. Infrastructure and Privacy

  • [Privacy] Validation of Local Execution and Data Audit:

    Local execution and data auditing are central to the privacy guarantees of Agents4Gov. We'll validate that every process in the pipeline can run on a local machine without external resources, and add checks that the agent sends no data to outside servers without explicit user consent. A data-audit log will record every piece of data the agent accesses or modifies — when, by whom, and what kind of access — and will be stored securely so that the agent's behavior can be monitored and potential privacy violations detected. The mechanism should be easy to use and adaptable to different data types and privacy requirements (a minimal audit-logger sketch follows the deliverables below).

    • Deliverables:
      • Compliance report (tools/browseragent/docs/privacy_audit.md)
      • Code: tools/browseragent/privacy/audit/local_exec_validator.py
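
The sketch below combines a simple audit logger with a socket-level guard that blocks non-local outbound connections; the allow-list policy and log schema are illustrative, not a complete privacy solution.

```python
"""Sketch of local_exec_validator.py (illustrative policy).

Two pieces: an audit log that records data accesses, and a socket-level
guard that blocks outbound connections to non-local hosts unless the user
has explicitly allow-listed them.
"""
import datetime
import json
import socket
from pathlib import Path

AUDIT_LOG = Path("tools/browseragent/privacy/audit/audit.jsonl")
ALLOWED_HOSTS = {"127.0.0.1", "localhost", "::1"}   # extend only with user consent


def audit(event: str, resource: str, actor: str = "browser_agent") -> None:
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "actor": actor,
        "event": event,        # e.g. "read", "write", "network_attempt"
        "resource": resource,
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


_original_connect = socket.socket.connect


def _guarded_connect(self, address):
    host = address[0] if isinstance(address, tuple) else str(address)
    audit("network_attempt", host)
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"Blocked outbound connection to {host}")
    return _original_connect(self, address)


def enforce_local_execution() -> None:
    """Install the guard; any non-local connection is logged and refused."""
    socket.socket.connect = _guarded_connect


if __name__ == "__main__":
    enforce_local_execution()
    audit("read", "tools/browseragent/data/sft/train.jsonl")
    print(f"Audit log at {AUDIT_LOG}")
```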

General Acceptance Criteria

  1. All scripts and deliverables must reside in tools/browseragent/.
  2. Each task should include associated Python code and minimal usage documentation.
  3. All results must be reproducible and auditable in local execution.
  4. Artifacts (logs, datasets, checkpoints, figures) must be versioned and traceable.
  5. Ultimately, the browser agent will become a tool within agents4gov.