Autonomous Agents

Autonomous Agents research papers. Updated daily. See also the Resources section.

Research papers: 2025 (1/3)

2025 (1/3), 2025 (2/3), 2025 (3/3), 2024, 2023, Earlier

Chronological order.

24th October 2025

DeepAgent: A General Reasoning Agent with Scalable Toolsets

DeepAgent: introduces an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process, utilizing Reasoning LLMs, an Auxiliary LLM, a Tool Retriever, a Tool Executor, a Memory Folding Module with Episodic, Working, and Tool Memories, Scalable Toolsets, and an Environment, trained with ToolPO.
The framework addresses long-horizon interactions and context length explosion through an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation.
DeepAgent employs ToolPO, an end-to-end reinforcement learning strategy leveraging LLM-simulated APIs and tool-call advantage attribution, to efficiently and stably teach general-purpose tool use.

REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

REMONI (REmote health MONItoring system): introduces an autonomous remote health monitoring system that integrates wearables, IoT, and MLLMs to collect, process, and analyze patient data, facilitating anomaly detection and natural language interaction for medical professionals.
The system utilizes wearable devices and cameras for data acquisition, edge devices for real-time anomaly detection, and cloud infrastructure for storage and computing, all orchestrated to provide timely alerts and historical data access.
Its NLP engine, powered by a General LLM and a Multimodal LLM, interprets caregiver inquiries, recognizes patient activity and emotion from visual data, and generates comprehensive responses, enhancing telehealth and reducing medical workload.

A Knowledge-Graph Translation Layer for Mission-Aware Multi-Agent Path Planning in Spatiotemporal Dynamics

Knowledge-Graph Translation Layer (KGTL): introduces a framework centered on a Knowledge Graph (KG) (central orchestrator) that functions as an intelligent translation layer, with a Data Plane (mission tensor compiler) and a Control Plane (coordination logic provider) to bridge the semantic gap between high-level mission objectives and low-level planner inputs for multi-agent path planning.
The framework compiles declarative facts into per-agent, mission-aware "worldviews" (Mission Tensors) and physics-aware traversal rules, which are then used by an Agnostic Path Planner (domain-unaware optimizer) and a Selector/Coordinator (plan deconflictor) to generate coordinated mission plans.
This architecture enables adaptive planning by allowing complex, coordinated paths to be modified simply by changing facts in the KG, supporting reactive replanning through incremental recompilation of affected artifacts.

OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields

OpenHype (Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields): introduces a novel framework for open-vocabulary segmentation on NeRFs, leveraging a CLIP Feature Extractor, a Hyperbolic Auto-encoder with an Encoder and Decoder, a NeRF Model with a NeRF Network, a Hyperbolic Latent Space, a Geodesic Path Traversal Module, a Text Query Prompt, and a Similarity Module to embed hierarchical structures in a continuous hyperbolic latent space.
This approach enables continuous traversal of scene hierarchies through geodesic paths, allowing for multi-scale responses to open-vocabulary queries without discrete levels or multiple rendering passes.
The framework demonstrates superior efficiency and adaptability in 3D scene understanding by naturally encoding multi-scale relationships and outperforming state-of-the-art methods on benchmarks.
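
The geodesic traversal described above can be illustrated with a small sketch. It assumes the Poincaré ball model of hyperbolic space, which this summary does not confirm; the function names are hypothetical and this is not OpenHype's implementation.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq = lambda x: sum(c * c for c in x)
    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq(diff) / ((1.0 - sq(u)) * (1.0 - sq(v)))
    return math.acosh(arg)

def geodesic_from_origin(v, t):
    """Point at fraction t (0..1) of the geodesic from the origin to v.

    Geodesics through the origin are straight radial lines in the ball,
    so only the radius has to be rescaled.
    """
    r = math.sqrt(sum(c * c for c in v))
    if r == 0.0:
        return list(v)
    # The hyperbolic distance from the origin is 2 * artanh(r); scale it by t.
    r_t = math.tanh(t * math.atanh(r))
    return [c * r_t / r for c in v]

# Assumed example: a fine-grained concept near the boundary, a coarse one halfway along.
leaf = [0.5, 0.0]
halfway = geodesic_from_origin(leaf, 0.5)
```

Moving `t` from 0 to 1 slides continuously between coarse and fine levels, which is the property the summary attributes to geodesic path traversal.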

HIKMA: Human-Inspired Knowledge by Machine Agents through a Multi-Agent Framework for Semi-Autonomous Scientific Conferences

HIKMA (Human-Inspired Knowledge by Machine Agents): introduces an end-to-end multi-agent framework for semi-autonomous scientific conferences, integrating AI-dataset curation, manuscript generation, peer review, revision, conference presentation, and archival dissemination.
The framework leverages LLMs, structured research workflows, and domain safeguards to support traditional scholarly practices while ensuring intellectual property protection, transparency, and integrity.
HIKMA functions as a testbed for AI-enabled scholarship, demonstrating how AI can act as an auditable partner in the entire research lifecycle, from hypothesis intake to publication.

DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance

DAO-AI (Decentralized Autonomous Organization - Artificial Intelligence): introduces an agentic AI framework for evaluating collective decision-making in decentralized governance, utilizing an Input Module, Data Preparation Stage, MCP Processing & Learning Layer, Decision Layer (LLM-based decision maker), Output Module, and Evaluation Layer.
The framework orchestrates multiple specialized Modular Composable Programs (MCPs) to fetch, analyze, and synthesize diverse governance data, including proposal metadata, forum discussions, voting dynamics, and market responses.
Built upon the Agentics framework, DAO-AI provides an LLM-based decision maker that interprets proposal contexts, retrieves historical data, and independently determines voting positions, offering interpretable and auditable signals for realistic DAO governance settings.

ASTABENCH: RIGOROUS BENCHMARKING OF AI AGENTS WITH A SCIENTIFIC RESEARCH SUITE

AstaBench: introduces a rigorous benchmarking suite for AI agents in scientific research, featuring a holistic measure of agentic ability, a reproducible environment with production-grade search tools, and a comprehensive suite of optimized agents and baselines.
The framework includes the Asta Environment for controlled evaluation, the agent-eval Agents Evaluation Toolkit for cost-aware reporting, and the AstaBench Leaderboard to account for confounding variables like tool usage and inference cost.
AstaBench evaluates 57 agents across 22 architectural classes on over 2400 problems spanning various scientific domains and tasks, revealing that AI still faces significant challenges in scientific research assistance.
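
One common way to report score and inference cost together is a Pareto frontier. The sketch below assumes that reporting style for illustration; the agent names and numbers are made up and the function is not part of the agent-eval toolkit.

```python
def pareto_frontier(results):
    """Return names of agents that no cheaper agent matches or beats on score.

    results: list of (name, cost, score) tuples; lower cost and higher score are better.
    """
    frontier = []
    best_score = float("-inf")
    # Walk agents from cheapest to most expensive (ties: higher score first).
    for name, cost, score in sorted(results, key=lambda r: (r[1], -r[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier

# Hypothetical results, for illustration only.
example = [
    ("agent_a", 0.10, 0.42),
    ("agent_b", 0.50, 0.40),
    ("agent_c", 1.20, 0.61),
    ("agent_d", 0.05, 0.20),
]
```

Here `agent_b` drops out because `agent_a` is both cheaper and more accurate.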

23rd October 2025

BUILDARENA: A PHYSICS-ALIGNED INTERACTIVE BENCHMARK OF LLMS FOR ENGINEERING CONSTRUCTION

BuildArena: introduces a physics-aligned interactive benchmark for LLMs in engineering construction, comprising Task Definition (defines construction goals), LLM-based Construction (including a Spatial Geometric Computation Library and an LLM Agentic Workflow), and Simulation-based Evaluation (powered by the Besiege Simulator); it enables LLMs to construct 3D structures from natural language instructions and evaluates their performance within a physically constrained environment.
The benchmark provides a highly customizable framework for in-depth comparison and analysis of LLMs, supporting extendable task design strategies across static and dynamic mechanics with multiple difficulty tiers.
It includes a 3D Spatial Geometric Computation Library for supporting construction based on language instructions and a baseline LLM agentic workflow for comprehensive evaluation of diverse model capabilities.

AGENTARCEVAL: AN ARCHITECTURE EVALUATION METHOD FOR FOUNDATION MODEL BASED AGENTS

AgentArcEval: introduces a novel architecture evaluation method for Foundation Model (FM)-based agents, addressing complexities of their compound architecture, autonomous behavior, and continuous evolution, utilizing a catalogue of agent-specific general scenarios to guide architectural analysis and decision-making.
The method builds on established ATAM principles, incorporating agent-specific artifacts and guardrails into the evaluation process to support early-stage analysis of quality trade-offs through structured, context-specific scenarios.
Demonstrated through a case study on the Luna tax copilot, AgentArcEval is applicable to various agentic systems and aims to evolve as a community-driven living document.

Learning Decentralized Routing Policies via Graph Attention-based Multi-Agent Reinforcement Learning in Lunar Delay-Tolerant Networks

GAT-MARL (Graph Attention-based Multi-Agent Reinforcement Learning): introduces a decentralized routing framework for multi-robot lunar exploration missions, utilizing a CTDE paradigm with a shared policy model, Q-network, target network, and DDQN for learning optimal routing actions based on local observations and a reward function.
The framework operates within a Lunar Delay Tolerant Network (LDTN) where autonomous rovers collect data, store packets in local buffers, and relay them to a lander, navigating intermittent connectivity and dynamic topologies.
The GAT-MARL model employs a 2-layer GAT with attention heads and an MLP head to process graph-structured state information, enabling scalable and robust communication strategies without global topology updates or packet replication.
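
The DDQN component mentioned above relies on the standard Double DQN target, in which the online Q-network selects the next action and the target network evaluates it. A minimal sketch with plain lists follows; the parameter names and default discount are assumptions, not values from the paper.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target for one transition.

    next_q_online / next_q_target: Q-values of the next state, one entry per action,
    from the online network and the target network respectively.
    """
    if done:
        return reward
    # The online network picks the action, the target network scores it.
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

In a routing setting, each action would correspond to a forwarding choice such as a neighbouring rover or the local buffer; that mapping is an illustrative assumption here.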

Designing Intent Communication for Agent-Human Collaboration

Design Space for Intent Communication: introduces a multidimensional design space for intent communication, structured along Transparency Level (what is communicated), Task Abstraction Level (when to communicate), and Communication Modality (how to communicate), to guide the development of generalizable, multi-modal communication strategies.
This design space is applied to three human-agent collaboration scenarios: bystander interaction, cooperative tasks, and shared control, demonstrating its capacity to generate adaptable and scalable communication strategies.
The framework bridges the gap between intent content and communication implementation, providing a foundation for designing safer, more intuitive, and transferable agent-human interactions.

ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

ComProScanner: introduces an autonomous multi-agent framework for composition-property structured data extraction from scientific literature, utilizing CrewAI, LLMs, RAG, and specialized agents for metadata retrieval, article collection, information extraction, and evaluation.
The framework extracts, validates, classifies, and visualizes machine-readable chemical compositions, properties, and synthesis data, integrating with publisher APIs and local PDFs to build comprehensive datasets.
Evaluated across 10 LLMs using 100 journal articles, ComProScanner achieved an overall accuracy of 0.82 with DeepSeek-V3-0324, demonstrating its capability to handle complex experimental data for machine learning applications.

GHOSTEI-BENCH: DO MOBILE AGENTS RESILIENCE TO ENVIRONMENTAL INJECTION IN DYNAMIC ON-DEVICE ENVIRONMENTS?

GhostEI-Bench: introduces a benchmark for mobile agents, including an Agent (mobile VLM agent), Attack vectors (adversarial threat categories), Representative Domains (diverse application contexts), Critical Risk Fields (potential security harms), Action Space (agent interaction capabilities), Judge LLM (evaluates agent behavior), Android Emulators (realistic mobile environment), Environment Controller (manages emulator, injects attacks), and Evaluation Module (assesses task outcomes).
This benchmark systematically evaluates mobile agent robustness against dynamic environmental injection attacks within fully operational Android emulators, assessing performance across critical risk scenarios.
GhostEI-Bench employs a novel LLM-based evaluation protocol for fine-grained failure analysis, identifying precise points of failure in perception, recognition, or reasoning.

UI-INS: ENHANCING GUI GROUNDING WITH MULTI-PERSPECTIVE INSTRUCTION-AS-REASONING

Instruction-as-Reasoning: introduces a novel SFT+RL framework for GUI grounding, leveraging a data pipeline, vision encoder, language model, SFT stage, RL stage, and GRPO to treat instructions as dynamic analytical pathways for optimal UI element selection.
The framework addresses instruction diversity and quality issues by augmenting data with multi-perspective instructions and enabling models to dynamically select the most effective reasoning pathway.
UI-Ins models, built on this framework, achieve state-of-the-art grounding accuracy across five benchmarks and demonstrate emergent reasoning capabilities, including combining perspectives and reasoning from novel angles.

From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL

AI-powered Multi-Agent Framework for Spatial Text-to-SQL: introduces a multi-agent system designed to accurately translate natural language questions into spatial SQL queries, integrating a knowledge base, context retrieval, and a collaborative pipeline of specialized LLM-powered agents.
The framework's core pipeline includes agents for entity extraction, metadata retrieval, query logic formulation, SQL generation, and a Review Agent for programmatic and semantic self-verification of generated SQL.
Supported by orchestration, memory, and a governance layer, the system enhances spatial analysis accessibility and provides a robust foundation for spatial Text-to-SQL systems, demonstrating self-improvement through recorded interactions.

22nd October 2025

BEYOND REACTIVITY: MEASURING PROACTIVE PROBLEM SOLVING IN LLM AGENTS

PROBE (Proactive Resolution of Bottlenecks): introduces a benchmark designed to test LLM agents' proactive problem-solving capabilities, encompassing searching for unspecified issues, identifying specific bottlenecks, and executing appropriate resolutions.
The benchmark evaluates agents across a pipeline including a World Model + User Datastore for information, Bottleneck identification, and Task Execution leading to Resolution, revealing that even state-of-the-art LLMs struggle with end-to-end proactive tasks.
The paper also details a data generation pipeline that constructs synthetic world models, bottlenecks, true positives, and distractors to create a realistic and challenging evaluation environment for proactive AI systems.

Review of Tools for Zero-Code LLM Based Application Development

Zero-Code LLM Platforms: introduces a comprehensive survey of recent zero-code LLM platforms, categorizing them by their LLM Backend, Interface Type, Output Type, Customization and Extensibility, Agent Support, Memory and Knowledge Integration, Workflow and Control Logic, API Integration and Tool Connectivity, and Multimodal and AI-Assisted Features.
The paper provides a taxonomy distinguishing between dedicated LLM-based app builders and general no-code platforms that integrate LLM capabilities, highlighting each platform's strengths and limitations.
While these platforms significantly lower the barrier to creating AI-powered applications, they still face challenges in flexibility, reliability, and scalability, and still require prompt engineering skills, yet they offer exciting opportunities for non-programmers.

AUTOMT: A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems

AUTOMT (A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems): introduces a multi-agent LLM framework, with M-Agent (extracts MRs from traffic rules), MR-RAG Database (stores, retrieves embedded MRs), T-Agent (analyzes test case context), and F-Agent (generates follow-up test cases), which automates MR extraction and follow-up test case generation for autonomous driving systems.
The framework leverages LLMs to extract diverse Metamorphic Relations from traffic rules, stores them in a RAG-based database, and uses vision-language models for scenario analysis and follow-up test case generation.
This modular architecture enhances test diversity, uncovers corner cases, and supports integration into industrial pipelines for systematic coverage of safety-critical scenarios in autonomous driving.
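
A metamorphic relation (MR) pairs a transformation of a source test case with a property that must hold between the source and follow-up outputs. The sketch below shows only that checking step, using a toy stand-in for the driving system; the class, the rain rule, and the numbers are assumptions, not AUTOMT's agents.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[dict], dict]      # builds the follow-up test case
    holds: Callable[[float, float], bool]  # relation between source and follow-up outputs

def check_mr(system, source_case, mr):
    """Run the system on the source and follow-up cases and check the relation."""
    source_out = system(source_case)
    follow_up_out = system(mr.transform(source_case))
    return mr.holds(source_out, follow_up_out)

def toy_ads(scene):
    """Toy stand-in for a driving system: returns a target speed."""
    speed = scene["speed_limit"]
    if scene.get("rain"):
        speed *= 0.8
    return speed

# Hypothetical MR derived from a traffic rule: rain must not increase speed.
rain_mr = MetamorphicRelation(
    name="adding rain must not increase speed",
    transform=lambda scene: {**scene, "rain": True},
    holds=lambda source, follow_up: follow_up <= source,
)
```

A violation of `holds` flags a candidate corner case without needing a ground-truth speed for either scene.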

SORA-ATMAS: Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities

SORA-ATMAS (Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities): introduces a principled governance framework integrating decentralized agentic intelligence with centralized oversight and dual-chain anchoring, featuring an SDIoT Architecture Layer (structural backbone) comprising an Application Layer (top-level intelligence/governance) with a SORA Governance Layer (central city-wide oversight) and an Agentic Layer (domain-specific autonomous agents), a Control Layer (manages communication/security), and a Perception Layer (collects real-time data).
The framework enables heterogeneous agents (Weather, Traffic, Safety) to operate autonomously while remaining accountable to city-wide policies, utilizing multiple LLMs (GPT, Grok, DeepSeek) for semantic reasoning and risk-trust assessments.
SORA-ATMAS ensures regulation-aligned, verifiable, and context-aware decision-making for smart cities, demonstrating robustness under high-risk conditions and efficient cross-domain interoperability.

Are Large Language Models Sensitive to the Motives Behind Communication?

LMVEF: introduces a comprehensive study evaluating whether LLMs possess motivational vigilance, utilizing a rational model as a normative benchmark and assessing LLMs across three experimental paradigms, including deliberate vs. incidental information discrimination, nuanced motivational vigilance, and generalization to naturalistic online settings.
The framework employs various LLMs (e.g., GPT-4o, Claude 3.5 Sonnet), different prompting methods (CoT, Direct, Steering), and compares LLM performance against human baselines using both controlled cognitive science data and real-world YouTube sponsorship transcripts.
LMVEF reveals that while LLMs demonstrate basic motivational vigilance in controlled settings, their performance significantly degrades in complex, naturalistic environments, though simple steering prompts can partially recover vigilance by emphasizing intentions and incentives.

AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing

AgentSense: introduces a hybrid, training-free framework for web-based participatory urban sensing, integrating a Classical Planner (generates initial baseline solutions) and a Multi-agent evolution system (iteratively refines solutions) with a Disturbance Parser (converts unstructured dynamic signals) and a Multi-agent refinement loop (LLM-powered iterative updates) comprising a Solver Agent (proposes solution updates), an Eval Agent (assesses solutions/provides feedback), a Memory Agent (accumulates reusable meta-operations), a Meta-operation database (stores historical operations), and a Verifier (ensures plan validity).
The framework adaptively refines task assignments to dynamic urban conditions and heterogeneous worker preferences, generating natural language explanations for enhanced transparency and trust.
AgentSense demonstrates distinct advantages in adaptivity, explainability, and robustness over traditional methods and single-agent LLM baselines, positioning it for deploying adaptive and explainable urban sensing systems on the web.

HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application

HSCodeComp (Harmonized System Code Competition): introduces a realistic, expert-level e-commerce benchmark for deep search agents, including Data Collection and Diversity Control (sourcing, filtering product data), Information Gathering (collecting product details), Structured Data Extraction (extracting core features), Related Result Search (querying customs databases), Hierarchical Decision Rules Application (applying expert tariff rules), HSCode Confirmation (validating codes officially), and Human Expert Validation (quality assurance by senior experts), designed to evaluate multi-hop reasoning with hierarchical tariff rules.
The benchmark comprises 632 product entries with human-annotated 10-digit Harmonized System Codes, reflecting real-world e-commerce data and challenges like noisy descriptions and complex rule logic.
Extensive experiments reveal a significant performance gap between state-of-the-art LLMs and human experts, highlighting the difficulty of precise hierarchical rule application.
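
Because Harmonized System codes are hierarchical, a prediction can be partly right: correct chapter, wrong subheading. The sketch below scores predictions by matching prefixes at several digit levels. The choice of levels is an assumption, and the code strings are illustrative, not validated tariff entries.

```python
def prefix_accuracy(predictions, gold, levels=(2, 4, 6, 8, 10)):
    """Fraction of predictions whose first n digits match the gold code, per level n."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and the same length")
    return {
        n: sum(p[:n] == g[:n] for p, g in zip(predictions, gold)) / len(gold)
        for n in levels
    }

# Illustrative 10-digit strings only.
gold_codes = ["8517130000", "6109100010"]
predicted_codes = ["8517140000", "6109100010"]
scores = prefix_accuracy(predicted_codes, gold_codes)
```

In this example both predictions agree with the gold codes through four digits, but only one survives at six digits and beyond.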

gem5 Co-Pilot: AI Assistant Agent for Architectural Design Space Exploration

gem5 Co-Pilot (AI Assistant Agent for Architectural Design Space Exploration): introduces an LLM-powered AI agent for automating computer architecture Design Space Exploration, integrating a DSE AI Agent, the gem5 Simulator/DSDB, and a Streamlit UI.
The DSE AI Agent, driven by an LLM and a state machine, dispatches gem5 configurations, analyzes simulation results, and leverages a Design Space Database for efficient exploration.
This framework significantly reduces the time and cost of identifying optimal architectural parameters by intelligently navigating design spaces and avoiding unnecessary simulations.
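
The idea of avoiding unnecessary simulations through a design space database can be sketched as a result cache keyed by configuration. The class name, the parameters `cores` and `l2_kb`, and the placeholder cost model are all assumptions; nothing here calls gem5.

```python
class DesignSpaceDB:
    """Caches simulation results so each configuration is simulated at most once."""

    def __init__(self, simulate):
        self._simulate = simulate
        self._results = {}
        self.simulations_run = 0

    def evaluate(self, config):
        # Sort items so that key order in the dict does not matter.
        key = tuple(sorted(config.items()))
        if key not in self._results:
            self._results[key] = self._simulate(config)
            self.simulations_run += 1
        return self._results[key]

def fake_simulate(config):
    """Placeholder cost model standing in for a real simulator run."""
    return 1000.0 / (config["l2_kb"] ** 0.5 * config["cores"])

db = DesignSpaceDB(fake_simulate)
first = db.evaluate({"cores": 2, "l2_kb": 256})
second = db.evaluate({"l2_kb": 256, "cores": 2})  # served from the cache
```

An agent exploring the space can then query the database freely and pay simulation cost only for configurations it has not seen.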

MODELING REALISTIC HUMAN BEHAVIOR USING GENERATIVE AGENTS IN A MULTIMODAL TRANSPORT SYSTEM: SOFTWARE ARCHITECTURE AND APPLICATION TO TOULOUSE

Generative Agent-based Multimodal Transport Simulation Framework: introduces a system for modeling realistic human mobility behavior, integrating GAMA Platform Simulation (interactive transport environment), Generative Agent (LLM-based decision-making core), LLM Model (generates context-aware plans), OpenTripPlanner (multimodal routing options), Data Exchange Pipeline (manages data flow), Population Data (agent initialization), and GTFS and Map Data (transport network information).
This framework enables generative agents to make context-aware transport decisions and form habits over time by leveraging LLMs for decision-making, GAMA for spatial simulation and visualization, and OpenTripPlanner for detailed multimodal routing.
The architecture separates spatial simulation from intelligent reasoning, allowing agents to adapt their future decisions based on evolving contexts and feedback, thereby advancing intelligent transportation systems and personalized mobility solutions.

AegisMCP: Online Graph Intrusion Detection for Tool-Augmented LLMs on Edge Devices

AegisMCP (Online Graph Intrusion Detection for Tool-Augmented LLMs on Edge Devices): introduces a protocol-level intrusion detector for Model Context Protocol (MCP)-driven smart homes, utilizing a NEBULA-Schema for representing agent activity as streaming heterogeneous temporal graphs.
The framework employs a multi-stage pipeline including data collection via MCP Proxy and network metadata, normalization, graph construction with Session DAGs, and a detector that fuses GraphSAGE-style edge behavior scores with DAG and novelty features.
Designed for edge devices, AegisMCP performs CPU-only, sub-second inference using ONNX INT8, enabling near-real-time detection of multi-step misuse and exfiltration attacks.

MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration

MSC-Bench (Multi-Server Tool Orchestration Benchmark): introduces a rigorous benchmark for evaluating LLM agents in multi-server tool orchestration, featuring an MCP Ecosystem, Servers, Tools, Equal Function Sets (EFS), and a Five-Level Curriculum.
The benchmark addresses gaps in existing evaluations by providing architectural realism, handling functional overlap with EFS, and offering a comprehensive end-to-end assessment across five complexity levels.
MSC-Bench systematically stress-tests agent capabilities from single-tool orchestration to complex cross-server planning and robustness, revealing systemic weaknesses in state-of-the-art agents and guiding future development.
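
Handling functional overlap means a prediction should not be penalised for choosing a tool that is interchangeable with the reference tool. The sketch below scores tool selection against Equal Function Sets under that reading. The tool names are hypothetical, the sets are assumed disjoint, and this is not the benchmark's actual scorer.

```python
def efs_accuracy(predicted, expected, equal_function_sets):
    """Count a prediction as correct if it lies in the same EFS as the expected tool."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("predicted and expected must be non-empty and the same length")
    set_of = {}
    for efs in equal_function_sets:
        members = frozenset(efs)
        for tool in members:
            set_of[tool] = members  # assumes each tool belongs to at most one EFS
    hits = 0
    for got, want in zip(predicted, expected):
        # Tools outside every EFS are only equivalent to themselves.
        if got in set_of.get(want, frozenset([want])):
            hits += 1
    return hits / len(expected)

# Hypothetical tools on different servers with the same function.
efs_list = [{"maps.geocode", "geo.lookup"}]
accuracy = efs_accuracy(
    predicted=["geo.lookup", "files.read"],
    expected=["maps.geocode", "files.write"],
    equal_function_sets=efs_list,
)
```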

MONITORING LLM-BASED MULTI-AGENT SYSTEMS AGAINST CORRUPTIONS VIA NODE EVALUATION

MAS Graph Backpropagation (Multi-Agent System Graph Backpropagation): introduces a dynamic defense paradigm for LLM-based Multi-Agent Systems, utilizing Graph Reconstruction (MAS as DAG), Connection Extraction (signed network, edge contribution score), Node Contribution Determination (backward propagation, total score calculation), Malicious Agent Detection (thresholding on node contribution scores), and Graph Repair (communication edge removal) to monitor and defend against corruption attacks.
This technique models MAS communication as an information propagation problem over a signed graph, dynamically adjusting the graph topology to disrupt malicious communications and adapt to evolving attacks.
It leverages the efficiency of the chain rule in backpropagation to accurately identify harmful nodes or edges, significantly outperforming existing MAS defense mechanisms in detection accuracy and system resilience.
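
The pipeline above (signed edge scores, backward propagation, thresholding, edge removal) can be sketched on a toy DAG. The propagation rule used here, a chain-rule style sum of edge score times downstream contribution, is an assumption made for illustration and may differ from the paper's exact scoring.

```python
def node_contributions(edges, output_node):
    """edges: dict mapping (src, dst) -> signed contribution score of that message.

    The graph must be acyclic. The output node is given contribution 1.0 and
    scores are propagated backward along the edges.
    """
    children = {}
    nodes = {output_node}
    for (src, dst), weight in edges.items():
        children.setdefault(src, []).append((dst, weight))
        nodes.update((src, dst))
    memo = {output_node: 1.0}

    def contribution(node):
        if node not in memo:
            memo[node] = sum(w * contribution(dst) for dst, w in children.get(node, []))
        return memo[node]

    return {node: contribution(node) for node in nodes}

def flag_and_repair(edges, output_node, threshold=0.0):
    """Flag nodes scoring below the threshold and drop their outgoing edges."""
    scores = node_contributions(edges, output_node)
    flagged = {node for node, score in scores.items() if score < threshold}
    repaired = {edge: w for edge, w in edges.items() if edge[0] not in flagged}
    return flagged, repaired

# Toy system: agents a and b feed agent c, which produces the final output.
toy_edges = {("a", "c"): 0.8, ("b", "c"): -0.9, ("c", "out"): 1.0}
flagged, repaired = flag_and_repair(toy_edges, "out")
```

Agent `b` ends up with a negative total contribution, so its outgoing communication edge is removed while the rest of the graph is kept.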

AGENTICMATH: ENHANCING LLM REASONING VIA AGENTIC-BASED MATH DATA GENERATION

AgenticMath: introduces a novel agentic pipeline for generating high-quality mathematical question-answer pairs, including Seed Question Filter, Agentic Question Rephrase, Answer Augment, and Question and Answer Evaluation stages.
This multi-agent framework leverages LLMs for generation, evaluation, and coordinated decision-making, enforcing quality control at every stage of mathematical data generation to enhance LLM reasoning.
AgenticMath generates data-efficient, high-quality datasets (30K-90K samples) that achieve competitive or superior performance compared to baselines trained on much larger datasets (400K-2.3M samples).

DaMo: Data Mixing Optimizer in Fine-Tuning Multimodal LLMs for Mobile Phone Agents

DaMo (Data Mixture Optimizer): introduces a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio, including Data Mixing Space (all possible mixture combinations), Data Mixture Sampling (selects subset of mixtures), Small MLLM Training/Evaluation (initial model performance assessment), Downstream Task Performance Metrics (quantifies task performance), MLP-based DaMo (predicts performance from mixture), Optimal Data Mixture Extrapolation (identifies best data mixture), Larger MLLM Training (applies optimal mixture), and DaMo Extension/Alignment (adapts to other MLLMs).
The framework addresses the challenge of determining optimal training data compositions for multitask supervised fine-tuning (SFT) of MLLMs, which existing approaches struggle with.
DaMo achieves significant performance improvements on both a new specialized benchmark, PhoneAgentBench, and general benchmarks, demonstrating robust scalability and generalization across different MLLM architectures.

Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties

The Multi-agent LLM social media conversation framework: introduces a multi-agent LLM simulation platform for social media conversations, including persona creation, social media simulation, conversation room, reward structures, and tie formation mechanisms.
This framework enables LLM agents to repeatedly interact, evaluate one another, and adapt their behavior through in-context learning, accelerated by an optional coaching signal, to model human social behavior.
The framework utilizes behavioral reward functions (SOC, INF, PRE, COORD, EMO) and memory mechanisms to facilitate emergent social ties and network structures mirroring real online communities.

THEMCPCOMPANY: CREATING GENERAL-PURPOSE AGENTS WITH TASK-SPECIFIC TOOLS

TheMCPCompany: introduces a benchmark environment for evaluating general-purpose LLM agents, featuring self-hosted and Azure services, exposed through over 18,000 task-specific tools via MCP Servers, and includes MCPAgent as a baseline tool-calling agent with a Gateway MCP Server for tool retrieval and invocation.
This benchmark simulates complex enterprise environments, providing a realistic setting for studying LLM agents' ability to navigate large, heterogeneous tool collections and solve challenging real-world tasks.
The framework highlights the potential of task-specific tools for improving agent performance and reducing costs compared to browser-based agents, while also revealing challenges in tool retrieval and reasoning within complex environments.

From Specification to Service: Accelerating API-First Development Using Multi-Agent Systems

LLM-based Multi-Agent System: introduces a system that automates the API-first development of RESTful microservices, including a spec-generator agent (generates OpenAPI specification), code-generator agent (generates server code), JSON-cleaner agent (cleans JSON data), code-fixer agent (updates code with fixes), code-tester agent (manages containers, sends requests, analyzes logs), an underlying GPT-4o LLM, User interaction, and a Local Environment for execution.
The system creates OpenAPI specifications, generates server code, and refines it through a feedback loop that analyzes execution logs and error messages, enabling efficient issue detection and resolution.
This approach reduces development iterations and ensures functional, robust services by running code locally and providing context-aware feedback and automated fixes.
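
The feedback loop described above (generate, run locally, read the logs, fix, repeat) can be sketched with plain callables in place of the LLM agents. All function names and the toy `add` task are assumptions for illustration; the real system works on OpenAPI specifications and containerised services.

```python
def refine_until_passing(spec, generate, run_tests, fix, max_fixes=3):
    """Generate code, then alternate testing and fixing until the tests pass."""
    code = generate(spec)
    for fixes_applied in range(max_fixes + 1):
        passed, logs = run_tests(code)
        if passed:
            return code, fixes_applied
        if fixes_applied < max_fixes:
            code = fix(code, logs)  # the fixer sees the execution logs
    raise RuntimeError(f"still failing after {max_fixes} fixes")

# Toy stand-ins for the code-generator, code-tester, and code-fixer agents.
def toy_generate(spec):
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy first draft

def toy_run_tests(code):
    scope = {}
    exec(code, scope)  # acceptable here only because the input is our own toy string
    got = scope["add"](2, 3)
    return got == 5, f"add(2, 3) returned {got}, expected 5"

def toy_fix(code, logs):
    return code.replace("a - b", "a + b")

final_code, fixes = refine_until_passing("add two numbers", toy_generate, toy_run_tests, toy_fix)
```

Capping the number of fixes keeps the loop from running indefinitely when the fixer cannot resolve an error.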

SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets

SheetBrain: introduces a neuro-symbolic dual-workflow agent framework for accurate reasoning over tabular data, including an Understanding Module (global comprehension), an Execution Module (tool-augmented reasoning), and a Validation Module (iterative self-correction).
The framework enhances LLMs' ability to understand and reason over complex spreadsheets for both question answering and manipulation tasks by integrating symbolic code execution within a Python sandbox.
SheetBrain leverages a closed-loop feedback architecture, where the validation module provides improvement feedback to the execution module, ensuring robust, accurate, and interpretable performance across diverse spreadsheet scenarios.

See, Think, Act: Online Shopper Behavior Simulation with VLM Agents

See, Think, Act (Online Shopper Behavior Simulation with VLM Agents): introduces a framework for simulating human online shopper behavior using a VLM Agent, which processes Action History and Current Screen Observation to perform Rationale Generation and Next Action Prediction within a defined Action Space.
The framework leverages vision-language models to jointly process textual HTML and visual GUI screenshots, enabling more faithful and cognitively aligned simulations compared to text-only approaches.
It employs supervised fine-tuning and reinforcement learning with a hierarchical reward structure to enhance action prediction accuracy and generate interpretable rationales for user actions.

DISROUTER: DISTRIBUTED SELF-ROUTING FOR LLM SELECTIONS

DiSRouter (Distributed Self-Router): introduces a novel distributed self-routing framework for LLM selections, featuring a Routing Procedure (query flow through agents), a Self-Awareness Training Pipeline (enhances LLM self-assessment), and Scenario Adaptability (dynamic adjustment to user preferences).
This framework empowers each LLM agent to independently assess its competence and decide whether to answer a query or route it to another agent, moving away from centralized external routers.
The system's effectiveness is driven by a two-stage training pipeline (SFT and RL) that instills self-awareness and allows agents to adapt their collective behavior based on a user-defined preference factor (α) for performance or cost.
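
The answer-or-forward decision can be sketched as a chain of agents ordered from cheapest to most expensive. Using α directly as a confidence bar, with higher α favouring performance over cost, is an assumption made for illustration; so are the confidence functions, which stand in for trained self-assessment.

```python
def self_route(query, agents, alpha):
    """Return the name of the agent that ends up answering.

    agents: list of (name, confidence_fn), ordered cheapest to most expensive.
    alpha: assumed to lie in [0, 1]; larger values make agents defer more often.
    """
    for name, confidence in agents[:-1]:
        if confidence(query) >= alpha:
            return name  # this agent judges itself competent and answers
    return agents[-1][0]  # the last agent has nobody left to route to

# Hypothetical self-assessment functions.
toy_agents = [
    ("small", lambda q: 0.9 if len(q.split()) <= 5 else 0.3),
    ("medium", lambda q: 0.6),
    ("large", lambda q: 1.0),
]
long_query = "Prove that the sum of two odd integers is always even"
```

No central router appears in the loop: each decision uses only the current agent's own confidence.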

Defending Against Prompt Injection with DataFilter

DataFilter: introduces a test-time model-agnostic defense that removes malicious instructions from untrusted data before it reaches the backend LLM, utilizing a filter LLM, prompt, untrusted data, filtered data, backend LLM, SFT dataset, prompt template, and special tokens.
The filter LLM is trained via supervised fine-tuning on simulated injections to selectively strip adversarial content while preserving benign information.
This approach consistently reduces prompt injection attack success rates to near zero while maintaining LLM utility, offering a plug-and-play deployment for black-box commercial LLMs.

Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning

AdCo (Adaptive Coopetition): introduces a novel inference-time framework where LLM agents use an adaptive, UCB-based coopetition mechanism, leveraging coarse verifier signals to decide whether to collaborate or compete and iteratively refine reasoning based on peer feedback.
The framework enhances collective reasoning robustness by integrating model knowledge diversity and reasoning trace measures, promoting uncertainty-driven exploration, and isolating low-quality feedback through a customized filter mechanism.
AdCo operates in a multi-round process, with agents exchanging information via a PubSub channel, refining solutions, and converging on a final answer through majority voting, demonstrating significant performance gains on mathematical reasoning benchmarks.
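
As a rough illustration of a UCB-based choice between collaborating and competing, followed by majority voting, the sketch below uses the standard UCB1 rule. The action names, the reward bookkeeping, and the exploration constant `c` are assumptions for illustration; the paper's exact scoring of coarse verifier signals is not reproduced here.

```python
import math
from collections import Counter


def ucb_choice(counts, rewards, c=1.0):
    """Pick the action with the highest UCB1 score; untried actions go first."""
    total = sum(counts.values())
    best, best_score = None, -math.inf
    for action in counts:
        if counts[action] == 0:
            return action  # explore anything not yet tried
        mean = rewards[action] / counts[action]
        bonus = c * math.sqrt(2 * math.log(total) / counts[action])
        if mean + bonus > best_score:
            best, best_score = action, mean + bonus
    return best


def majority_vote(answers):
    """Return the most common final answer across agents."""
    return Counter(answers).most_common(1)[0][0]


counts = {"collaborate": 5, "compete": 5}
rewards = {"collaborate": 4.0, "compete": 1.0}  # hypothetical verifier-derived rewards
print(ucb_choice(counts, rewards))
print(majority_vote(["42", "41", "42"]))
```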

PLAGUE: Plug-and-Play Framework for Life-Long Adaptive Generation of Multi-Turn Exploits

PLAGUE (Plug-and-Play Framework for Life-Long Adaptive Generation of Multi-Turn Exploits): introduces a framework for designing multi-turn attacks that divides the attack lifetime into Planner, Primer, and Finisher phases and incorporates an Attacker LLM, Target LLM, Rubric Scorer, Summarizer, Lifelong Learner, and Evaluator Judge LLM (J).
This framework enables systematic and information-rich exploration of multi-turn attacks by maintaining goal relevance, evolving from feedback, and adaptively sampling diverse strategies.
PLAGUE achieves state-of-the-art jailbreaking results with high efficiency, significantly improving attack success rates across leading LLMs by leveraging smart initialization, context-building, and feedback incorporation.

A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks

Agentic System: introduces a tutorial on cognitive biases in LLM-powered 6G autonomous networks, covering the LLM-empowered Agent, Perception, Digital Twin (DT), Collective Memory, Network APIs, A2A Protocol, and Model Context Protocol (MCP) components, and providing a systematic overview of bias emergence, impact on agentic components, and mitigation strategies.
The paper details a taxonomy of cognitive biases, including their mathematical formulation and emergence in telecom systems, and identifies commonly impacted agentic components such as reasoning, planning, memory, negotiation, tool use, and actuation.
Two practical use-cases demonstrate the mitigation of anchoring, temporal, and confirmation biases in 6G inter-slice and cross-domain management, leading to improved latency and energy savings.

VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

VideoAgentTrek: introduces a scalable pipeline that automatically mines structured computer-use trajectories from unlabeled screen-recorded videos, leveraging Video Collection and Preprocessing, VIDEO2ACTION (Inverse Dynamics Module), and Agent Training components.
The VIDEO2ACTION module, an inverse dynamics system, extracts explicit action labels and parameters from implicit video demonstrations, including action event detection, parameterization, and inner monologue generation.
This framework enables large-scale computer-use pretraining by converting passive internet videos into high-quality supervision, significantly improving task success rates and step accuracy for computer-use agents.

21st October 2025

KAT-Coder Technical Report

KAT-Coder: introduces a large-scale agentic code model trained through a multi-stage curriculum, including Mid-Term Training (enhances reasoning, planning, reflection), Supervised Fine-Tuning (SFT) (constructs diverse dataset), Reinforcement Fine-Tuning (RFT) (optimizes policy with rewards), and Reinforcement Learning (RL) (adapts to production IDEs).
The framework addresses the gap between static text-based training and dynamic real-world agentic execution by progressively enhancing cognitive and operational competence.
KAT-Coder achieves robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for intelligent coding agents.

Fetch.ai: An Architecture for Modern Multi-Agent Systems

Fetch.ai Architecture: introduces a multi-layered architecture for modern multi-agent systems, integrating a Foundational Layer (underlying decentralized ledger, agent registry, naming service, economic token), a Development Layer (event-driven agent framework, SDK), a Deployment and Monitoring Layer (hosting, mailbox, monitoring, marketplace), and an Orchestration Layer (agentic LLM, task decomposition, search & discovery) to provide a robust, scalable, and decentralized platform.
This architecture addresses limitations of current LLM-based agent frameworks by providing on-chain trust, verifiable identities, standardized communication protocols, and economic coordination mechanisms for autonomous agents.
The framework enables the development, deployment, and operation of sophisticated multi-agent systems, allowing autonomous agents to securely discover, communicate, and transact in a decentralized marketplace.

VAPU: System for Autonomous Legacy Code Modernization

VAPU (Verifying Agent Pipeline Updater): introduces an LLM-based multi-agent system designed to autonomously modernize legacy web application code by updating files in phases, simulating a software development team.
The system employs a Manager agent, a Task pipeline (with Prompt maker and Execution agents), a Verification agent, and a Finalizer agent to process user requirements and iteratively refine code.
VAPU aims to provide a cost-effective solution for updating deprecated components, addressing challenges in legacy system maintenance, and improving code quality through self-division and self-feedback mechanisms.

Heterogeneous Adversarial Play in Interactive Environments

HAP (Heterogeneous Adversarial Play): introduces an adversarial Automatic Curriculum Learning (ACL) framework that formalizes teacher-student interactions as a minimax optimization, including a task-generating instructor, a problem-solving learner, an interactive environment, bidirectional learning feedback, an adversarial reward mechanism, student's behavioral history, a task selection distribution, and a student policy.
The framework enables a teacher agent to autonomously generate challenging tasks and adapt the curriculum based on real-time student performance, while a student agent strives to master these evolving challenges.
This co-evolutionary process dynamically balances task complexity against learner proficiency, fostering robust knowledge consolidation and effective exploration without requiring handcrafted curricula.

Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs

HMAPPO (Hierarchical Multi-Agent Proximal Policy Optimization): introduces a hierarchical multi-agent deep reinforcement learning framework for joint optimization of cooperation efficiency and communication covertness in underwater target detection using AUVs.
The framework decomposes the problem into macro-level AUV scheduling and micro-level AUV trajectory control, leveraging a Centralized Training with Decentralized Execution (CTDE) paradigm.
This approach enables adaptive covert cooperation while satisfying energy and mobility constraints, providing efficient and secure operation for multiple AUVs in dynamic underwater environments.

SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection

SentinelNet: introduces a decentralized framework for proactive threat detection and mitigation in multi-agent LLM systems, utilizing adversarial trajectory generation, contrastive learning-based detector training, and dynamic ranking with bottom-k elimination.
Each agent is equipped with a credit-based detector, trained on augmented adversarial debate trajectories, enabling autonomous evaluation of message credibility and dynamic neighbor ranking to suppress malicious communications.
The framework achieves near-perfect detection of malicious agents and recovers system accuracy from compromised baselines, demonstrating strong generalizability across domains and attack patterns.
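
Dynamic ranking with bottom-k elimination can be sketched as follows. The credit values are made-up numbers; in the paper they would come from the trained credit-based detector, and the function name `prune_neighbors` is hypothetical.

```python
def prune_neighbors(credits, k):
    """Rank neighbors by credit and drop the k lowest-ranked ones.

    credits: dict mapping neighbor id to a credibility score (higher is better).
    Returns the surviving neighbor ids, best first.
    """
    ranked = sorted(credits, key=credits.get, reverse=True)
    keep = max(0, len(ranked) - k)
    return ranked[:keep]


credits = {"agent_a": 0.9, "agent_b": 0.2, "agent_c": 0.7}  # hypothetical scores
print(prune_neighbors(credits, k=1))
```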

EffiReasonTrans: RL-Optimized Reasoning for Code Translation

EffiReasonTrans: introduces a training framework for code translation, integrating a Data Synthesis Stage (generates reasoning-augmented data), a Supervised Fine-Tuning Stage (initializes model with reasoning), and an Execution-Based Reinforcement Learning Stage (optimizes for accuracy/latency) to balance accuracy and inference latency.
The framework synthesizes high-quality reasoning-augmented data (EffiReasonTrans-Data) using a powerful Reasoning LLM (DeepSeek-R1), then fine-tunes a Base LLM (DeepSeek-R1-Distill-Qwen-1.5B), and finally applies reinforcement learning with a dual-objective Reward Strategy (Execution-based Reward and Length-based Reward).
This approach consistently improves translation accuracy while reducing generated tokens and inference latency, demonstrating effectiveness in multilingual and agent-based settings.
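
A dual-objective reward of this kind might look like the sketch below, assuming the execution-based term is the fraction of passed test cases and the length-based term decays linearly with the number of generated tokens. The weight `length_weight` and the cap `max_tokens` are hypothetical parameters, not values from the paper.

```python
def reward(passed: int, total: int, n_tokens: int,
           max_tokens: int = 2048, length_weight: float = 0.2) -> float:
    """Combine an execution-based reward with a length-based reward.

    Assumed form: pass rate plus a small bonus for shorter outputs.
    """
    exec_reward = passed / total if total else 0.0
    length_reward = max(0.0, 1.0 - n_tokens / max_tokens)
    return exec_reward + length_weight * length_reward


# Two equally correct translations: the shorter one scores higher.
print(reward(4, 4, n_tokens=512) > reward(4, 4, n_tokens=1536))
```

Keeping `length_weight` below 1 means a short but failing translation never outranks a long but fully passing one under this assumed form.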

An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design

POLYT5 (An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design): introduces a T5-based LLM, pre-trained on 100 million polymer structures in PSELFIES representation, enabling property prediction and targeted polymer generation, and integrated into an agentic AI framework for natural language interaction.
The framework leverages its fine-tuned property prediction models for thermal, electronic, and solubility properties, and generative design models to create hypothetical polymers conditioned on desired properties, such as glass transition temperature.
An agentic AI framework, featuring a general-purpose LLM as a controller and a Streamlit interface, enhances accessibility by automating input handling, format conversion, and model selection for both property prediction and generative design tasks.

Search Self-play: Pushing the Frontier of Agent Capability without Supervision

SSP (Search Self-play): introduces a self-evolving reinforcement learning approach for deep search agents, where a single LLM policy acts as both a question proposer and a problem solver, co-evolving their capabilities through competition and cooperation.
The proposer generates challenging search queries with verifiable ground-truth, while the solver attempts to answer them using multi-turn reasoning and external search tools.
The framework incorporates RAG verification, rule-based filtering, and a periodically reset replay buffer to ensure high-quality training tasks and stable co-evolution without human supervision.

Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

Tokencake: introduces a KV-Cache-centric serving framework for LLM-based multi-agent applications that co-optimizes scheduling and memory management through an agent-aware design to address KV Cache space contention and time underutilization, with Frontend API, Space Scheduler, Time Scheduler, Application Graph Definition, FuncNode, Performance Metadata, Dynamic Memory Partitioning, Hybrid Priority Metric, CPU Block Buffering, Gradual GPU Block Reservation, Event-Driven Offload, Predictive Upload, Benefit-Driven Policy, and Dynamic Forecasting Model components.
The framework utilizes a Frontend API to define multi-agent workflows as a Directed Acyclic Graph, enabling specialized schedulers to manage KV Cache lifecycle with application-level context.
The Space Scheduler employs dynamic memory partitioning and a hybrid priority metric to shield critical agents from contention, while the Time Scheduler uses proactive offload and predictive upload mechanisms to repurpose GPU memory during function call stalls.
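
To make the Directed Acyclic Graph idea concrete, the sketch below declares a small multi-agent workflow and derives a valid execution order with Python's standard `graphlib` module. The agent names and the dictionary layout are hypothetical and do not reflect Tokencake's actual Frontend API.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key runs only after the agents in its set finish.
workflow = {
    "planner": set(),
    "searcher": {"planner"},
    "coder": {"planner"},
    "reviewer": {"searcher", "coder"},
}

# static_order() yields the nodes so that every dependency precedes its dependents.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

With the graph known in advance, a scheduler can tell which agents are still waiting on predecessors, which is the kind of application-level context the summary refers to.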

CLASP: Cost-Optimized LLM-based Agentic System for Phishing Detection

CLASP (Cost-Optimized LLM-based Agentic System for Phishing Detection): introduces a novel multi-agent system for phishing detection that leverages LLM-based agents for URL, screenshot, and HTML analysis, combining their outputs to classify websites as phishing or legitimate.
The system processes URLs or QR codes, employing specialized LLM-based agents to evaluate various web resource aspects, and utilizes a Progressive Analysis strategy for cost-effective and accurate detection.
CLASP outperforms existing commercial solutions in recall and F1 score, demonstrating a robust and scalable approach for combating evolving cybersecurity threats while maintaining low operational costs.
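
One way to read the Progressive Analysis strategy is as an early-exit cascade: run the cheapest analysis first and only call further agents while the combined verdict is still uncertain. The sketch below assumes each agent returns a phishing probability and that scores are averaged; the thresholds, stage order, and function names are hypothetical.

```python
from statistics import mean


def progressive_classify(stages, low=0.2, high=0.8):
    """Run analysis stages in order and stop once the combined score is decisive.

    stages: list of (name, analyze) pairs, cheapest first; analyze() returns
    a phishing probability in [0, 1] (a stand-in for an LLM-based agent).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    scores, used = [], []
    for name, analyze in stages:
        scores.append(analyze())
        used.append(name)
        combined = mean(scores)
        if combined <= low or combined >= high:
            break  # confident enough, skip the more expensive stages
    label = "phishing" if combined >= 0.5 else "legitimate"
    return label, used


stages = [("url", lambda: 0.95), ("html", lambda: 0.5), ("screenshot", lambda: 0.5)]
print(progressive_classify(stages))
```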

QuantEvolve: Automating Quantitative Strategy Discovery through Multi-Agent Evolutionary Framework

QuantEvolve: introduces an evolutionary multi-agent framework for automating quantitative trading strategy discovery, integrating a multi-dimensional Feature Map, an Island Population for parallel evolution, and a multi-agent system comprising Data, Research, Coding, and Evaluation Teams to generate and refine strategies.
The framework leverages a hypothesis-driven multi-agent system to systematically explore the strategy space through iterative generation and evaluation, ensuring diverse and high-performing strategies adaptable to market shifts and investor preferences.
QuantEvolve maintains population diversity through a Feature Map that organizes strategies by attributes and employs an Evolutionary Database and Insight Repository to store and refine knowledge across generations.
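
A feature map that preserves diversity can be sketched as a dictionary keyed by strategy attributes, keeping one strategy per cell. The attribute names (`category`, `risk_bucket`), the score field, and the keep-the-best-per-cell rule are assumptions in the spirit of quality-diversity archives, not details confirmed by the summary.

```python
def insert(feature_map, strategy):
    """Place a strategy in its attribute cell if the cell is empty or it scores higher."""
    cell = (strategy["category"], strategy["risk_bucket"])
    incumbent = feature_map.get(cell)
    if incumbent is None or strategy["score"] > incumbent["score"]:
        feature_map[cell] = strategy
        return True
    return False


feature_map = {}
insert(feature_map, {"name": "momentum_v1", "category": "momentum", "risk_bucket": "low", "score": 0.8})
insert(feature_map, {"name": "momentum_v2", "category": "momentum", "risk_bucket": "low", "score": 1.1})
insert(feature_map, {"name": "meanrev_v1", "category": "mean_reversion", "risk_bucket": "high", "score": 0.3})
print(sorted(s["name"] for s in feature_map.values()))
```

A weaker strategy in a different cell survives even though a stronger one exists elsewhere, which is how such a map keeps the population diverse.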

The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability

TVP Experimental Framework: introduces the "Trust-Vulnerability Paradox" in LLM-based multi-agent systems, empirically validating that increased inter-agent trust amplifies leakage risk, and proposes defenses.
The framework utilizes CK-Agents and SK-Agents, powered by various LLM backends and orchestration frameworks, to simulate collaboration scenarios with parameterized trust levels.
It quantifies leakage using Over-Exposure Rate and Authorization Drift metrics, and evaluates Sensitive-Information Repartitioning and Guardian-Agent enablement as mitigation strategies.

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

WEBDEVJUDGE: introduces a systematic benchmark for assessing LLM-as-a-judge performance in web development, supporting both static and continuous interactive evaluation, and comprising data collection, rubric annotation, and a judge component with various evaluators, observations, and paradigms.
The benchmark utilizes human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to establish high-quality ground truth for evaluating LLMs, MLLMs, and agentic workflows.
Experiments reveal a significant gap between LLM judges and human experts, stemming from fundamental model limitations like failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias, highlighting challenges for automated evaluators in complex scenarios.

When Your AI Agent Succumbs to Peer-Pressure: Studying Opinion-Change Dynamics of LLMs

LLM-driven Network Model: introduces a framework for auditing emergent socio-cognitive behaviors of multi-agent AI systems, utilizing Experiment Setup, Random Node Selection, Prompt Construction, LLM Query & Recommendation, Update Opinion, Check For Consensus, Next Node Selection, LLM Agents, Social Network, Cognitive Commitment Spectrum, and Discursive Frames to study how peer pressure influences LLM opinions across cognitive commitments.
The research reveals that LLM agents exhibit a sigmoidal conformity pattern, with varying thresholds across models and a "persuasion asymmetry" where the cognitive effort to change an opinion depends on its initial valence and the targeted cognitive layer.
This study uncovers a "dual cognitive hierarchy" where the stability of cognitive constructs inverts based on the direction of persuasion, demonstrating that LLM decision-making is governed by a fluid, context-dependent architecture rather than static logic.

SOCIA-V: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation

SOCIA-V (Simulation Orchestration for Computational Intelligence with Agents): introduces an end-to-end, agentic framework that treats simulator construction as instance optimization over code within a textual computation graph, including Data Analysis, Code Generation, Simulation Execution, Result Evaluation, and Feedback Generation agents.
The framework unifies multi-agent orchestration with a loss-aligned optimization view, converting brittle prompt pipelines into reproducible, constraint-aware simulator code generation.
It employs Textual Gradient Descent (TGD) with Momentum and Projected Gradient Descent (PGD) for iterative code repair, ensuring high-fidelity, extrapolatable simulators across diverse domains.

JAUNT: Joint Alignment of User Intent and Network State for QoE-centric LLM Tool Routing

JAUNT (Joint Alignment of User intent and Network state for QoE-centric Tool routing): introduces a framework that aligns user intent and real-time network states to maximize Quality of Experience (QoE) in LLM tool routing, utilizing Semantic Intent Inference, Network Latency Prediction, and Joint QoE-centric Tool Routing modules.
The framework addresses limitations of current routing mechanisms by interpreting user intent, including semantic ambiguity and emotional expression, and integrating dynamic network conditions for adaptive tool selection.
JAUNT employs LLM agents to construct network profiles, mapping numerical performance indicators into a semantic space to guide routing decisions and continuously updates user profiles based on QoE feedback.

EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval

EfficientNav: introduces an on-device object-goal navigation system that includes a Detection Model (generates semantic/distance information), Graph-based Navigation Map (organizes semantic/spatial information), Attention-based Memory Clustering (clusters objects into groups using LLM attention), Semantics-aware Memory Retrieval (selects relevant groups, prunes redundant map info), Discrete Memory Caching (manages KV cache for groups, avoids re-computation), and an LLM Planner (determines navigation sub-goals).
The system enables efficient zero-shot object-goal navigation on local devices by addressing memory constraints and improving smaller LLM understanding of complex navigation maps.
It achieves significant improvements in success rate and real-time latency reduction over GPT-4-based baselines by optimizing KV cache management and prompt efficiency.

Crucible: Quantifying the Potential of Control Algorithms through LLM Agents

Crucible: introduces an LLM-driven framework for quantifying the Tuning Potential of control algorithms, with LLM Agent, Domain Knowledge Acquisition, Optimization Tools, Control Algorithm Interface, Action and Feedback Loop, Differential Developer Capability Simulation, Performance Characteristic Vector, Unified Environment Distance and Similarity Metric, Tuning Potential Metric, Test Environments, and Reference Algorithms, to systematically evaluate algorithmic adaptability.
The framework employs an LLM-driven multi-level expert simulation agent to emulate developer tuning processes and defines a formalized metric for quantitatively assessing an algorithm's inherent adaptability across diverse environments.
Crucible's approach moves beyond traditional performance evaluation by considering an algorithm's representational capacity and comprehensibility, guiding targeted redesign for improved performance and practical value.

LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

LAFA (Agentic LLM-Driven Federated Analytics): introduces an LLM-driven federated analytics framework that transforms natural language queries into optimized, privacy-preserving execution pipelines, utilizing a Querier, a Server (with Hierarchical Decomposer Agents, a DAG Optimizer Agent, an Aggregator, and an Answerer Agent), and Target Clients that perform FA Pipelines and Submission.
This system addresses challenges in LLM-agent-based analytics by enabling efficient complex query processing in natural language with privacy preservation and reducing computational overhead through a hierarchical multi-agent architecture.
LAFA ensures correct FA operation sequencing and optimizes workflows by eliminating redundant operations, making it suitable for real-world privacy-preserving data analytics over decentralized data sources.

Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents

STOM (Stochastic Theory-of-Mind): introduces a probabilistic intent modeling framework for LLM agents in multi-turn social dialogue, which includes an Intention Model (generates, updates belief distributions), a Likelihood Model (estimates action probability), a Confidence-Aware Policy (selects actions based on uncertainty), and a Belief Distribution (represents partner's latent intentions).
This framework maintains and dynamically updates a belief distribution over a partner's latent intentions, initialized from contextual priors and refined through likelihood estimation after each utterance.
The evolving belief distribution provides contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty and improving multi-dimensional social performance without additional training.
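
The belief update described above follows the shape of a Bayesian posterior, belief(intent) ∝ prior(intent) × likelihood(utterance | intent). The sketch below shows that update plus a simple confidence-aware action choice; the intent labels, likelihood numbers, and the 0.7 threshold are hypothetical, and in the paper the likelihoods come from an LLM-based model rather than fixed values.

```python
def update_belief(prior, likelihood):
    """Multiply prior by likelihood per intent and renormalize."""
    unnorm = {intent: prior[intent] * likelihood[intent] for intent in prior}
    z = sum(unnorm.values())
    if z == 0:
        return dict(prior)  # uninformative evidence: keep the prior
    return {intent: p / z for intent, p in unnorm.items()}


def choose_action(belief, threshold=0.7):
    """Act on the most likely intent only when the belief is confident enough."""
    intent, p = max(belief.items(), key=lambda kv: kv[1])
    return f"act_on:{intent}" if p >= threshold else "ask_clarifying_question"


prior = {"cooperate": 0.5, "compete": 0.5}
likelihood = {"cooperate": 0.9, "compete": 0.3}  # hypothetical P(utterance | intent)
posterior = update_belief(prior, likelihood)
print(choose_action(prior), choose_action(posterior))
```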

Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response

CoCT (Chain-of-Conceptual-Thought): introduces a prompt-based paradigm that guides an LLM to first tag a concept (emotion, strategy, or topic) and then generate detailed content, facilitating deep and strategic thinking within a single utterance.
This approach leverages a CoCT Prompt (structured instruction), which includes available Concepts (emotion, strategy, topic) and uses Special Tokens (concept tags) to explicitly denote conceptual transitions, enabling the LLM to structure its responses in open-domain conversations.
The framework allows for multiple conceptual transitions within one response, mimicking human-like thinking and improving performance in tasks like emotional support conversations.
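
To illustrate how concept tags can mark transitions inside one response, the sketch below splits a tagged utterance into (concept, content) pairs. The angle-bracket tag syntax and the concept names are assumptions for illustration; the paper's actual special tokens may differ.

```python
import re

# Assumed tag format: <concept> followed by free text up to the next tag or the end.
TAG = re.compile(r"<(\w+)>(.*?)(?=<\w+>|$)", re.S)


def split_by_concept(response):
    """Return a list of (concept, content) segments in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in TAG.finditer(response)]


reply = "<empathy>That sounds hard. <suggestion>Try a short walk."
print(split_by_concept(reply))
```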

Memory-Augmented State Machine Prompting: A Novel LLM Agent Framework for Real-Time Strategy Games

MASMP (Memory-Augmented State Machine Prompting): introduces a novel framework for LLM agents in real-time strategy games, integrating state machine prompting with a strategic memory module to achieve structured, coherent decision-making, utilizing an LLM-PySC2 Observation Extractor, Obs-Text converter, Memory Module, State Machine Prompting Module (comprising Macro-Strategic State Machine, Action Implementation Behavior Tree, and Supplementary Atomic Rules), LLMs, Strategy Extractor, Text-Action converter, Action Extractor, and Action Executor.
The framework guides LLMs to emulate finite state machines and behavior trees through natural language prompts, while the memory module preserves strategic variables across decision cycles for persistent tactical coherence.
MASMP achieves a 60% win rate against StarCraft II's hardest built-in AI (Lv7), demonstrating improved interpretability, generalization, and reliability over previous LLM-based baselines by bridging LLM flexibility with rule-based systems.

InspectCoder: Dynamic Analysis-Enabled Self Repair through Interactive LLM-Debugger Collaboration

InspectCoder: introduces an agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control, utilizing a Program Inspector agent for dynamic analysis, a Patch Coder agent for patch generation and validation, and InspectWare middleware for debugger interaction.
The framework enables strategic breakpoint placement, targeted state inspection, and incremental runtime experimentation within stateful debugger sessions, moving beyond blind trial-and-error to systematic root cause diagnosis.
InspectCoder achieves significant improvements in repair accuracy and bug-fix efficiency over baselines by adaptively inspecting and perturbing relevant intermediate states at runtime, guided by immediate debugger feedback.

Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

Genesis: introduces an agentic red-teaming framework for LLM web agents, featuring an Attacker, Scorer, Strategist, and Strategy Library, designed to systematically discover, summarize, and evolve attack strategies.
The framework employs a genetic algorithm within the Attacker to evolve strategies, an LLM-powered Scorer for feedback, and an LLM-based Strategist to refine the continuously growing Strategy Library.
This closed-loop system automates the red-teaming process by mimicking human expert learning, enabling dynamic adaptation and transferability of attack knowledge across diverse web environments.

Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models

MED-RWR (Multimodal Medical Reasoning-with-Retrieval framework): introduces a proactive multimodal reasoning-with-retrieval framework for medical MLLMs, leveraging its Policy Model, Medical Knowledge Base, Reference Model, Reward Design, Confidence-Driven Image Re-retrieval (CDIR), Multimodal Medical KB, Retriever, Input, and Output to enhance diagnostic accuracy by actively integrating external knowledge.
The framework employs a two-stage reinforcement learning strategy with tailored rewards, including accuracy, format, query semantic (visual and textual), and confidence gain, to stimulate effective retrieval and reasoning.
CDIR further augments the system by triggering image re-retrieval from a multimodal knowledge base during inference when low prediction confidence is detected, addressing insufficient information from initial text-based retrieval.
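
The control flow of confidence-driven re-retrieval can be sketched as a simple threshold check at inference time. The callables `predict` and `retrieve_images`, the list-based context, and the 0.6 threshold are hypothetical stand-ins for the policy model and the multimodal knowledge base.

```python
def answer_with_cdir(predict, retrieve_images, image, question, text_context, threshold=0.6):
    """Answer once; if confidence is low, fetch image-based evidence and retry.

    predict(question, image, context) returns (answer, confidence).
    Returns (answer, confidence, improved_by_reretrieval).
    """
    answer, conf = predict(question, image, text_context)
    if conf < threshold:
        extra = retrieve_images(image)  # second retrieval pass keyed on the image
        answer2, conf2 = predict(question, image, text_context + extra)
        if conf2 > conf:
            return answer2, conf2, True
    return answer, conf, False


# Stub model: confident only when the hypothetical "atlas" reference is in context.
predict = lambda q, img, ctx: ("pneumonia", 0.9) if "atlas" in ctx else ("unsure", 0.3)
retrieve_images = lambda img: ["atlas"]
print(answer_with_cdir(predict, retrieve_images, "xray.png", "Diagnosis?", ["note"]))
```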

Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

Food4All (Multi-Agent Framework): introduces a multi-agent framework for real-time, context-aware free food retrieval, unifying cross-platform data aggregation, a reinforcement learning algorithm, and an online feedback loop to deliver nutritionally annotated food recommendations.
The framework employs a dual-agent system with a Planner Agent for hierarchical task decomposition and an Executor Agent for tool-grounded execution, addressing limitations of existing systems like incomplete information and lack of personalization.
Food4All dynamically adapts retrieval policies to evolving user needs through an online learning loop, ensuring reliable and practical food access information for food-insecure populations.

When Old Meets New: Evaluating the Impact of Regression Tests on SWE Issue Resolution

TESTPRUNE: introduces an automated technique that leverages issue tracker reports and strategically reuses regression tests for bug reproduction and patch validation, utilizing an LLM, suspicious function localization, test file retrieval and coverage generation, and a greedy algorithm to produce minimized regression tests.
This approach addresses the challenge of large test suites exceeding LLM context limits by minimizing the regression suite to a small, highly relevant subset of tests, thereby improving efficiency and reliability in LLM-based debugging workflows.
The minimized regression tests generated by the framework enhance reproduction test generation by providing focused guidance and improve patch selection and validation by ensuring relevance to the issue, leading to increased issue reproduction and resolution rates.
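
A greedy minimization step of this kind resembles greedy set cover: repeatedly pick the test that covers the most still-uncovered suspicious functions. The sketch below assumes per-test coverage is available as sets of function names; the test and function names are hypothetical, and the paper's exact selection criterion may differ.

```python
def minimize_tests(coverage, targets):
    """Greedily select tests until every coverable target function is covered.

    coverage: dict mapping test name to the set of functions it executes.
    targets: iterable of suspicious function names to cover.
    """
    remaining = set(targets)
    selected = []
    while remaining:
        best = max(coverage, key=lambda t: len(coverage[t] & remaining))
        gain = coverage[best] & remaining
        if not gain:
            break  # nothing left that any test can cover
        selected.append(best)
        remaining -= gain
    return selected


coverage = {
    "test_a": {"parse", "render"},
    "test_b": {"render"},
    "test_c": {"save"},
    "test_d": {"unrelated"},
}
print(minimize_tests(coverage, {"parse", "render", "save"}))
```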

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

SafeSearch: introduces an RL-based alignment framework that jointly optimizes safety and utility for LLM-based search agents, incorporating mixed training with general QA and red-teaming datasets, and utilizing both final-output safety/utility rewards and a novel query-level shaping term.
The framework explicitly rewards policy-compliant helpfulness and penalizes unsafe queries, aiming to reduce harmful outputs while maintaining or improving QA performance.
SafeSearch reduces agent harmfulness by over 70% across red-teaming datasets and matches the QA performance of utility-only fine-tuned agents, demonstrating the effectiveness of its query-level reward in balancing safety and utility.

Cultural Alien Sampler: Open-ended art generation balancing originality and coherence

CAS (Cultural Alien Sampler): introduces a concept-selection method that explicitly separates compositional fit from cultural typicality, with a Concept Coherence Model (scores concept co-occurrence), a Cultural Context Model (estimates concept combination typicality), and a Scoring Function (balances coherence and typicality).
The framework integrates CAS into an Open-ended Art Agent, which includes an Inspiration Module, a Prompt Compositor (GPT-4o), an Image Generator (gpt-image-1), and a Novelty Score (evaluates originality/harmony using text and image embedding models).
This approach enables autonomous agents to generate ideas that maintain internal consistency while deviating from learned conventions, outperforming LLM baselines in originality and harmony and exploring a broader conceptual space.
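
One simple way to balance coherence against typicality is to reward the first and penalize the second. The linear form `coherence - lam * typicality`, the weight `lam`, and the candidate values below are assumptions for illustration, not the paper's actual Scoring Function.

```python
def cas_score(coherence, typicality, lam=1.0):
    """Higher when concepts fit together but the combination is culturally unusual."""
    return coherence - lam * typicality


def pick_concept(candidates, lam=1.0):
    """Return the name of the candidate concept with the best score."""
    best = max(candidates, key=lambda c: cas_score(c["coherence"], c["typicality"], lam))
    return best["name"]


candidates = [  # hypothetical model outputs
    {"name": "clock", "coherence": 0.9, "typicality": 0.9},   # fits, but conventional
    {"name": "coral", "coherence": 0.7, "typicality": 0.2},   # fits and unusual
    {"name": "static", "coherence": 0.1, "typicality": 0.05}, # unusual, but incoherent
]
print(pick_concept(candidates))
```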

AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification

AndroidControl-Curated Pipeline: introduces a systematic, semi-automated benchmark purification pipeline and a novel reinforcement learning training paradigm for GUI agents, including optimized grounding, multi-model filtering, LLM review and rewrite, human expert verification, and GRPO training with Gaussian rewards and ratio optimization.
This pipeline creates AndroidControl-Curated, a refined benchmark that accurately evaluates GUI agents, and trains Magma-R1, a compact 3B model achieving state-of-the-art performance on complex GUI tasks.
The research demonstrates that benchmark quality is more critical than model scale for GUI agent evaluation, enabling on-device GUI agents to be closer to practical deployment.
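
A Gaussian reward for grounding can be sketched as a score that decays smoothly with the distance between the predicted click point and the target, instead of a hard hit-or-miss signal. The pixel coordinates, the isotropic form, and `sigma` are assumptions for illustration; the paper's reward may be parameterized differently.

```python
import math


def gaussian_reward(pred, target, sigma=20.0):
    """Return exp(-d^2 / (2 * sigma^2)) for the distance d between two (x, y) points."""
    dx, dy = pred[0] - target[0], pred[1] - target[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))


print(gaussian_reward((10, 10), (10, 10)))
print(gaussian_reward((30, 10), (10, 10)) > gaussian_reward((60, 10), (10, 10)))
```

A smooth reward like this gives the policy a learning signal even for near misses, which a binary inside-the-box reward would not.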

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA (A Foundation Model for Computer Use Agents with Hybrid Action): introduces a foundation model that seamlessly integrates GUI primitives with high-level programmatic tool calls, leveraging an automated pipeline for tool acquisition, a synthetic data engine for verifiable tasks, a large-scale hybrid action trajectory collection, and a two-stage training pipeline.
This approach enables strategic alternation between low-level GUI actions and high-level programmatic tool calls, reducing error propagation and maintaining execution efficiency for computer-use agents.
UltraCUA achieves state-of-the-art performance on real-world benchmarks, demonstrating improved success rates and cross-platform generalization by effectively bridging primitive GUI interactions and programmatic intelligence.

20th October 2025

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

EDR (Enterprise Deep Research): introduces a multi-agent system for enterprise analytics, integrating a Master Research Agent, ToDo Manager, Specialized Agents (including search domain and enterprise workflow tools), a Reflection Mechanism, and a Research Report component, with optional human steering.
The framework enables automated report generation, real-time streaming, and seamless enterprise deployment, outperforming state-of-the-art agentic systems on open-ended benchmarks without human steering.
EDR provides transparent, steerable research through dynamic context engineering, allowing human users to guide the agent's reasoning trajectory and task management during execution.

A Brain Cell Type Resource Created by Large Language Models and a Multi-Agent AI System for Collaborative Community Annotation

BRAINCELL-AID (Brain Cell type Annotation and Integration using Distributed AI): introduces a novel multi-agent AI system for collaborative community annotation of brain cell types, utilizing an Agentic Network, Query Agent, Fine-tuned LLM, GPTON, Literature Agent, RAG Agent, Web Portal, and Neuroscience Communities / Human Feedback to generate and refine biologically grounded annotations.
The system leverages fine-tuned LLMs and retrieval-augmented generation to overcome limitations of traditional annotation methods, providing high-quality, literature-backed descriptions for over 20,000 brain cell type-specific marker gene sets.
BRAINCELL-AID enhances annotation accuracy, supports testable hypothesis generation, and fosters human-AI collaboration through its interactive web portal, advancing neuroscience discovery.

Semantic Joint Source Channel Coding for Distributed Subsurface Imaging in Multi-Agent Systems

Semantic JSCC AirComp (Semantic Joint Source Channel Coding with Over-the-Air Computation): introduces a framework that integrates semantic communication into multi-agent system (MAS) exploration, applying semantic JSCC with AirComp for distributed function computation (DFC) in cooperative subsurface imaging using the Adapt-Then-Combine Full Waveform Inversion (ATC-FWI) algorithm.
This framework employs Neural Network encoders to compress and channel code observations from neighboring agents, an AirComp channel to sum transmitted symbols, and a Neural Network semantic decoder to reconstruct a semantic variable, leveraging local side information at the receiver.
The system enhances overall task performance by adapting communication strategies to the exploration methodology, demonstrating improved bandwidth efficiency and imaging accuracy in noisy inter-agent communication links.

ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input

ImaGGen: introduces a zero-shot system for co-speech semantic gesture generation, with an Image Feature Analysis Pipeline (identifies objects), a Semantic Matching Pipeline (links text to visuals), and a Realization Engine (synthesizes gestures) to produce iconic, deictic, and beat gestures from language and image input.
The system extracts object properties like shape, symmetry, and alignment from images, matches these visual details to spoken text, and then synthesizes gestures using an inverse kinematics engine, layering them with co-generated beat gestures.
A user study demonstrated that the generated gestures significantly improved participants' ability to identify object properties in ambiguous speech scenarios, confirming their interpretability and communicative value for virtual agents.

Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs

CAI (Cybersecurity AI) Parallel Execution Framework: introduces an empirical evaluation of AI agents in Attack/Defense CTF scenarios, deploying autonomous offensive and defensive agents concurrently in a shared target environment to assess their capabilities under various operational constraints.
The framework leverages LLMs to power specialized agents, enabling fine-grained control over their configuration, context, and objectives for direct comparison of attack and defense performance.
This study challenges claims of inherent AI attacker advantage by demonstrating that defensive effectiveness critically depends on success criteria, highlighting the importance of availability-preserving defense in real-world cybersecurity operations.

Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

Industry Agent Framework: introduces a five-level capability maturity model for LLM-driven industry agents, detailing their evolution across Memory, Planning, and Tool Use pillars.
This framework categorizes agents from simple process execution systems (L1) to adaptive social systems (L5), driven by advancements in core technologies.
It provides a roadmap for understanding and building next-generation industry agents by linking technological evolution with practical applications and evaluation.

Agentic Reinforcement Learning for Search is Unsafe

ARLS (Agentic Reinforcement Learning for Search): introduces a study evaluating the safety of RL-trained search models, whose setup includes a Policy LLM, Search Engine, Reward Function, Reference Model, System Prompt, and specific Tokens. The study reveals their vulnerability to two jailbreaking attacks, Search Attack and Multi-search Attack, which an LLM Evaluator assesses using a Harmful Instructions Dataset.
The paper demonstrates that these models, despite inheriting refusal behavior from instruction tuning, can be easily exploited by forcing early searches, leading to cascades of harmful queries and answers.
This research highlights a critical weakness in current RL training objectives that prioritize effective query generation over safety, necessitating the development of safety-aware RL pipelines.

Diverse Planning with Simulators via Linear Temporal Logic

FBILTL (Forbid Behaviour Iterative LTL): introduces a diverse planner for simulation-based planning problems that uses its Behaviour Sorts Suite (BSS) for diversity modeling and Linear Temporal Logic (LTL) to define semantic diversity criteria. These criteria are integrated into the search process using a modified Iterated Width (IW(i)) Planner as a BehaviourGeneratorx and a Simulator for environment interaction, with a PlanGeneratorx for additional plan generation.
This framework addresses the limitation of existing diverse planning approaches that often produce semantically identical solutions by ensuring the generation of semantically distinct plans based on user-defined diversity features.
The approach demonstrates the feasibility of semantically-guided diverse planning in complex, non-symbolic simulation-based environments, offering a significant advantage over traditional declarative models.

ALPINE: A Lightweight and Adaptive Privacy-Decision Agent Framework for Dynamic Edge Crowdsensing

ALPINE (A Lightweight and Adaptive Privacy-Decision Agent Framework for Dynamic Edge Crowdsensing): introduces a closed-loop control system that empowers terminal devices to autonomously adjust differential privacy levels in real time, balancing privacy gains, data utility, and energy cost.
The framework includes a Risk Perception Module, a Privacy Decision Module, a Privacy Execution Module, and a Performance Verification Module, operating across mobile terminal devices and an edge computing server.
It leverages a LightAE for channel risk detection and a TD3 agent for dynamic privacy budget allocation, with feedback from the edge server for continuous policy refinement.
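
The privacy-adjustment idea described above can be sketched in a few lines. This is a minimal illustration, not ALPINE's implementation: it assumes the Laplace mechanism as the differential privacy primitive, and `choose_epsilon` is a hypothetical linear rule standing in for the learned TD3 policy; the risk scores and the sensor reading are made up for the example.

```python
import random

def choose_epsilon(risk: float, eps_min: float = 0.1, eps_max: float = 1.0) -> float:
    """Hypothetical stand-in for the learned policy: higher risk -> smaller budget."""
    risk = min(max(risk, 0.0), 1.0)  # clamp the risk score to [0, 1]
    return eps_max - risk * (eps_max - eps_min)

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(value: float, sensitivity: float, epsilon: float,
              rng: random.Random) -> float:
    """Laplace mechanism: the noise scale is sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
reading = 72.0  # illustrative sensor reading, not from the paper
low_risk = privatize(reading, sensitivity=1.0, epsilon=choose_epsilon(0.1), rng=rng)
high_risk = privatize(reading, sensitivity=1.0, epsilon=choose_epsilon(0.9), rng=rng)
```

A smaller epsilon yields a larger noise scale, which is the trade-off between privacy gain and data utility that the framework is described as balancing.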

Learning After Model Deployment

PLDA (Post-deployment Learning based on Linear Discriminant Analysis): introduces Autonomous Learning after Model Deployment (ALMD), a paradigm enabling AI agents to continuously learn new knowledge autonomously after model deployment, utilizing a pre-trained model, LDA, and incremental updates of class means.
The framework performs dynamic OOD detection using Mahalanobis distance or Relative Mahalanobis distance, and incrementally learns new classes by updating their class means while keeping a shared covariance matrix fixed, thus avoiding catastrophic forgetting.
This approach allows for efficient, online learning from streaming data without human engineers, adapting to open and dynamic environments by expanding its set of known classes.
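
The mechanism described above (distance-based OOD detection plus incremental class-mean updates under a fixed shared covariance) might look roughly like the sketch below. It is an illustration under assumptions: the class name, the threshold value, and the use of raw 2-D vectors instead of features from a pre-trained model are all hypothetical, and the Relative Mahalanobis variant is not shown.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]

def mahalanobis_sq(x: Vector, mean: Vector, precision: Matrix) -> float:
    """Squared Mahalanobis distance, given the inverse of the shared covariance."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[i] * precision[i][j] * d[j] for i in range(n) for j in range(n))

class IncrementalMeanClassifier:
    """Hypothetical sketch: class means grow over time, the covariance stays fixed."""

    def __init__(self, precision: Matrix, threshold: float) -> None:
        self.precision = precision   # inverse shared covariance, never updated
        self.threshold = threshold   # squared-distance cutoff for OOD (assumed value)
        self.means: Dict[str, Vector] = {}
        self.counts: Dict[str, int] = {}

    def nearest(self, x: Vector) -> Tuple[Optional[str], float]:
        """Return the closest known class and its squared distance."""
        best_label, best_dist = None, float("inf")
        for label, mean in self.means.items():
            dist = mahalanobis_sq(x, mean, self.precision)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label, best_dist

    def is_ood(self, x: Vector) -> bool:
        """A sample is out-of-distribution if it is far from every class mean."""
        return self.nearest(x)[1] > self.threshold

    def update(self, label: str, x: Vector) -> None:
        """Running-mean update; other classes are untouched, so nothing is forgotten."""
        n = self.counts.get(label, 0) + 1
        mean = self.means.get(label, [0.0] * len(x))
        self.means[label] = [m + (xi - m) / n for m, xi in zip(mean, x)]
        self.counts[label] = n

clf = IncrementalMeanClassifier(precision=[[1.0, 0.0], [0.0, 1.0]], threshold=4.0)
clf.update("cat", [0.0, 0.0])
clf.update("cat", [2.0, 0.0])
if clf.is_ood([10.0, 10.0]):          # far from "cat", so treat it as a new class
    clf.update("dog", [10.0, 10.0])
```

Because adding a class only inserts a new mean, the statistics of existing classes are left unchanged, which is the property the summary credits with avoiding catastrophic forgetting.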

Digitization Can Stall Swarm Transport: Commensurability Locking in Quantized-Sensing Chains

Robotic Swarm Model: introduces a minimal model for autonomous robotic swarms that self-organize spacing and follow local gradients using quantized digital sensors, investigating collective response, fractional transport, and commensurability locking.
The model incorporates stochasticity (λrand), quantized sensing (λsens) leading to motion bias, and pairwise inter-agent interactions (λint) to maintain swarm formation.
The research reveals that collective transport can stall due to commensurability locking, a number-theoretic condition, and explores how swarm topology affects transport in higher dimensions.

From AutoRecSys to AutoRecLab: A Call to Build, Evaluate, and Govern Autonomous Recommender-Systems Research Labs

AutoRecLab (Autonomous Recommender-Systems Research Lab): introduces a vision for an integrated system that automates the entire research lifecycle in recommender systems, from problem ideation to manuscript drafting, utilizing LLM-driven components and automated experimentation.
This framework aims to expand beyond current AutoRecSys tools by enabling autonomous generation of research questions, experimental designs, and manuscript writing, while maintaining rigorous provenance records.
The paper calls for the RecSys community to build prototypes, establish benchmarks, embrace AI-generated submissions, develop attribution standards, and foster interdisciplinary dialogue for responsible integration of automated research.

SPACER: Self-Play Anchoring with Centralized Reference Models

SPACER (Self-Play Anchoring with Centralized Reference Models): introduces a framework that leverages a pretrained Tokenized Reference Model (πref) to guide a Decentralized Model (πθ) in self-play, anchoring decentralized self-play policies to human driving distributions using likelihood rewards and KL divergence for scalable, realistic multi-agent simulation. Its components include Batched Simulations, Full Scene Context, Agents, Agent Rollouts, a Loss Function (L(θ)), Task Performance (LPPO), a Human-likeness Reward (r_humanlike), and Distributional Alignment (DKL).
The Tokenized Reference Model (πref) provides a human-likeness distributional signal and likelihood rewards, while the Decentralized Model (πθ) is the self-play policy trained via reinforcement learning on local observations.
The Loss Function (L(θ)) combines a Task Performance (LPPO) objective with a Human-likeness Reward (r_humanlike) and Distributional Alignment (DKL) term to balance task success with realistic, human-like behaviors.
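
One way to read the combination above is as reward shaping plus a KL penalty. The sketch below is only that reading, not the paper's formula: the weights `alpha` and `beta`, the direction of the KL term, and the use of small discrete action distributions are all assumptions made for illustration.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same action tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def humanlike_reward(ref_probs: Sequence[float], action: int) -> float:
    """Log-likelihood of the chosen action token under the reference model."""
    return math.log(ref_probs[action])

def shaped_reward(task_reward: float, ref_probs: Sequence[float],
                  action: int, alpha: float) -> float:
    """Task reward plus a weighted likelihood bonus (alpha is an assumed weight)."""
    return task_reward + alpha * humanlike_reward(ref_probs, action)

def combined_loss(ppo_loss: float, policy_probs: Sequence[float],
                  ref_probs: Sequence[float], beta: float) -> float:
    """Task loss plus a weighted KL penalty (beta is an assumed weight)."""
    return ppo_loss + beta * kl_divergence(policy_probs, ref_probs)

ref = [0.9, 0.1]      # reference model strongly prefers action 0
policy = [0.5, 0.5]   # self-play policy is undecided
loss = combined_loss(ppo_loss=1.0, policy_probs=policy, ref_probs=ref, beta=0.1)
```

The KL term is zero when the policy matches the reference and grows as the two diverge, so it pulls the self-play policy toward the human-like distribution without replacing the task objective.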

LLM-Based Multi-Agent System for Simulating and Analyzing Marketing and Consumer Behavior

LLM-Based Multi-Agent System: introduces an LLM-powered multi-agent simulation framework, including LLM Response System (Manages LLM interactions), Memory System (Manages agent memories), Main Loop (Orchestrates simulation flow), Agents (Simulated entities), and Environment (Virtual world), designed to simulate consumer decision-making and social dynamics for marketing strategy evaluation.
The framework utilizes DeepSeek-V3 LLM-powered generative agents that plan daily schedules, manage resources, shop, converse, and make social commitments within a virtual town environment.
This approach provides marketers with a scalable, low-risk tool for pre-implementation testing, reducing reliance on time-intensive post-event evaluations and lowering the risk of underperforming campaigns.

Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models

PaDA-Agent (Pattern-guided Data Augmentation Agent): introduces an evaluation-driven approach for fine-tuning SLMs, where a Central Orchestrator coordinates a Pattern Analysis Agent, Data Generation Agent, and Quality Control Agent to iteratively enhance a Fine-tuned SLM. The agents augment the Training Data with synthetic data derived from Validation Data insights, guided by Error Patterns, Augmentation Strategies, and Quality Control Feedback, to produce Augmented Training Data.
The framework systematically analyzes validation failures to discover error patterns, drafts targeted augmentation strategies, and generates synthetic data with automated quality control, directly addressing generalization errors.
This multi-agent system significantly outperforms state-of-the-art LLM-based data augmentation approaches, yielding consistent performance gains for SLMs across various tasks, especially in low-data regimes.

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

BlueCodeAgent: introduces an end-to-end blue teaming agent enabled by automated red teaming, integrating an Automated Red Teaming Pipeline (generates diverse risky instances and knowledge), a Knowledge Base (stores red-teaming data), and a BlueCodeAgent (main agent for defense) which includes Constitution Generation (summarizes knowledge into actionable rules), a Static Analyzer (performs initial vulnerability analysis), a Dynamic Analyzer (generates test cases for runtime verification), a Code Runner (executes code in sandbox), and a Final Analyzer (integrates analysis for final judgment).
The framework unifies automated red teaming to generate diverse risky instances, which are distilled into actionable constitutions that guide the blue teaming agent to detect unsafe textual inputs and code outputs.
BlueCodeAgent leverages dynamic testing to validate vulnerability claims, effectively reducing false positives and over-conservative judgments, thereby achieving robust and precise risk mitigation across various code-generation security tasks.

Investigating the Impact of Dark Patterns on LLM-Based Web Agents

LiteAgent: introduces a framework for evaluating LLM-based web agents against dark patterns, utilizing TrickyArena as a controlled web environment with customizable dark patterns and tasks, and LiteAgent's components for automated agent execution, interaction logging, and performance validation.
The framework captures comprehensive logs and screen-recordings of agent interactions, enabling systematic assessment of dark pattern susceptibility and task completion rates across various LLM-based generalist web agents.
The study reveals that web agents are susceptible to dark patterns, with higher-performing agents being more vulnerable, and that both LLM choice and agent architecture significantly influence susceptibility, highlighting the need for holistic defense mechanisms.

Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment

LLM-based D&D Action Generation Experiment: introduces an experimental setup comparing an Instruct Model (LLaMA-3.1-8B-Instruct) and a Reasoning Model (DeepSeek-R1-Distill-LLaMA-8B) for generating Dungeons & Dragons player actions as Avrae Discord bot commands, utilizing prompt engineering, the FIREBALL dataset, and various evaluation metrics.
This research investigates the impact of prompt design on LLMs' ability to predict structured actions during D&D combat, focusing on command generation for the Avrae Discord bot.
The study highlights that specific instructions in prompts significantly affect model output, concluding that instruct models are sufficient for this task and can outperform reasoning models, especially for smaller LLMs.

CompactPrompt: A Unified Pipeline for Prompt and Data Compression in LLM Workflows

CompactPrompt: introduces a unified pipeline for prompt and data compression in LLM workflows, with Initialization (sets up pipeline), Token Probability Construction (computes token likelihoods), Hybrid Scoring (LLM+Programmatic) (combines LLM and programmatic scores), Compression Engine (applies pruning, abbreviation, quantization), Semantic Similarity Analysis (evaluates compressed output fidelity), Metrics Computation (calculates performance metrics), Result Assembly (gathers compressed outputs), and Final Context (generates final compressed context), aiming to reduce token usage and inference costs while preserving output quality.
The pipeline integrates hard prompt compression, textual n-gram abbreviation for documents, and numerical quantization for structured data, addressing diverse input types without model retraining.
CompactPrompt achieves up to 60% token reduction and maintains or improves QA accuracy on benchmarks like TAT-QA and FinQA, making LLM workflows more efficient and cost-effective.

OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

OPTAGENT (Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning): introduces a multi-agent verbal reinforcement learning framework that dynamically constructs and refines multi-agent collaboration structures, including Profiled LLM Agents (individual LLMs with distinct personas), a Multi-Agent Collaboration Graph (representing agent interactions), LLMreflect (a feedback agent), LLMact (an action agent), and a Majority Voting Strategy (for final decision-making), to optimize interaction patterns and communication quality for enhanced reasoning.
The framework leverages verbal reinforcement learning with meta-agents (LLMreflect and LLMact) to evaluate and adapt the collaboration graph, ensuring effective information flow and robust problem-solving across diverse reasoning tasks.
By explicitly considering communication quality and dynamically updating connection scores, OPTAGENT significantly outperforms single-agent prompting and state-of-the-art multi-agent frameworks on various reasoning tasks.

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

BadScientist: introduces a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems, including a Paper Agent (generates fabricated research papers), a Review Agent (evaluates papers using multiple LLMs), and an Analysis System (aggregates outcomes and calibrates thresholds).
The framework employs five presentation-manipulation strategies (TooGoodGains, BaselineSelect, StatTheater, CoherencePolish, ProofGap) to generate unsound papers without real experiments, which are then evaluated by LLM reviewers calibrated on real conference data.
Findings reveal systematic vulnerabilities where fabricated papers achieve high acceptance rates (up to 82.0%) and exhibit a "concern-acceptance conflict," indicating that LLM reviewers frequently flag integrity issues yet still assign acceptance-level scores.

FABRIC: Framework for Agent-Based Realistic Intelligence Creation Weaving Synthetic Enterprise Data for Training Autonomous Agents

FABRIC (Framework for Agent-Based Realistic Intelligence Creation): introduces a unified, modular framework for generating structured, executable, and validated tool-use data from LLMs, without human supervision, to train autonomous agents.
The framework leverages four modular pipelines—RecordSynth, DAGFirstGeneration, MultiTurnDialogueSynth, and AgenticRecordRollout—to synthesize agentic data across varying granularities, from end-to-end trajectories to atomic function calls.
It integrates constrained generation formats, JSON-schema validation, and judge-based filtering to ensure logical consistency, execution fidelity, and schema validity of the generated synthetic datasets, advancing robust tool use for agentic LLMs.
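
The schema-validation step mentioned above can be illustrated with a small filter over generated tool calls. This is a hand-rolled stand-in for a full JSON Schema validator, written with the standard library only; the tool name `get_weather`, its arguments, and the record layout (`name` plus `arguments`) are hypothetical and not taken from FABRIC.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool description: required argument name -> expected Python type.
WEATHER_TOOL: Dict[str, Any] = {
    "name": "get_weather",
    "required": {"city": str, "days": int},
}

def validate_tool_call(raw: str, tool: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value must be an object"]
    errors: List[str] = []
    if call.get("name") != tool["name"]:
        errors.append(f"unexpected tool name: {call.get('name')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        errors.append("arguments must be an object")
        return errors
    for key, expected in tool["required"].items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:  # exact match, so True is not an int
            errors.append(f"wrong type for argument: {key}")
    return errors

good = '{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}'
bad = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
kept = [r for r in (good, bad) if not validate_tool_call(r, WEATHER_TOOL)]
```

Records that fail the check are dropped before any judge-based filtering, so the more expensive step only sees structurally valid samples.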

Executable Knowledge Graphs for Replicating AI Research

XKG (Executable Knowledge Graphs): introduces a modular and pluggable knowledge base for AI research replication, comprising various nodes (Paper Node, Technique Node, Code Node) and edges (Structural Edge, Implementation Edge), constructed through automated processes (Corpus Curation, Technique Extraction, Code Modularization, Knowledge Filtering), and leveraged by LLM agents (BasicAgent, IterativeAgent, PaperCoder) for planning and implementation, using Query Retrieval and an LLM-based Verifier, with o4-mini and DeepSeek-V3 as core LLMs.
The framework automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature to support multi-granular retrieval and reuse.
XKG significantly enhances AI research replication by providing structured, executable knowledge, enabling agents to retrieve, reason about, and assemble precise artifacts for faithful reproduction.

Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

Levels of Autonomy (L0-L3): introduces a framework for evaluating medical LLMs by categorizing their capabilities into four distinct levels, spanning informational tools to supervised agents.
This framework aligns existing benchmarks and metrics with permitted actions and associated risks at each autonomy level, providing a structured approach for evaluation and oversight.
The survey moves beyond simple score-based claims towards credible, risk-aware evidence for safe and reliable clinical deployment of LLM-based systems.

ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

ShapeCraft: introduces a multi-agent framework for text-to-3D generation, employing Parser, Coder, and Evaluator LLM agents that interact with a Graph-based Procedural Shape (GPS) representation to produce structured, textured, and interactive 3D assets.
The framework hierarchically parses user input into a GPS, iteratively refines procedural modeling and painting, and enables post-modeling interactions like shape editing and animation.
ShapeCraft demonstrates superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based methods, highlighting its potential for broader interactive applications.

SpecAgent: A Speculative Retrieval and Forecasting Agent for Code Completion

SpecAgent (Speculative Retrieval and Forecasting Agent): introduces a framework that improves code completion by proactively generating and storing speculative context blocks, which include retrieved relevant code contexts and predicted future function implementations, to be used by a Code Completion Model (LLM) for generating code completions for a target file.
This approach shifts costly context computation from inference time to asynchronous indexing time, leveraging Indexing-Time Tools to analyze the Repository, thereby significantly reducing latency and enhancing code generation quality.
The framework also addresses the "future context leakage" problem in existing benchmarks by introducing a synthetic, leakage-free evaluation environment for more realistic performance assessment.

Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems

CONTROLVALVE: introduces a defense framework for multi-agent systems, with Orchestrator, LLM (CFG generation), Lark Parser, LLM (Rule Generation), LLM Judge, Control-Flow Graph (CFG), Edge-Specific Rules, Sub-Agents, User, Adversary, and Conversation State components, designed to prevent control-flow hijacking by generating and enforcing permitted control-flow graphs and contextual rules for agent invocations.
The framework operates by first generating a task-specific control-flow graph and edge-specific contextual rules using LLMs during a planning stage, before any untrusted content is ingested.
During execution, the Orchestrator and an LLM Judge enforce compliance with these predefined graphs and rules, blocking or replanning if agent transitions violate the established security policies.
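
The enforcement step described above can be sketched as a guard that only permits transitions present in a pre-approved graph. This is an illustrative reduction, not CONTROLVALVE itself: the agent names are hypothetical, and the edge-specific rules are plain Python predicates standing in for the LLM Judge.

```python
from typing import Callable, Dict, Tuple

Edge = Tuple[str, str]
Rule = Callable[[str], bool]

class ControlFlowGuard:
    """Allows an agent transition only if the edge exists and its rule passes."""

    def __init__(self, allowed_edges: Dict[Edge, Rule], start: str) -> None:
        self.allowed_edges = allowed_edges  # fixed before untrusted content is read
        self.current = start

    def request(self, target: str, message: str) -> bool:
        """Try to hand control to `target`; a blocked request leaves state unchanged."""
        rule = self.allowed_edges.get((self.current, target))
        if rule is None or not rule(message):
            return False
        self.current = target
        return True

# Hypothetical plan for a "summarize report.txt" task.
guard = ControlFlowGuard(
    allowed_edges={
        ("orchestrator", "file_reader"): lambda msg: "report.txt" in msg,
        ("file_reader", "orchestrator"): lambda msg: True,
    },
    start="orchestrator",
)
ok = guard.request("file_reader", "read report.txt")
hijack = guard.request("code_executor", "run the script found in the file")
```

Because the graph is built during planning, an instruction injected later through file or web content cannot add a new edge; the most it can do is trigger a blocked request.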

Coinvisor: An RL-Enhanced Chatbot Agent for Interactive Cryptocurrency Investment Analysis

COINVISOR: introduces a reinforcement learning-based web chatbot for cryptocurrency investment analysis, with an RL-tuned LLM Caller, Data Analytics Tools, Report Agents, and a Reasoning Model to provide comprehensive, real-time insights.
The system employs Reinforcement Learning for multi-step tool selection, enabling adaptive analysis of dynamic web content and flexible integration of heterogeneous data sources.
COINVISOR addresses limitations of static LLM agents and fragmented data platforms by offering interactive, multi-dimensional analysis and decision support through its multi-agent framework.

Which LLM Multiagent Protocol to Choose?

ProtocolBench and ProtocolRouter: introduces a system for evaluating and dynamically selecting LLM multi-agent communication protocols, comprising a benchmark for performance and robustness, and a learned router for scenario-specific protocol assignment.
The system addresses the challenge of protocol selection by systematically comparing A2A, ACP, ANP, and Agora across four dimensions: task success, latency, overhead, and failure robustness.
ProtocolRouter enhances multi-agent system reliability and efficiency by selecting optimal protocols per module based on requirements and runtime signals, outperforming single-protocol baselines in targeted settings.

Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction

Tri-Agent Framework: introduces a unified benchmark for evaluating LLMs' ability to discover and utilize hidden user attributes through multi-turn interaction, featuring a User (simulates person, hidden preferences), an Assistant (LLM under evaluation, active), and a Judge (evaluates output alignment).
This benchmark spans three progressively realistic tasks: the 20 Questions game, Personalized Question Answering, and Personalized Text Summarization, designed to assess latent information discovery and personalization accuracy.
The framework enables systematic, turn-level evaluation of questioning strategies and reasoning efficiency, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.

Semantic Intelligence: A Bio-Inspired Cognitive Framework for Embodied Agents

SIDE (Semantic Intelligence-Driven Embodied) agent framework: introduces a bio-inspired cognitive framework for embodied agents, integrating a hierarchical semantic cognition architecture with a semantic-driven decision-making process to enable contextually adaptive interaction with the physical world.
The framework operates through a closed perception-cognition-action loop, where the Semantic Cognitive Architecture builds semantic knowledge from multimodal sensor data, and the Semantic-Driven Decision Loop guides planning and action execution using this knowledge.
This approach enhances embodied agents' ability to extract, represent, reason, and apply semantics, addressing limitations of current LLM-based agents in real-world environments by facilitating flexible planning, robust execution, and interpretable behavior.

Verification-Aware Planning for Multi-Agent Systems

VERIMAP (Verification-Aware Planning for Multi-Agent Systems): introduces a framework for multi-agent collaboration with verification-aware planning, which includes a Planner (decomposes tasks, generates VFs), Executor (solves subtasks, produces outputs), Verifier (evaluates outputs, provides feedback), and Coordinator (orchestrates execution, manages context/retries).
The framework integrates planning and verification by decomposing tasks into a Directed Acyclic Graph (DAG) of subtasks, where the planner specifies Structured and Named I/O and Verification Functions (VFs) at the subtask level.
Executors produce JSON outputs verified by paired VFs, while the coordinator manages contexts, retries, and dynamic replanning to ensure reliable final results, enhancing robustness and interpretability.
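
The plan-execute-verify loop described above can be sketched with a small DAG runner. This is a simplified illustration under assumptions: executors are plain Python callables rather than LLM agents, the `Subtask` fields and the retry limit are hypothetical, and dynamic replanning is reduced to raising an error.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List

Output = Dict[str, Any]

@dataclass
class Subtask:
    name: str
    run: Callable[[Dict[str, Output], int], Output]  # (inputs, attempt) -> output
    verify: Callable[[Output], bool]                  # paired verification function
    deps: List[str] = field(default_factory=list)

def execute_plan(tasks: List[Subtask], max_retries: int = 2) -> Dict[str, Output]:
    """Run subtasks in dependency order, retrying any output that fails its VF."""
    by_name = {t.name: t for t in tasks}
    order = TopologicalSorter({t.name: t.deps for t in tasks}).static_order()
    results: Dict[str, Output] = {}
    for name in order:
        task = by_name[name]
        inputs = {dep: results[dep] for dep in task.deps}
        for attempt in range(max_retries + 1):
            output = task.run(inputs, attempt)
            if task.verify(output):
                results[name] = output
                break
        else:
            raise RuntimeError(f"subtask {name!r} failed verification")
    return results

tasks = [
    Subtask(
        name="extract",
        run=lambda inputs, attempt: {"numbers": [1, 2, 3]},
        verify=lambda out: isinstance(out.get("numbers"), list),
    ),
    Subtask(
        name="sum",
        # The first attempt returns a string on purpose to trigger one retry.
        run=lambda inputs, attempt: {
            "total": str(sum(inputs["extract"]["numbers"])) if attempt == 0
            else sum(inputs["extract"]["numbers"])
        },
        verify=lambda out: isinstance(out.get("total"), int),
        deps=["extract"],
    ),
]
results = execute_plan(tasks)
```

Keeping each verification function next to its subtask means a failure is caught where it occurs, before a malformed output can reach downstream subtasks.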

Structured Debate Improves Corporate Credit Reasoning in Financial AI

KPD-MADS (Karl Popper Debate-based Multi-Agent Debate System): introduces a framework for corporate credit reasoning that formalizes adversarial verification via a ten-step structured interaction protocol, including a debate subsystem with six LLM agents and an aggregator subsystem.
The framework leverages a shared knowledge pool and web search capabilities to enable agents to reason over non-financial indicators and generate comprehensive, balanced analytical reports.
KPD-MADS demonstrates superior reasoning quality and practical applicability compared to a single-agent system (NAS) by enhancing analytical rigor through structured agent interaction and iterative refinement of arguments.

Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models

MTI V.1 (Malicious Token Injection V.1): introduces a modular framework for cache-side attacks, perturbing stored key-value representations during LLM inference using Corruption Mechanisms (perturb stored key vectors), Adaptive Perturbations (optimizes injected noise), and Control Dimensions (tune perturbation characteristics), to systematically bias attention maps and alter downstream token predictions.
The framework demonstrates that even small, structured perturbations to the KV cache can significantly degrade LLM performance across various tasks, including classification, question answering, summarization, RAG, and agentic reasoning.
MTI V.1 establishes cache integrity as a critical, yet underexamined, vulnerability in LLM deployments, providing a reproducible threat model for future robustness research and highlighting the need for verifiable cache integrity.
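
The effect described above, where a change to a stored key shifts the attention map, can be reproduced on a toy example. This is a minimal sketch and not the MTI framework: it uses a single query, two cached 2-D keys, and a fixed hand-picked additive perturbation rather than an optimized one.

```python
import math
from typing import List

Vector = List[float]

def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Scaled dot-product attention weights of one query over the cached keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def perturb_key(keys: List[Vector], index: int, delta: Vector) -> List[Vector]:
    """Return a copy of the key cache with `delta` added to one stored key."""
    corrupted = [list(key) for key in keys]
    corrupted[index] = [k + d for k, d in zip(corrupted[index], delta)]
    return corrupted

query = [1.0, 0.0]
cache = [[1.0, 0.0], [0.0, 1.0]]                  # the query matches key 0
clean = attention_weights(query, cache)
corrupted = attention_weights(query, perturb_key(cache, 1, [2.0, 0.0]))
```

In the clean cache most of the attention goes to key 0; after the perturbation key 1 receives the larger weight, even though the query and the model weights are unchanged.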

Consistent Zero-Shot Imitation with Contrastive Goal Inference

CIRL (Consistent Zero-Shot Imitation with Contrastive Goal Inference): introduces a method for pre-training interactive agents in a self-supervised fashion, combining goal-conditioned contrastive RL pre-training, automatic goal sampling, and a mean field goal inference model to enable zero-shot imitation of expert demonstrations.
The framework allows agents to autonomously propose and practice reaching their own goals during training, then at test time, infer an expert's goal from a single demonstration and execute a learned goal-conditioned policy to achieve it.
By reframing inverse RL as a goal inference problem and coupling it with contrastive RL, CIRL learns transferable goal-conditioned policies that generalize across diverse task distributions without requiring expert data or rewards during pre-training.

FineVision: Open Data Is All You Need

FineVision: introduces a meticulously collected, curated, and unified corpus of 24 million samples, utilizing a semi-automated, human-in-the-loop pipeline that includes raw sources ingestion, canonicalization, image and text cleaning, de-duplication, test-set decontamination, per-turn quality assessment, human checkpoints, LLM/VLM-as-a-judge, SSCD embeddings, LLM agents (GPT/Claude), a unified conversational schema, and a unified action space.
The framework addresses the fragmentation and contamination of public datasets for vision-language models (VLMs) by unifying over 200 sources into 185 subsets with rigorous data hygiene and quality control.
FineVision enables state-of-the-art performance for models trained on it, outperforming existing open mixtures across a broad evaluation suite, and supports novel GUI/agentic capabilities through its unified action space.

19th October 2025

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B (Agentic Large Language Models for Autonomous Data Science): introduces an agentic LLM for autonomous data science, capable of end-to-end pipeline completion from data sources to analyst-grade reports, utilizing a curriculum-based agentic training paradigm and data-grounded trajectory synthesis.
The framework employs a curriculum-based agentic training paradigm that emulates human data scientists' learning trajectory, progressively acquiring and integrating multiple capabilities in real-world environments.
DeepAnalyze-8B leverages a data-grounded trajectory synthesis framework to construct high-quality training data, enabling autonomous orchestration and adaptive optimization for complex data science tasks.

Agentic Inequality

Agentic Inequality Framework: introduces "agentic inequality," defining it as potential disparities in power, opportunity, and outcomes from differential access to and capabilities of AI agents, analyzed through the dimensions of agent availability, quality, and quantity.
This framework distinguishes agentic inequality from prior technological divides by highlighting novel power asymmetries created by scalable goal delegation and direct agent-to-agent competition.
The paper further explores the technical and socioeconomic drivers shaping agentic power distribution and proposes a research agenda for governing these complex challenges.

A Comprehensive Survey on World Models for Embodied AI

Unified Framework for World Models: introduces a comprehensive survey on world models in embodied AI, proposing a three-axis taxonomy including Functionality (decision-coupled/general-purpose), Temporal Modeling (sequential simulation/global difference prediction), and Spatial Representation (global latent vector/token feature sequence/spatial latent grid/decomposed rendering).
The survey formalizes problem settings and learning objectives, systematizes data resources and metrics, and quantitatively compares state-of-the-art models.
It distills key open challenges such as data scarcity, evaluation metrics, computational efficiency, and long-horizon temporal consistency, while providing a curated bibliography.

ReclAIm: A multi-agent framework for degradation-aware performance tuning of medical imaging AI

ReclAIm (A multi-agent framework for degradation-aware performance tuning of medical imaging AI): introduces a multi-agent framework for autonomously monitoring, evaluating, and fine-tuning medical image classification models, built on an LLM core and operating through natural language interaction.
The framework includes a Master Agent, an Image classification agent, a Performance comparison agent, and a Fine-tuning agent, each equipped with specialized toolkits, an LLM Core, a System Prompt, and Memory, interacting with a User.
ReclAIm enables automated, continuous maintenance of medical imaging AI models in a user-friendly and adaptable manner, facilitating broader adoption in research and clinical environments by addressing performance degradation.

TACLA: An LLM-Based Multi-Agent Tool for Transactional Analysis Training in Education

TACLA (Transactional Analysis Contextual LLM-based Agents): introduces a novel Multi-Agent architecture designed to simulate nuanced human social dynamics with psychological depth and consistent persona behavior, integrating Parent, Adult, and Child ego states, an Orchestrator Agent, and dedicated memory for authentic responses.
Each TACLA agent is modeled as a combination of Parent, Adult, and Child Ego State Agents, each with its own Contextual Pattern Memory and TA-informed reasoning capabilities, orchestrated by an LLM-based agent that prioritizes ego state activation based on contextual triggers and life scripts.
The framework is validated in an educational scenario for teacher training, demonstrating realistic ego state shifts in Student Agents and effectively modeling conflict de-escalation and escalation based on different teacher intervention strategies, with a Feedback Agent providing expert-level analysis.

EEschematic: Multimodal-LLM Based AI Agent for Schematic Generation of Analog Circuit

EEschematic: introduces an AI agent for automatic analog schematic generation, integrating textual, visual, and symbolic modalities, few-shot substructure examples, and a Visual Chain-of-Thought strategy for iterative refinement.
The framework translates SPICE netlists into human-editable schematic diagrams by leveraging an MLLM to analyze circuit substructures, generate initial placements, and optimize wiring.
It employs a VCoT prompting loop, comparing current schematics with reference examples and using result history to continuously improve visual quality and structural correctness.

STARK: Strategic Team of Agents for Refining Kernels

STARK (Strategic Team of Agents for Refining Kernels): introduces a multi-agent LLM framework for GPU kernel optimization, systematically exploring the design space through collaborative agents, grounded instruction, dynamic context management, and strategic search.
This framework mimics expert engineer workflows, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and iteratively refine kernels for substantial performance improvements.
STARK achieves up to 16x faster runtime performance over baseline agents on KernelBench, demonstrating the potential of agentic LLM frameworks for automated GPU kernel optimization.

Lark: Biologically Inspired Neuroevolution for Multi-Stakeholder LLM Agents

Lark (Biologically Inspired Neuroevolution for Multi-Stakeholder LLM Agents): introduces a biologically inspired decision-making framework that couples LLM-driven reasoning with an evolutionary, stakeholder-aware Multi-Agent System, integrating plasticity, duplication and maturation, ranked-choice stakeholder aggregation, and compute awareness via token-based penalties.
The system iteratively proposes diverse strategies, applies plasticity tweaks, simulates stakeholder evaluations, aggregates preferences, selects top candidates, and performs duplication/maturation while factoring compute cost into final scores.
Lark operates in a discrete generational paradigm, evolving populations of candidate strategies through selection, mutation, and specialization, making it suitable for multi-stakeholder strategy generation that lacks sequential state transitions and immediate reward signals.
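
The aggregation and compute-awareness steps can be sketched as follows. This is a minimal illustration, assuming a Borda count as the ranked-choice rule and a linear per-token penalty. Both choices, along with the candidate names, rankings, and token costs, are hypothetical stand-ins rather than the paper's exact formulation.

```python
def borda_scores(rankings, candidates):
    """Aggregate stakeholder rankings with a Borda count (assumed rule):
    a candidate ranked at position p among n candidates earns n - 1 - p points."""
    n = len(candidates)
    scores = {c: 0.0 for c in candidates}
    for ranking in rankings:
        for position, cand in enumerate(ranking):
            scores[cand] += n - 1 - position
    return scores

def final_scores(rankings, token_costs, penalty_per_token=0.001):
    """Subtract a token-based compute penalty from the aggregated preference score."""
    candidates = list(token_costs)
    aggregated = borda_scores(rankings, candidates)
    return {c: aggregated[c] - penalty_per_token * token_costs[c] for c in candidates}

# Three simulated stakeholders rank three candidate strategies (hypothetical data).
rankings = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
token_costs = {"A": 3000, "B": 500, "C": 800}

scores = final_scores(rankings, token_costs)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy run strategy A wins on preferences alone, but its higher token cost lets the cheaper strategy B come out ahead once the penalty is applied.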

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

VAGEN (Reinforcing World Model Reasoning for Multi-Turn VLM Agents): introduces a multi-turn reinforcement learning framework that enhances VLM agents' visual state reasoning by building internal world models through explicit State Estimation (describes current visual state) and Transition Modeling (predicts next visual state), optimized via WorldModeling Reward (dense reward for state predictions) and Bi-Level General Advantage Estimation (turn-aware credit assignment).
The framework formulates the problem as a Partially Observable Markov Decision Process (POMDP) and systematically compares five reasoning strategies, demonstrating that explicit visual state reasoning significantly improves task performance.
VAGEN achieves superior performance over untrained counterparts and proprietary models by integrating structured reasoning, task-dependent visual state representations, and hierarchical credit assignment for robust multi-turn VLM agent training.

More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents

Turn-Control Strategies for Coding Agents: introduces an empirical study evaluating the impact of various turn-control strategies on the performance and cost of LLM-powered coding agents, including an unrestricted baseline, fixed-turn limits, and a novel dynamic-turn strategy.
The study identifies a "sweet spot" for fixed-turn limits at the 75th percentile, significantly reducing costs with minimal impact on solve rates, and demonstrates the superiority of the dynamic-turn strategy in balancing efficacy and economic efficiency.
This research provides practical guidelines for deploying powerful yet economically viable coding agents by intelligently managing resource allocation through turn control.
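
As a rough illustration of the fixed-turn idea, the sketch below derives a turn cap from the 75th percentile of turn counts and checks which runs would have fit under it. It assumes the percentile is taken over turn counts from unrestricted baseline runs and uses the nearest-rank method. The turn counts and function names are hypothetical and not taken from the study.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

def fixed_turn_limit(baseline_turns, pct=75):
    """Derive a fixed turn cap from turn counts of unrestricted baseline runs."""
    return percentile_nearest_rank(baseline_turns, pct)

# Hypothetical turn counts from eight unrestricted runs.
baseline_turns = [12, 18, 20, 25, 31, 40, 55, 90]
limit = fixed_turn_limit(baseline_turns)

# Runs that would have finished within the cap.
within_cap = [t for t in baseline_turns if t <= limit]
print(limit, len(within_cap))
```

The cap cuts off only the long tail of expensive runs, which is the intuition behind trading a small loss in solve rate for a large cost saving.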

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

RL-based Agentic Search: introduces a paradigm where LLMs act as autonomous decision-making agents, capable of planning, retrieving, and reflecting through multi-step interaction with search environments, leveraging reinforcement learning for adaptive and self-improving search behavior.
This survey comprehensively overviews RL-based agentic search, categorizing its functional roles, optimization strategies, and application scopes, while highlighting key components like the agent, environment, tools, and memory.
The approach addresses LLM limitations such as static knowledge and factual hallucinations by enabling dynamic query refinement, adaptive retrieval strategies, and integration with diverse external knowledge sources and tools.

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

This survey introduces the paradigm shift in agentic AI from the Pipeline-based Agentic AI Paradigm, where planning, tool use, and memory are externally orchestrated, to the Model-Native Agentic AI Paradigm, where these capabilities are internalized within the model's parameters, driven by Reinforcement Learning (RL).
The Model-Native Agentic AI Paradigm reframes LLMs as autonomous decision-makers that learn to generate plans, invoke tools, and manage memory as intrinsic behaviors, enhancing adaptability and robustness in open environments.
The paper systematically reviews the evolution of core capabilities and agent applications, such as Deep Research and GUI agents, and discusses emerging model-native capabilities like multi-agent collaboration and reflection.

An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems

AFL (Agentic Framework with LLMs): introduces a fully automated, self-contained framework for solving complex Vehicle Routing Problems (VRPs) end-to-end, utilizing its Problem Description, Code Generation, and Solution Derivation subtasks, along with Generation, Judgment, Revision, and Error Analysis Agents, a Buffer, VRP Instance input, generated Code, Python execution, and Error handling.
The framework extracts domain knowledge from raw VRP instance inputs to guide self-contained code generation, eliminating reliance on handcrafted modules or external solvers.
AFL's specialized LLM agents collaborate to ensure cross-functional consistency, logical soundness, and constraint satisfaction, achieving high code reliability and solution feasibility.

18th October 2025

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Quitting Mechanism: introduces a behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence, leveraging the ToolEmu framework with various prompting strategies and with safety and helpfulness evaluators.
The study evaluates LLM agents across 144 multi-turn scenarios, comparing baseline agents against simple and specified quit-enabled variants to assess safety-helpfulness trade-offs.
Results indicate that explicit quit instructions, particularly the specified quit prompt, significantly improve agent safety with minimal impact on helpfulness, establishing quitting as a first-line defense.

Explainability Requirements as Hyperproperties

YLTL (whY Linear-time Temporal Logic): introduces a formal framework for specifying and verifying explainability requirements in multi-agent systems, combining Lewis' counterfactuals (causal dependencies), Linear-time temporal logic (temporal reasoning), Knowledge modality (agent knowledge), Past-operators (past-time reasoning), and a Similarity relation (agent-specific causal models).
The framework enables automated verification of explainability requirements through a Model-checking algorithm, which relies on a Translation function mapping YLTL formulas to Extended Monadic First-Order Logic (FO[<,E]) for hyperproperty specification.
This approach allows for formalizing various notions of explainability, such as internal, external, general, and weak counterfactual explainability, and proves the decidability of the model-checking problem for finite-state multi-agent systems.

ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents

ATA (Autonomous Trustworthy Agents): introduces a generic neuro-symbolic approach that decouples tasks into offline knowledge ingestion and online task processing, addressing LLM limitations in trustworthiness for high-stakes domains.
The knowledge ingestion phase uses an LLM to translate informal problem specifications into a formal, symbolic knowledge base, which human experts can verify and refine for correctness and domain alignment.
The task processing phase encodes incoming natural language input into the formal language via an LLM, enabling a symbolic decision engine to derive reliable, transparent, and auditable results using the formal knowledge base.
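
The online phase can be pictured with a toy symbolic engine. The sketch below is a minimal illustration, assuming the knowledge base is a set of if-then rules over string facts and that the LLM encoding step has already produced the input facts. The rules, fact names, and forward-chaining engine are hypothetical and are not the paper's formal language.

```python
def forward_chain(facts, rules):
    """Derive all conclusions reachable from `facts` and record each
    rule application so the result can be audited afterwards."""
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                trace.append((tuple(premises), conclusion))
                changed = True
    return known, trace

# Knowledge base that a human expert could review offline (hypothetical rules).
rules = [
    (["income_verified", "credit_score_ok"], "low_risk"),
    (["low_risk", "amount_within_limit"], "approve_loan"),
]

# Facts an LLM might have extracted from a natural-language request (stubbed here).
facts = ["income_verified", "credit_score_ok", "amount_within_limit"]

known, trace = forward_chain(facts, rules)
print("approve_loan" in known, trace)
```

Because every derived fact carries the rule that produced it, the decision is reproducible and can be inspected step by step, unlike a free-form LLM answer.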

Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

DiMo (Multi-Agent Collaboration Framework for Diverse Thinking Modes): introduces a multi-agent debate framework that enhances LLM performance and interpretability by simulating structured debate among specialized LLM agents, including a Generator, Evaluator, Divergent Thinking Mode, Knowledge Supporter, Reasoning Path Provider, Logical Thinking Mode, Refiner, and Judger.
The framework incorporates two distinct thinking modes—Divergent for commonsense reasoning and Logical for mathematical reasoning—to optimize problem-solving based on task requirements.
DiMo generates explicit, human-auditable reasoning paths, improving LLM interpretability and transparency by externalizing hypotheses, supportive knowledge, and step-wise refinements.

CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation

CodeCRDT (Observation-Driven Coordination for Multi-Agent LLM Code Generation): introduces an observation-driven coordination pattern for multi-agent LLM code generation, where agents coordinate by monitoring a shared CRDT state for lock-free, conflict-free concurrent code generation, leveraging an Inference Service, Yjs Document, Outliner Agent, Implementation Agents, Evaluator Agent, TODO Observer, Frontend, WebSocket, and Hocuspocus WebSocket relay.
This approach leverages Conflict-Free Replicated Data Types (CRDTs) to ensure strong eventual consistency and deterministic convergence, enabling parallel execution without explicit message passing among LLM agents.
Empirical evaluation demonstrates that CodeCRDT achieves parallel speedups for most tasks after normalizing for code volume, while also revealing emergent behaviors like code inflation and semantic conflicts.
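
The convergence property that CRDTs provide can be shown with a very small example. The sketch below is a minimal illustration using a last-writer-wins map, one of the simplest CRDTs. It is not the Yjs document model used by CodeCRDT, and the slot names, agent ids, and timestamps are hypothetical.

```python
def merge(a, b):
    """Merge two last-writer-wins map replicas. For each key, keep the
    entry with the larger (timestamp, agent_id) tag. The operation is
    commutative, associative, and idempotent, so replicas converge."""
    merged = dict(a)
    for key, entry in b.items():
        if key not in merged or entry[:2] > merged[key][:2]:
            merged[key] = entry
    return merged

# Each entry is (timestamp, agent_id, code) for one TODO slot.
replica_1 = {
    "todo:parse": (1, "agent-1", "def parse(): ..."),
    "todo:render": (2, "agent-1", "def render(): pass"),
}
replica_2 = {
    "todo:render": (3, "agent-2", "def render(): return '<div/>'"),
    "todo:test": (1, "agent-2", "def test(): ..."),
}

# Merging in either order yields the same state.
left = merge(replica_1, replica_2)
right = merge(replica_2, replica_1)
print(left == right, sorted(left))
```

Agents that observe the merged state see the same winner for every slot without exchanging explicit messages. Note that this guarantees only that the text converges, which is why semantic conflicts between independently written pieces of code can still appear.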

Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis

MA-SAPO (Multi-Agent Score-Aware Prompt Optimization): introduces a multi-agent framework for prompt optimization that explicitly links evaluation outcomes with structured reasoning to guide systematic edits, utilizing Metric Explainer, Diagnostician, and Action Synthesizer Agents in a Reasoning Phase, and Retriever, Analyzer, and Refiner Agents in a Test Phase.
The framework operates in two phases: a Reasoning Phase where agents collaboratively explain scores, diagnose weaknesses, and synthesize targeted refinements into reusable reasoning assets, and a Test Phase where agents retrieve these assets to analyze and apply evidence-grounded edits to new prompts.
This approach enhances interpretability, auditability, and controllability of prompt refinements by transforming evaluation signals into explicit reasoning chains, leading to consistent performance improvements while reducing computational costs.

Ripple Effect Protocol: Coordinating Agent Populations

REP (Ripple Effect Protocol): introduces a coordination protocol for LLM-based agents, enabling them to share decisions and lightweight textual sensitivities that ripple through local networks, facilitating faster and more stable alignment than decision-only communication.
The protocol formalizes message schemas and aggregation rules, decoupling agent cognition from coordination, and supports diverse LLM architectures and hybrid rule-based systems.
REP significantly improves coordination accuracy and efficiency across domains like supply chain, resource allocation, and preference aggregation, providing scalable infrastructure for the Internet of Agents.

LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs

LANPO (Language-And-Numerical Policy Optimization): introduces a framework that synergistically bootstraps language and numerical feedback to enhance LLM learning efficiency, utilizing a Policy LLM, Reward Model, Critic, Experience Pool, Inter-Sample Exploration Module with Retrieval and Relevance Evaluation, Intra-Sample Exploration Module with Reward-Agnostic Reflection, Feedback Summarizer, Parameter Update, and Context Update.
The framework addresses information leakage and behavior collapse by separating feedback roles: language guides exploration via context updates, while numerical rewards drive robust policy optimization through parameter updates.
LANPO builds a dynamic experience pool from past trials and employs Reward-Agnostic Reflection for safe intra-sample self-correction and Relevant Abstraction for generalizable inter-sample lessons.

Declarative Techniques for NL Queries over Heterogeneous Data

siwarex: introduces a declarative system for handling natural language queries over heterogeneous data sources, leveraging its Abstract Schema, API Mapping Schema, DB Table View, Text2SQL module, Query Rewriter, User-Defined Functions, and LLM to unify database tables and APIs within a SQL framework.
The framework translates natural language questions into SQL queries by representing both database tables and external APIs as virtual tables in a unified relational schema, which are then processed by a query rewriter to invoke APIs via UDFs.
This approach significantly outperforms imperative code generation and agent-based methods in coping with data source heterogeneity, as demonstrated on two new benchmarks.
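
The idea of exposing an API inside SQL can be sketched with SQLite. This is a minimal illustration, assuming the rewritten query calls a scalar user-defined function that wraps the API. The table, the `api_rate` function, and the exchange rates are hypothetical, and the real system models APIs as virtual tables rather than scalar functions.

```python
import sqlite3

def fetch_exchange_rate(currency):
    """Stand-in for an external API call (hypothetical fixed rates)."""
    rates = {"EUR": 1.1, "GBP": 1.3}
    return rates.get(currency)

conn = sqlite3.connect(":memory:")

# Register the API wrapper as a one-argument SQL function.
conn.create_function("api_rate", 1, fetch_exchange_rate)

conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, currency TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "EUR"), (2, 50.0, "GBP")],
)

# A query that mixes stored data with API results in plain SQL.
rows = conn.execute(
    "SELECT id, amount * api_rate(currency) AS usd FROM orders ORDER BY id"
).fetchall()
print(rows)
```

Keeping the API call behind a SQL function lets a Text2SQL model target one declarative language instead of generating glue code for each data source.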

RGMem: Renormalization Group-based Memory Evolution for Language Agent User Profile

RGMem (Renormalization Group-based Memory Evolution for Language Agent User Profile): introduces a self-evolving memory framework for LLM-based conversational systems, with D_raw, f_seg, f_synth, D_L0, A_fact, A_base, A_rel, f_extract, G, V, V_abs, V_gen, V_inst, E, E_cls, E_evt, G(t), RK1, LLM, θ_inf, RK2, P, S, θ_sum, RK3, Synergy/Tension Analysis, Dirty-flag Propagation Mechanism, G*, Σ, Δ, Query, f_retr, Context Aggregation & Output, Macroscopic Theory (Σ, Δ), Mesoscopic Theory (T(1)), and Microscopic Evidence (A_fact, A_base) components, which organizes dialogue history across multiple scales to form a dynamically-evolved user profile.
The framework leverages renormalization group principles to extract semantics and user insights from episodic fragments, progressively forming a multi-scale user profile through hierarchical coarse-graining and rescaling operations.
This approach enables multi-granularity retrieval, coordinating detailed and abstract memories to boost cross-session continuity and personalized interactive capabilities for Language Agents.

Integrating LLM and Diffusion-Based Agents for Social Simulation

LLM-empowered Hybrid Simulation Agent Framework: introduces a hybrid simulation approach for social information diffusion prediction, integrating LLM-based agents (simulate core users) for semantic reasoning with diffusion model-based agents (predict remaining users) for efficient population-level prediction.
This framework employs LLM-based agents to simulate a core subset of users with rich semantic reasoning, while a diffusion model handles the remaining population efficiently, both incorporating user personalization, social influence, and content awareness.
The modular design enables a topic-aware, personalized, and collaborative simulation, addressing computational costs of LLMs at scale and cold-start problems of traditional diffusion models.

What Limits Agentic Systems Efficiency?

SpecCache: introduces a caching framework augmented with speculative execution to mitigate web environment latency in web-interactive agentic systems, including a Model Input, Reasoning (Target Model), Action (Target Model), Observation, Candidate Actions (Draft Model), Cache Pool, Cache Hit, and Cache Miss.
The framework decouples and overlaps model reasoning with environment interaction by using a draft model to predict future actions and proactively populate an action-observation cache.
This approach significantly reduces wall-clock latency and web environment overhead without compromising task success rates, achieving up to 58x improvement in cache hit rate and 3.2x reduction in web environment overhead.
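
The interplay between the draft model and the cache can be sketched as follows. This is a minimal, sequential illustration, assuming that environment observations are safe to fetch ahead of time and to reuse. The class, the stand-in functions, and the action strings are hypothetical, and a real system would run the prefetch concurrently with the target model's reasoning.

```python
class SpecCache:
    """Action-observation cache filled speculatively from draft-model guesses."""

    def __init__(self, env_fetch, draft_predict):
        self.env_fetch = env_fetch          # slow environment call
        self.draft_predict = draft_predict  # cheap guess at upcoming actions
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, state):
        """Populate the cache with observations for the draft model's candidate actions."""
        for action in self.draft_predict(state):
            if action not in self.cache:
                self.cache[action] = self.env_fetch(action)

    def observe(self, action):
        """Return the observation for the target model's chosen action."""
        if action in self.cache:
            self.hits += 1
            return self.cache[action]
        self.misses += 1
        observation = self.env_fetch(action)
        self.cache[action] = observation
        return observation

calls = []

def env_fetch(action):
    """Stand-in for a slow web environment call (hypothetical)."""
    calls.append(action)
    return f"page after {action}"

def draft_predict(state):
    """Stand-in for a draft model proposing likely next actions (hypothetical)."""
    return ["click:search", "click:login"]

cache = SpecCache(env_fetch, draft_predict)
cache.prefetch("home")                     # would overlap with target-model reasoning
obs_hit = cache.observe("click:login")     # predicted by the draft model
obs_miss = cache.observe("click:cart")     # not predicted, falls back to the environment
print(cache.hits, cache.misses)
```

A hit means the environment round trip was already paid for while the target model was still reasoning, which is where the wall-clock saving comes from.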

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Branch-and-Browse: introduces a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution, including a subtask manager, tree exploration, nearest-URL state replay, background reasoning, and page action memory.
This framework employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning and efficient backtracking, while leveraging web state replay and background reasoning to accelerate exploration.
A page action memory mechanism further enhances efficiency by sharing explored actions and contextual information across branches and sessions, reducing redundancy and improving decision-making.

17th October 2025

Agentic AI for Ultra-Modern Networks: Multi-Agent Framework for RAN Autonomy and Assurance

Multi-Agent Framework for RAN Autonomy and Assurance: introduces a distributed multi-agent architecture for RAN autonomy and assurance, featuring an Orchestrator Agent (manages workflow, resolves conflicts), Data Collector Agent (collects, validates raw telemetry), Preprocessor and Feature Agent (cleans, engineers data features), Model Trainer Agent (trains, optimizes AI/ML models), Model Validator Agent (evaluates, approves AI/ML models), Predictor Agent (forecasts KPIs, simulates scenarios), Policy Generator Agent (formulates network policies), Simulator/Baseline Agent (generates reference KPI trajectories), Verifier Agent (compares policies, enforces safety), Drift Detector Agent (detects model drift, triggers retraining), Deployment Agent (deploys verified network policies), Audit and Explainability Agent (generates audit reports, explanations), Security Agent (secures inter-agent communications), and Inter-Agent Communication (enables agent coordination).
This framework replaces centralized RIC-based control with specialized, collaborative agents to ensure autonomy, resilience, explainability, and system-wide safety in Beyond 5G/6G networks.
The architecture prevents unsafe policy deployments by incorporating independent verification and assurance stages, safeguarding global network health against model drift and unforeseen conditions.

PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

PolySkill (Polymorphism-Guided Agent Skill Induction): introduces a novel framework enabling web agents to learn generalizable and compositional skills by decoupling abstract goals from concrete implementations, utilizing an LM Policy, Working Memory, Dynamic Skill Library, Abstract Classes, Concrete Subclasses, an LLM-based Induction Module, and an LM Judge.
The framework organizes skills into a domain-driven hierarchy, where abstract classes define common interfaces for categories like shopping sites, and concrete subclasses provide website-specific implementations, promoting skill reuse and cross-domain generalization.
PolySkill enhances continual learning by guiding agents to discover and refine skills autonomously in task-free settings, leading to improved task success rates and reduced execution steps across diverse web environments.
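
The split between abstract classes and concrete subclasses maps directly onto ordinary object-oriented code. The sketch below is a minimal illustration, assuming skills return lists of low-level browser actions as strings. The class names, selectors, and action strings are hypothetical and not taken from the paper's skill library.

```python
from abc import ABC, abstractmethod

class ShoppingSite(ABC):
    """Abstract skill interface shared by a category of websites."""

    @abstractmethod
    def search(self, query):
        """Return low-level actions that search for `query`."""

    @abstractmethod
    def add_to_cart(self, item):
        """Return low-level actions that add `item` to the cart."""

    def buy(self, query):
        """Compositional skill written once against the abstract interface."""
        return self.search(query) + self.add_to_cart(query)

class ShopA(ShoppingSite):
    """Website-specific implementation (hypothetical selectors)."""

    def search(self, query):
        return [f"type #search-box {query}", "click #go"]

    def add_to_cart(self, item):
        return ["click .add-to-cart"]

class ShopB(ShoppingSite):
    """A second site with a different page layout (hypothetical selectors)."""

    def search(self, query):
        return [f"fill input[name=q] {query}", "press Enter"]

    def add_to_cart(self, item):
        return ["click button#buy", "click button#confirm"]

# The same high-level skill runs unchanged on both sites.
plans = {type(site).__name__: site.buy("usb cable") for site in (ShopA(), ShopB())}
print(plans)
```

Supporting a new shopping site then means writing only the two site-specific methods, while `buy` and any other composed skills are inherited.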

Paper2Web: Let's Make Your Paper Alive!

PWAGENT (Paper-to-Web Agent): introduces a multi-agent framework for transforming academic papers into interactive, multimedia-rich project homepages, utilizing Docling (PDF to Markdown converter), an LLM (extracts metadata/structures content), Construct (combines decomposed assets), an MCP Resource Repository (stores structured paper assets), an MLLM as Orchestrator (assesses webpage/invokes tools), and MCP tool use (accesses repository/edits webpage).
This framework addresses limitations of current methods by decomposing papers into structured assets, ingesting them into a resource repository, and iteratively refining webpage content and layout through an MLLM-orchestrated process.
PWAGENT outperforms baselines in academic webpage generation, achieving high presentation quality at low cost.

VISTA: A Test-Time Self-Improving Video Generation Agent

VISTA (Video Iterative Self-improvemenT Agent): introduces a novel multi-agent system that autonomously improves text-to-video generation by refining prompts in an iterative loop, including Structured Video Prompt Planning (transforms user input), Pairwise Tournament Selection (identifies best video-prompt pair), Multi-Dimensional Multi-Agent Critiques (MMAC) (generates nuanced critiques), and Deep Thinking Prompting Agent (DTPA) (refines prompt iteratively).
The framework decomposes user ideas into structured temporal plans, identifies the best video through a robust pairwise tournament, critiques it using specialized agents focusing on visual, audio, and contextual fidelity, and then synthesizes feedback to enhance prompts for subsequent generation cycles.
VISTA consistently improves video quality and alignment with user intent, achieving a pairwise win rate of up to 60% against state-of-the-art baselines and demonstrating scalability with increased test-time computation.
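
The selection step can be sketched as a knockout bracket over candidate videos. This is a minimal illustration, assuming a judge that returns the preferred candidate of each pair. The bracket layout, the judge function, and the scores are hypothetical, and in practice the judge would be a multimodal model comparing two videos rather than a numeric comparison.

```python
def pairwise_tournament(candidates, judge):
    """Knockout selection: compare candidates in pairs and advance the
    winners until one remains. An unpaired candidate gets a bye."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out advances unchallenged
        pool = next_round
    return pool[0]

def judge(a, b):
    """Stand-in judge that prefers the higher hypothetical quality score."""
    return a if a["score"] >= b["score"] else b

# Hypothetical video-prompt pairs with made-up quality scores.
candidates = [
    {"prompt": "v1", "score": 0.4},
    {"prompt": "v2", "score": 0.7},
    {"prompt": "v3", "score": 0.9},
    {"prompt": "v4", "score": 0.5},
    {"prompt": "v5", "score": 0.6},
]
winner = pairwise_tournament(candidates, judge)
print(winner["prompt"])
```

A bracket needs one comparison fewer than the number of candidates, which keeps judge calls low. A noisy judge may still eliminate a strong candidate early, so a more robust variant might repeat or swap the order of each comparison.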

AURA: An Agent Autonomy Risk Assessment Framework

AURA (Agent aUtonomy Risk Assessment): introduces a unified framework designed to detect, quantify, and mitigate risks from agentic AI, incorporating an LLM Parser, LLM Dimensions, LLM Scorer, LLM Mitigator, Memory Unit, HITL, and A2H Control to provide robust risk assessment and mitigation.
The framework supports both synchronous and autonomous modes, enabling agents to self-assess and mitigate risks during operation, while also allowing human oversight and intervention.
AURA balances risk assessment accuracy with computational efficiency through gamma-based scoring and memory-driven optimization, ensuring governable and transparent AI agent deployment.

Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions

Multi-dimensional Data Analysis Framework: introduces a dynamic, collaborative analytical ecosystem that integrates LLM agents and Knowledge Graphs (KGs) for multi-dimensional data analysis, featuring a Data Preparation Module, Knowledge Representation Module, Visualization and Interaction Module, and Intelligent Analysis Module.
The framework enables LLM agents to automatically extract product data, construct and visualize KGs in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform.
This approach achieves bidirectional dynamic interaction between LLM agents and KGs, where agents build and enrich the KG, and the visualized KG provides context for the agents' in-depth analysis.

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

freephdlabor: introduces a multiagent framework for continual and interactive science automation, featuring a ManagerAgent, IdeationAgent, ExperimentationAgent, ResourcePreparationAgent, WriteupAgent, ReviewerAgent, Shared Workspace, Workspace System, Prompt Optimization Mechanisms, Context Compaction, Memory Persistence, and Real-Time Human Intervention, enabling dynamic workflows and robust communication for scientific discovery.
The framework addresses limitations of existing agentic systems by providing fully dynamic workflows determined by real-time agent reasoning and a modular architecture for seamless customization and human-in-the-loop capabilities.
It provides comprehensive infrastructure for automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms, transforming automated research into continual programs.

SHARE: Scene-Human Aligned Reconstruction

SHARE (Scene-Human Aligned REconstruction): introduces a framework that reconstructs human motion and the surrounding environment from monocular videos, leveraging scene geometry for accurate 3D human placement.
The framework operates in three stages: initialization of point maps, human meshes, and masks; reconstruction of the background scene; and optimization of human meshes by grounding them to scene points.
SHARE achieves improved 3D human positioning and scene reconstruction, outperforming existing methods in quantitative metrics and demonstrating strong qualitative performance on diverse video data.

Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition

Three-Stage Framework for FM-driven Scientific Evolution: introduces a conceptual model describing the progressive integration of FMs into scientific discovery, encompassing Meta-Scientific Integration, Hybrid Human-AI Co-Creation, and Autonomous Scientific Discovery stages.
The framework posits that FMs transition from backend tools, to interactive collaborators, and finally to independent agents capable of end-to-end scientific discovery.
This evolution redefines scientific paradigms, shifting from human-guided processes to increasingly autonomous AI-driven knowledge generation.

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

PokeeResearch-7B: introduces a 7B-parameter deep research agent, trained with Reinforcement Learning from AI Feedback (RLAIF) using LLM-based reward signals, and featuring a robust chain-of-thought-driven multi-call reasoning scaffold with self-verification and adaptive recovery for tool-augmented research.
The agent operates through iterative research-verification cycles, leveraging specialized web searching and reading tools, and is built upon a Qwen2.5-7B-Instruct backbone LLM.
This approach achieves state-of-the-art performance on ten deep research benchmarks by optimizing for human-salient answer quality dimensions and maintaining robustness through verifiable reasoning.

Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL

Dialectica: introduces a framework where LLM agents engage in structured dialogue on defined topics, augmented by Agent Memory, Agent Reflection, and Context Evolution, with an Orchestrator managing the dialogue and an optional Facilitator guiding the discussion.
The framework views discussion as an implicit meta-reinforcement learning process, enabling agents to develop expertise and refine their prompt contexts through conversational feedback and self-reflection in non-verifiable domains.
This approach allows agents to improve their capabilities and produce more sophisticated outputs by iteratively updating their internal context based on dialogue experiences, without explicit reward signals.

ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

ProofOptimizer: introduces an LLM-based system for simplifying Lean proofs without human demonstrations, integrating a symbolic Lean linter, a finetuned 7B parameter language model, and an iterative inference-time algorithm.
The system is trained using expert iteration and online reinforcement learning, leveraging the Lean compiler for verification and reward signals, and employs inference-time techniques like Test-Time RL and proof repair.
ProofOptimizer significantly reduces proof length on various benchmarks, improving conciseness, execution speed, and downstream prover performance for AI-generated formal proofs.

SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation

SQuAI (Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation): introduces a scalable and trustworthy multi-agent RAG framework for scientific QA, which includes a Decomposer (decomposes complex queries into sub-questions), Hybrid Retrieval (selects top-k documents using sparse/dense models), a Generator (generates initial Q-A-E triplets), a Judge (evaluates Q-A-E triplets for relevance), and an Answer Generator (synthesizes final answer with citations).
The framework addresses key limitations of existing RAG systems in scholarly domains by enabling accurate answers, explicit claims with citations, and retrieval across millions of scientific documents.
SQuAI improves faithfulness, answer relevance, and contextual relevance by decomposing complex questions, adaptively filtering documents, and providing fine-grained in-line citations for transparent verification.

The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems

Spark agents: introduces a system of persona-conditioned LLM agents, instantiated through a library of role-inspired system prompts, to intentionally diversify agent behavior within a multi-agent workflow.
The system includes a Spark agent automation pipeline for data collection and retrieval-augmented grounding, and an LLM-as-a-judge protocol for evaluating creative diversity against human gold standards.
This approach achieved a mean diversity gain of +4.1 points on a 1-10 scale, significantly narrowing the gap to human experts and improving client-facing outputs.

KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

KITE (Korean Instruction-following Task Evaluation): introduces a comprehensive benchmark for evaluating LLMs' instruction-following capabilities in Korean, encompassing both general and Korean-specific instructions, validated through automated metrics and human assessments.
The benchmark includes KITE General, derived from translated English datasets, and KITE Korean, featuring specialized instructions like Acrostic Poem and Honorifics, designed to capture unique linguistic and cultural nuances.
This framework provides insights into LLM performance across diverse NLP tasks and models, aiming to foster research on culturally and linguistically inclusive LLM development for underrepresented languages.

The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling

SESA (SEquential SAmpling): introduces a two-stage framework for enhancing exploration in LLMs that mitigates entropy collapse by sequentially generating diverse solution sketches before expanding them into full reasoning paths; its components are PromptSketch (generates sketch prompt), Policy (πθ) (samples sketches/solutions), History of Sketches (S) (stores generated sketches), PromptSolve (generates solution prompt), Reward Function (R) (computes solution reward), All Candidates (Y) (stores solutions, rewards), Advantage Computation (calculates policy advantages), Loss Computation (computes policy loss), and Policy Update (adjusts policy parameters).
This approach conditions each new output on previous ones, promoting diversity and preventing policy collapse, leading to broader exploration and improved performance in RL-trained LLMs.
SESA consistently outperforms traditional RL methods in path diversity and recovery from collapse, significantly boosting success rates on agent benchmarks and real-world tasks.
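The two-stage loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable stands in for the policy, and the prompt wording, `toy_policy`, and `STRATEGIES` are hypothetical.

```python
from typing import Callable, List

def sequential_sketches(question: str,
                        generate: Callable[[str], str],
                        num_sketches: int) -> List[str]:
    """Stage 1: sample sketches one at a time, each conditioned on the history."""
    history: List[str] = []
    for _ in range(num_sketches):
        previous = "\n".join(f"- {s}" for s in history) or "(none)"
        prompt = (f"Question: {question}\n"
                  f"Previous sketches:\n{previous}\n"
                  "Propose a different solution sketch.")
        history.append(generate(prompt))
    return history

def expand_sketches(question: str,
                    sketches: List[str],
                    generate: Callable[[str], str]) -> List[str]:
    """Stage 2: expand every sketch into a full solution."""
    return [generate(f"Question: {question}\nSketch: {s}\nWrite the full solution.")
            for s in sketches]

# Hypothetical stand-in for an LLM policy: it returns the first strategy
# that does not already appear in the prompt, so conditioning on the
# history forces a new sketch each time.
STRATEGIES = ["induction", "contradiction", "direct computation"]

def toy_policy(prompt: str) -> str:
    if "Write the full solution." in prompt:
        sketch = prompt.split("Sketch: ")[1].split("\n")[0]
        return "solution via " + sketch
    for strategy in STRATEGIES:
        if strategy not in prompt:
            return strategy
    return STRATEGIES[-1]

sketches = sequential_sketches("Prove the claim.", toy_policy, 3)
solutions = expand_sketches("Prove the claim.", sketches, toy_policy)
```

Because each sketch prompt lists the earlier sketches, the toy policy never repeats itself. Sampling the same prompt independently would return the same first strategy every time.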

CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs

CORE (Collaborative framework): introduces a collaborative framework that combines cloud and local LLMs to reduce UI exposure in mobile agents, including layout-aware block partitioning (groups UI elements), co-planning (collaboratively identifies sub-task), and co-decision-making (collaboratively selects UI elements).
The framework leverages the cloud LLM's strong reasoning with limited UI access and the local LLM's basic reasoning with full UI visibility to achieve a balance between task accuracy and privacy.
CORE significantly reduces sensitive UI element uploads to the cloud by up to 70.49% while maintaining task success rates comparable to cloud-only agents.

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

EARL (Evidence-Aware Reinforcement Learning): introduces an evidence-prioritized adaptive pixel-space video reasoning framework, with a Video LLM, Visual Encoder/Text Tokenizer, Merger Projector, Think + Frames Selection Function, Key-frame based Localized Re-sampling Module, and a Multi-component Reward System, to dynamically select relevant frames and perform localized re-sampling for fine-grained temporal detail.
This framework transforms passive video processing into an active evidence interrogation process, guided by a novel multi-component reward system that enforces evidence purity and strategically manages visual context selection.
The dynamic adjustment mechanism within the reward system ensures stable convergence by balancing exploration and purity requirements throughout training, leading to superior reasoning accuracy.

Adaptive Minds: Empowering Agents with LoRA-as-Tools

Adaptive Minds: introduces an agentic system that treats LoRA adapters as domain-specific tools, empowering a base LLM to act as a semantic router for dynamically selecting the most relevant LoRA tool to handle each query.
The system employs a modular multi-agent design orchestrated by LangGraph, combining flexible multi-agent orchestration with parameter-efficient fine-tuning to deliver accurate, specialized responses while preserving conversational ability.
Its AI-semantic routing, which leverages the base LLM's understanding, significantly outperforms keyword-based methods in accuracy and achieves a 3.1x average speedup compared to a baseline monolithic model.

MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

MARS (Reinforcing Multi-Agent Reasoning of LLMs through Self-play in Strategic Games): introduces an end-to-end RL framework that incentivizes multi-agent reasoning in LLMs through self-play in both cooperative and competitive games.
The framework incorporates a turn-level advantage estimator for fine-grained credit assignment and agent-specific advantage normalization to stabilize multi-agent training.
MARS agents, trained on a diverse portfolio of strategic games, develop strong strategic abilities that generalize to held-out games and improve performance in multi-agent reasoning benchmarks.
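One plausible reading of agent-specific advantage normalization is to standardize advantages within each agent's own samples, so that one role's reward scale does not dominate the update. The sketch below assumes that reading; the function name, the `eps` constant, and the sample values are hypothetical, and the paper's turn-level estimator is not reproduced here.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_per_agent(samples, eps=1e-8):
    """samples: list of (agent_id, advantage) pairs.

    Returns the advantages in the original order, each standardized with
    the mean and population standard deviation of its own agent's group.
    """
    by_agent = defaultdict(list)
    for agent, advantage in samples:
        by_agent[agent].append(advantage)
    stats = {agent: (mean(vals), pstdev(vals)) for agent, vals in by_agent.items()}
    return [(adv - stats[agent][0]) / (stats[agent][1] + eps)
            for agent, adv in samples]

# Made-up advantages: the two agents live on very different scales.
samples = [("p0", 1.0), ("p0", 3.0), ("p1", 10.0), ("p1", 30.0)]
normalized = normalize_per_agent(samples)
```

After normalization both agents contribute advantages of comparable magnitude, even though the raw values for `p1` are ten times larger.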

Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

CoordGen: introduces a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices, utilizing adaptive execution scheduling, context-aligned drafting, and hardware-efficient draft extension.
The framework addresses high latency and limited hardware utilization in on-device LLMs by offloading retrieval-based speculative decoding to NPUs, employing progressive graph scheduling, in-context distribution calibration, and NPU-optimized draft reuse.
CoordGen achieves significant speedup and energy efficiency improvements on smartphones across various tasks and LLMs by optimizing compute graph management and draft generation for NPU acceleration.

WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation

WebGen-V Bench: introduces a new benchmark and framework for instruction-to-HTML generation, with a Crawling Module (data acquisition and preprocessing), Processor (transforms raw data into structured representation), Structured Data (section-level metadata, UI screenshots, JSON text/image assets, instructions), Gen (HTML generation model), Evaluation Module (section-wise assessment of model outputs), Evaluator (multimodal LLM for scoring and feedback), and Feedback (iterative refinement for continuous improvement), providing a unified pipeline from real-world data acquisition to structured multimodal assessment.
The framework enhances data quality and evaluation granularity through an agentic crawling framework, structured section-wise data representation, and a section-level multimodal evaluation protocol.
WebGen-V enables high-granularity assessment by aligning text, layout, and visuals at the section level, facilitating precise detection and correction of subtle design inconsistencies in LLM-generated webpages.

Exemplar-Guided Planning: Enhanced LLM Agent for KGQA

PoG-EGP (Plan-on-Graph with Exemplar-Guided Planning): introduces a framework that enhances LLM agents' planning for Knowledge Graph Question Answering (KGQA) by using preprocessed training data to dynamically guide task decomposition and relation exploration; its components are Question Preprocessing, Text Embedding Generation, Exemplary Question Retrieval, Retrieved Exemplars, Smart Lookahead Mechanism, PoG, LLM Agent, Task Decomposition, Path Exploration, Memory, Evaluation, and Reflection.
The framework preprocesses training questions via entity templating, generates semantic embeddings, and retrieves similar exemplary questions and their reasoning paths using a FAISS index to provide high-quality auxiliary information.
A Smart Lookahead mechanism is integrated to improve efficiency during relation exploration by preemptively identifying promising paths and terminating exploration earlier, significantly enhancing performance and efficiency on KGQA datasets.
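The exemplar retrieval step can be illustrated with a brute-force cosine-similarity search. The summary names a FAISS index; the sketch swaps in a plain Python scan so it stays self-contained. The templated questions, relation names, and embedding vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_exemplars(query_vec, exemplars, k=2):
    """exemplars: list of (templated_question, reasoning_path, embedding).

    Returns the k (question, reasoning_path) pairs most similar to the query.
    """
    ranked = sorted(exemplars, key=lambda e: cosine(query_vec, e[2]), reverse=True)
    return [(question, path) for question, path, _ in ranked[:k]]

# Entity-templated training questions with toy 3-d embeddings.
exemplars = [
    ("Who directed [ENTITY]?", ["film.directed_by"], [1.0, 0.0, 0.0]),
    ("Where was [ENTITY] born?", ["person.place_of_birth"], [0.0, 1.0, 0.0]),
    ("Who produced [ENTITY]?", ["film.produced_by"], [0.9, 0.1, 0.0]),
]
top = retrieve_exemplars([1.0, 0.05, 0.0], exemplars, k=2)
```

The retrieved reasoning paths would then be passed to the planner as auxiliary context. An approximate nearest-neighbor index becomes worthwhile once the exemplar pool is large.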

AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

AUGUSTUS (An LLM-Driven Multimodal Agent System with Contextualized User Memory): introduces a multimodal agent system that processes, stores, retrieves, and acts on user context across various modalities, aligning its four-stage loop (Encode, Store in Memory, Retrieve, Act) with human cognitive memory principles.
The system leverages an LLM as its central planner, integrating In-Context, Recall, and a novel graph-structured Contextual Memory to manage information, and employs a Contextual-Personalized (CoPe) search for efficient concept-driven retrieval.
AUGUSTUS utilizes modality-specific encoders for input understanding and various generation tools for multimodal output, demonstrating superior performance and efficiency compared to traditional multimodal RAG approaches.

Experience-Driven Exploration for Efficient API-Free AI Agents

KG-Agent: introduces an experience-driven learning framework that structures raw pixel-level GUI interactions into a persistent State-Action Knowledge Graph (SA-KG) and employs a VLM-based Reasoning Module for skill invocation, augmentation, refinement, and evaluation.
The framework leverages a hybrid intrinsic reward mechanism, combining state value and novelty rewards, to support long-horizon reasoning and efficient exploration.
By connecting functionally similar yet visually distinct GUI states, KG-Agent enables generalization from diverse historical strategies, significantly improving exploration efficiency and strategic depth in API-free environments.

Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Multimodal RAG: introduces a systematic survey of Multimodal Retrieval-Augmented Generation for document understanding, detailing its components like User Query, Document, PDF2Img, OCR or Annotate, Image Retrieval, Text Retrieval, Multimodal Retrieval, Model, Answer Generation, Knowledge Base, Graph-based Index, Graph Traversal for Retrieval, LLM Agent, Query Decomposition, and Verification.
The survey categorizes existing methods by domain openness (closed/open), retrieval modality (image/text/hybrid), retrieval granularity (page/element), and hybrid enhancements (graph/agent-based).
It highlights the importance of Multimodal RAG for comprehensive document intelligence, addressing MLLM limitations in context modeling and enabling holistic retrieval and reasoning across text, tables, charts, and layout.

LLM-based In-situ Thought Exchanges for Critical Paper Reading

LLM-based In-situ Thought Exchange Interface: introduces a system designed to enhance junior researchers' critical paper reading skills by integrating AI-driven conversational agents into a custom PDF viewer, featuring a Comment Pane and Section Pane for interactive thought exchanges, highlighting, and commenting.
The system leverages LLMs to generate critical thinking questions, provide multi-disciplinary feedback, and reinterpret content, supporting both single-agent and multi-agent interaction modes.
This approach aims to foster critical thinking by encouraging active engagement and diverse perspectives, moving beyond passive information consumption.

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR (Self-Evolving LLM Agents Through an Experience-Driven Lifecycle): introduces a self-evolving LLM agent framework through a closed-loop experience lifecycle, integrating online interaction, offline self-distillation, and policy evolution for continuous self-improvement.
The framework enables agents to transform raw interaction trajectories into a curated repository of strategic principles, which are then used to guide future decision-making and generate high-quality data.
EvolveR employs a dynamic experience curation system with mechanisms for self-distillation, semantic deduplication, integration, and quality control, ensuring a compact and effective knowledge base.

Detecting Adversarial Fine-tuning with Auditing Agents

Fine-tuning Auditing Agent: introduces a robust detection mechanism for adversarial fine-tuning, utilizing an LLM as an agent with access to the fine-tuning dataset, pre-fine-tuned and fine-tuned models, and a suite of audit tools including dataset inspection, recursive summarization, model querying, Python execution, and benchmark running.
The agent systematically evaluates fine-tuned models by inspecting training data for patterns, querying models to compare behavior, and running benchmarks with attack-specific elicitation to assign a risk score for the fine-tuning job.
This approach effectively detects diverse fine-tuning attack vectors, including covert cipher attacks, by learning encoding schemes in-context and eliciting harmful responses, thereby preventing the deployment of maliciously poisoned LLMs.

Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration

TEAM-PHI (Trusted Evaluation and Automatic Model selection for PHI): introduces a multi-agent framework for automatic evaluation and selection of PHI de-identification models, utilizing Clinical Notes (raw clinical text), De-id Models (PHI extraction LLMs), Evaluation Agents (LLM-based judges), and LLM Majority Vote (judgment aggregation, selection) to assess de-identification quality without gold labels.
The framework employs multiple LLM-based Evaluation Agents to independently judge PHI extractions from various De-id models, consolidating their structured metrics via an LLM-based majority voting mechanism.
TEAM-PHI provides a practical, secure, and cost-effective solution for automatic evaluation and best-model selection in PHI de-identification, demonstrating consistent and accurate rankings even with limited ground-truth labels.
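The aggregation step can be sketched with a plain majority vote over each judge's preferred model. This is a simplification: the summary describes an LLM-based voting mechanism, whereas the sketch uses a simple counter, and the model names and scores are invented for illustration.

```python
from collections import Counter

def judge_pick(metrics):
    """Return the model one judge prefers, given its {model: score} metrics."""
    return max(metrics, key=metrics.get)

def majority_vote(judgments):
    """judgments: one {model: score} dict per evaluation agent.

    Returns (winning_model, number_of_votes). Ties are resolved in favor of
    the model that received its first vote earliest (an assumption).
    """
    votes = Counter(judge_pick(metrics) for metrics in judgments)
    winner, count = votes.most_common(1)[0]
    return winner, count

# Three hypothetical judges scoring three hypothetical de-id models.
judgments = [
    {"model_a": 0.91, "model_b": 0.88, "model_c": 0.80},
    {"model_a": 0.85, "model_b": 0.90, "model_c": 0.79},
    {"model_a": 0.93, "model_b": 0.89, "model_c": 0.84},
]
best_model, votes = majority_vote(judgments)
```

No gold labels enter the selection: each judge only compares the extractions it was shown, and the vote absorbs a single dissenting judge.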

16th Oct 2025

Agentic Design of Compositional Machines

Agentic Design of Compositional Machines: introduces a framework for LLM agents to design complex machines in the BesiegeField (simulated physical environment), including Designer (produces initial plan), Refiner (evaluates, proposes revisions), Inspector (abstractly assesses machine), Environment Querier (runs simulation, summarizes feedback), Meta-Designer (analyzes requirements, creates blueprint), Builder Agents (constructs blocks based on blueprint), and MCTS (search strategy for candidates).
The framework enables LLMs to construct machines from standardized components to meet functional demands, leveraging agentic workflows for iterative design and hierarchical construction.
The paper also explores RL finetuning of LLMs within this environment to improve spatial reasoning, strategic assembly, and instruction-following capabilities for machine design.

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

UI-Simulator: introduces a scalable paradigm for synthesizing training trajectories, integrating an LLM Pre-Training Corpus (Input data for LLMs), an LLM World Simulator (LLM-based UI environment generator), a Guided Rollout Process (Collects coherent, diverse UI trajectories), and a Trajectory Wrapper (Transforms rollouts into training data).
The framework leverages LLMs pre-trained on UI code and procedural knowledge to simulate diverse UI states and transitions, enabling robust digital agent training without extensive human annotation.
UI-Simulator-Grow extends this by incorporating Target Task Selection (Identifies high-impact learning tasks), Trajectory Variant Synthesis (Generates diverse task variations), and Continual Learning (Adapts agent policies iteratively) for data-efficient scaling.

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

IGPO (Information Gain-based Policy Optimization): introduces a reinforcement learning framework for multi-turn LLM agents in which a Policy LLM (Agent) interacts with an Environment through a Rollout of sequential Turns, each comprising a Think Step, a Tool Call Step, and a Tool Response Step, and ending in an Answer Turn. Rewards are computed against the Ground Truth by combining an Information Gain Reward and an Outcome Reward into a Reward Trajectory, from which a Discounted Cumulative Advantage is computed for policy optimization via a GRPO-style Surrogate Objective, guided by a Prompt Template.
The framework addresses reward sparsity in multi-turn LLM agent training by providing dense, intrinsic, turn-level supervision based on information gain, which measures the marginal increase in the policy's probability of producing the correct answer.
IGPO integrates this intrinsic turn-level reward with outcome-level supervision to form a dense reward trajectory, enhancing credit assignment and improving sample efficiency and accuracy in multi-turn scenarios.
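The reward construction can be made concrete with a small worked example. The sketch assumes each intermediate turn is rewarded with the change in the policy's probability of the correct answer and the final turn with the outcome reward. The function names, the probabilities, and the discount factor are hypothetical, and the group-wise normalization of a GRPO-style objective is omitted.

```python
def info_gain_rewards(answer_probs, outcome_reward):
    """answer_probs[0] is the probability of the ground-truth answer before
    any turn; answer_probs[t] is the probability after turn t.

    Returns one information-gain reward per intermediate turn, followed by
    the outcome reward for the answer turn.
    """
    gains = [answer_probs[t] - answer_probs[t - 1]
             for t in range(1, len(answer_probs))]
    return gains + [outcome_reward]

def discounted_cumulative(rewards, gamma):
    """Discounted sum of the current and all later rewards, for each turn."""
    out, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    return out[::-1]

# Toy rollout: three tool-calling turns, then a correct answer.
probs = [0.10, 0.30, 0.25, 0.70]
rewards = info_gain_rewards(probs, outcome_reward=1.0)
returns = discounted_cumulative(rewards, gamma=0.5)
```

The second turn lowers the probability of the correct answer and receives a negative reward, which a single outcome reward at the end of the rollout could not express.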

Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores

Clipped-Linear Model (Identity-Link Item Response Theory): introduces a novel LLM evaluation framework that leverages TVD-MI scores, an LLM judge, and an identity link to preserve additivity in agent-item score matrices, enabling sample-efficient sparse recovery.
This framework employs a clipped-linear model derived from Gini entropy maximization, which directly models raw TVD-MI scores as an additive decomposition of latent agent abilities and item difficulties, avoiding distortions from traditional logistic/probit links.
The approach achieves significant sample efficiency, requiring 3x fewer evaluations than dense methods while maintaining high reconstruction accuracy and preserving agent rankings, validated through discrete integrability tests and cross-domain experiments.
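The additivity property can be illustrated with a toy score matrix. The sketch assumes scores of the form clip(ability - difficulty) on [0, 1]; the sign convention, the clipping bounds, and the numbers are assumptions. The double-difference check is one simple additivity test and may differ from the paper's integrability tests.

```python
def clipped_linear(theta, b, lo=0.0, hi=1.0):
    """Score matrix S[i][j] = clip(theta[i] - b[j], lo, hi)."""
    return [[min(hi, max(lo, t - d)) for d in b] for t in theta]

def max_double_difference(S):
    """Largest |S[i][j] - S[i][l] - S[k][j] + S[k][l]| over all index pairs.

    For a purely additive matrix every such double difference is zero.
    """
    worst = 0.0
    rows, cols = len(S), len(S[0])
    for i in range(rows):
        for k in range(rows):
            for j in range(cols):
                for l in range(cols):
                    diff = S[i][j] - S[i][l] - S[k][j] + S[k][l]
                    worst = max(worst, abs(diff))
    return worst

theta = [0.9, 0.7, 0.5]   # hypothetical agent abilities
b = [0.1, 0.2, 0.4]       # hypothetical item difficulties
S = clipped_linear(theta, b)
residual = max_double_difference(S)
```

With no entry clipped, the residual is zero up to floating-point error. A nonlinear link such as a logistic would break this property on the transformed scores.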

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates

TTRL Agent (Test-Time Reinforcement Learning Agent): introduces a reinforcement learning agent that self-improves schema mapping accuracy without labeled data or model updates by iteratively refining mappings through a Generative LLM, Conflict Detection, external Evidence Collection, and Confidence Evaluation, guided by dynamic prompts and an accumulating Memory/Context.
The agent identifies ambiguous mappings, formulates targeted web-search queries for external evidence, and uses confidence-based rewards to iteratively refine its mappings, reducing low-confidence mappings requiring expert review.
This approach provides an evidence-driven, transparent method for schema mapping, achieving high accuracy and reducing manual verification costs in scenarios with incomplete documentation.

The Gatekeeper Knows Enough

The Gatekeeper Protocol: introduces a novel, domain-agnostic framework that governs LLM agent-system interactions, utilizing a System State-Context Representation (SCR) as a central data structure, an AGENT for reasoning and proposing actions, and a System / Execution Environment for validating and executing these actions.
This protocol mandates that the AGENT first reasons on a low-fidelity "latent state" representation within the SCR to strategically request high-fidelity context on demand, ensuring token efficiency and grounded interactions.
All interactions are mediated through a unified JSON format, serving as a declarative, state-synchronized protocol that ensures the agent's model of the system remains verifiably grounded in reality, significantly improving reliability and scalability.

Where to Search: Measure the Prior-Structured Search Space of LLM Agents

Formal Theory for LLM-assisted Iterative Search: introduces a compact formal theory to describe and measure LLM-assisted iterative search, representing agents as fuzzy relation operators and characterizing search space geometry.
The theory quantifies reachability difficulty using a coverage generating function and critical parameters, while safety is ensured by confining agents within a crisp idealized safety envelope.
A majority-vote instantiation on a 2D grid validates the abstract concepts, providing operational tools to measure LLM agents and their search spaces.

Agentic NL2SQL to Reduce Computational Costs

Datalake Agent: introduces an agentic system that enables an LLM to solve NL2SQL tasks more efficiently through Information Acquisition, Iterative Refinement, and Query Formulation components, reducing meta-information processing by selectively requesting only the necessary data.
The framework employs an interactive loop, allowing the LLM to gather general schema knowledge, refine its understanding hierarchically, and generate precise SQL queries using predefined commands like GetDBDescription, GetTables, GetColumns, and DBQueryFinalSQL.
This approach significantly reduces token usage and computational costs by up to 87% compared to direct prompting, while maintaining competitive performance on table question answering tasks across varying database sizes.
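The hierarchical command loop can be sketched with a toy in-memory catalog. The four command names come from the summary above; their argument shapes, the `CATALOG` contents, and the scripted `trace` (standing in for the LLM's choices) are hypothetical.

```python
# Hypothetical data lake metadata: database -> description and tables.
CATALOG = {
    "sales": {
        "description": "Orders and customers",
        "tables": {"orders": ["id", "customer_id", "total"],
                   "customers": ["id", "name"]},
    },
    "hr": {
        "description": "Employees and payroll",
        "tables": {"employees": ["id", "name", "salary"]},
    },
}

def handle(command, *args):
    """Answer one agent command with only the metadata it asked for."""
    if command == "GetDBDescription":
        return {db: meta["description"] for db, meta in CATALOG.items()}
    if command == "GetTables":
        (db,) = args
        return sorted(CATALOG[db]["tables"])
    if command == "GetColumns":
        db, table = args
        return CATALOG[db]["tables"][table]
    if command == "DBQueryFinalSQL":
        db, sql = args
        return {"database": db, "sql": sql}
    raise ValueError(f"unknown command: {command}")

# Scripted stand-in for the LLM: drill down from databases to columns,
# then submit the final query.
trace = [
    ("GetDBDescription",),
    ("GetTables", "sales"),
    ("GetColumns", "sales", "orders"),
    ("DBQueryFinalSQL", "sales", "SELECT SUM(total) FROM orders"),
]
responses = [handle(*step) for step in trace]
```

The agent never sees the `hr` schema or the `customers` columns. That selective access is the source of the token savings over placing the full schema in a single prompt.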

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

ToolPRM (Fine-Grained Inference Scaling of Structured Outputs for Function Calling): introduces an inference scaling framework that combines a ToolPRM (process reward model) with fine-grained beam search, leveraging a fine-grained intra-call process supervision dataset and function masking techniques to enhance LLM agent performance in structured function calling.
The framework decomposes function calls into semantically interpretable intermediate reasoning steps, enabling ToolPRM to provide step-level rewards for each decision, which guides the beam search to "explore more but retain less" for reliable structured output generation.
This approach significantly improves backbone model performance across various function calling tasks by offering more granular feedback than coarse-grained or outcome-based reward models, addressing the unrecoverability of early errors in structured outputs.
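A minimal sketch of step-level beam search over a decomposed function call follows. The decomposition into name and argument steps, the `toy_prm` scores, and the summation of step rewards are assumptions made for illustration; they are not the paper's reward model.

```python
def fine_grained_beam_search(steps, score_step, expand_width, beam_width):
    """steps: list of candidate lists, one per decision step.
    score_step(prefix, candidate) -> step-level reward.

    Expands up to expand_width candidates per beam at each step (explore
    more), then keeps only the beam_width best partial calls (retain less).
    """
    beams = [([], 0.0)]
    for candidates in steps:
        expanded = []
        for prefix, total in beams:
            for cand in candidates[:expand_width]:
                expanded.append((prefix + [cand], total + score_step(prefix, cand)))
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams

def toy_prm(prefix, cand):
    """Hypothetical process reward model: a fixed lookup table."""
    table = {"get_weather": 0.9, "get_time": 0.3, "search": 0.2,
             "city=Paris": 0.8, "zone=CET": 0.1,
             "unit=celsius": 0.7, "unit=kelvin": 0.2}
    return table[cand]

steps = [
    ["get_weather", "get_time", "search"],   # step 1: function name
    ["city=Paris", "zone=CET"],              # step 2: first argument
    ["unit=celsius", "unit=kelvin"],         # step 3: second argument
]
best_call, best_score = fine_grained_beam_search(
    steps, toy_prm, expand_width=3, beam_width=1)[0]
```

Scoring every intermediate decision lets a wrong function name be discarded at the first step, before any arguments are generated for it.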

LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

LLM Agents for Automated Web Vulnerability Reproduction: introduces a comprehensive evaluation framework for assessing LLM agents' capabilities in transforming vulnerability reports into working exploits, including a benchmark dataset, LLM agents, evaluation tasks, and criteria.
The evaluation systematically assesses 20 state-of-the-art LLM agents across 16 dimensions on 3 representative CVEs, then conducts an in-depth analysis of the top 3 agents (OpenHands, SWE-agent, CAI) on 80 real-world CVEs.
Findings reveal that while LLM agents achieve reasonable success on simple library-based vulnerabilities, they consistently fail on complex service-based vulnerabilities requiring multi-component environments and robust authentication.

LLM Agents Beyond Utility: An Open-Ended Perspective

Open-Ended LLM Agent Loop: introduces an LLM agent augmented with task generation, memory management, and environmental interaction capabilities, enabling it to autonomously generate and pursue its own goals in an open-ended setting.
The agent extends the ReAct framework by incorporating self-generated tasks, persistent long-term memory, and file tools for creating lasting environmental artifacts across multiple runs.
This system explores the potential and limitations of adapting pretrained LLMs for open-ended behavior, highlighting challenges in memory management, productive exploration, and abstract goal pursuit.

JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol

JSPLIT (Taxonomy-based Solution for Prompt Bloating in Model Context Protocol): introduces a taxonomy-driven framework to manage prompt size effectively for AI agents using large sets of Model Context Protocol (MCP) tools, by organizing tools into a hierarchical taxonomy and using LLMs to identify and include only relevant tools based on user queries and taxonomy structure.
This approach significantly reduces prompt size, token costs, and latency while improving tool selection accuracy and task success in complex agent environments.
The framework's core, the Taxonomy-MCPResolver, leverages LLMs for a two-phase process of taxonomy classification and MCP server ranking to prune irrelevant tools from the agent's context.
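The two-phase pruning can be sketched with a keyword matcher standing in for the LLM classifier. The taxonomy paths, keyword sets, and server names are hypothetical, and the ranking by keyword hits is only a placeholder for LLM-based ranking.

```python
# Hypothetical taxonomy: category path -> MCP servers filed under it.
TAXONOMY = {
    "finance/payments": ["stripe-mcp", "paypal-mcp"],
    "finance/accounting": ["ledger-mcp"],
    "travel/flights": ["flights-mcp"],
    "travel/hotels": ["hotels-mcp"],
}

# Keyword stand-in for the LLM taxonomy classifier.
KEYWORDS = {
    "finance/payments": {"pay", "refund", "charge"},
    "finance/accounting": {"invoice", "ledger"},
    "travel/flights": {"flight", "airline"},
    "travel/hotels": {"hotel", "room"},
}

def classify(query):
    """Phase 1: map the query to matching taxonomy categories with hit counts."""
    words = set(query.lower().split())
    hits = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    return {cat: n for cat, n in hits.items() if n > 0}

def select_servers(query, top_k=3):
    """Phase 2: rank servers from matched categories and keep the top_k."""
    hits = classify(query)
    ranked = sorted(hits, key=hits.get, reverse=True)
    servers = [server for cat in ranked for server in TAXONOMY[cat]]
    return servers[:top_k]

selected = select_servers("refund the charge for my flight booking", top_k=3)
```

Only the selected servers' tool descriptions would be added to the agent's prompt, so the accounting and hotel tools never consume context.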

E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task

E2EDev (End-to-End Software Development Benchmark): introduces a novel benchmark grounded in Behavior-Driven Development (BDD) principles, evaluating LLM-based End-to-End Software Development (E2ESD) frameworks by assessing generated software against user needs through mimicking real user interactions, comprising fine-grained user requirements, multiple BDD test scenarios with Python step implementations, an automated testing pipeline, and a Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA).
The HITL-MAA framework leverages specialized LLM agents, including Code Analyzer, Requirement Extractor, Test Case Generator, Test Automation Engineer, Step Checker, and Test Runner agents, with human supervision at key stages to ensure data quality and reduce annotation effort.
E2EDev addresses limitations of existing E2ESD benchmarks by providing fine-grained requirements and reliable, automated evaluation protocols built on the Behave framework, revealing that current LLM-based frameworks struggle with detailed functional specifics and multi-agent architectures often incur high costs with minimal gains.

LiRA: Linguistic Robust Anchoring for Cross-Lingual Large Language Models

LiRA (Linguistic Robust Anchoring for Large Language Models): introduces a training framework that robustly improves cross-lingual representations under low-resource conditions by jointly strengthening retrieval and reasoning.
The framework integrates Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, and LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization.
Arca's Translation Critic judges candidate translations, the Embedding Critic anchors feature paths, and the Actor Model fuses these critics to select candidates, while LaSR's LLM Transformer fuses English and multilingual embeddings, supported by CorrQueue and DocQueue for training stability.

Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

NLT (Natural Language Tools): introduces a modular three-step architecture that replaces programmatic JSON tool calling with natural language outputs, decoupling tool selection from response generation to improve accuracy and reduce variance.
The framework utilizes a Selector LLM to identify relevant tools based on a natural language prompt, a Tool Parser to extract decisions, and a Tool Logic component to execute selected tools, before an Output Model generates the final response.
NLT significantly improves tool calling accuracy by 18.4 percentage points and reduces output variance by 70% across diverse models and domains, demonstrating enhanced robustness to prompt perturbations and extending capabilities to models lacking native support.

IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

IMAGINE (Integrating Multi-Agent System into One Model): introduces a framework that distills the reasoning and planning capabilities of a Multi-Agent System into a single, compact LLM through a three-stage training pipeline, including New Query Generation, Multi-Agent System based Inference Data Generation, and Agentic Reasoning Training.
The framework's Multi-Agent System based Inference Data Generation stage employs a Reasoner, two Judges, and a Reflector to produce high-quality, reflected reasoning data for training.
Agentic Reasoning Training, comprising Agentic SFT and Agentic RL guided by a Newly Designed Agentic Reward Function, integrates and enhances the model's agentic reasoning abilities, enabling a small model to outperform larger Multi-Agent Systems.

The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems

CPR simulation framework: introduces a common-pool resource simulation framework for LLM multi-agent systems that enables the endogenous emergence of cooperative norms without explicit reward signals; it comprises LLM agents, a shared resource, Harvest & Consumption, Individual Punishment, Social Learning, and Group Decision modules, individual and group norms, cultural-evolutionary mechanisms, environmental feedback, payoff-biased social learning, a propose-vote rule, and prompts.
The framework serves as a testbed to study how LLM agents develop strategies in mixed-motive settings and form group-beneficial norms through social learning and norm-based punishment.
The study validates the framework by reproducing human behavior findings and demonstrates its ability to discriminate LLMs based on their cooperative tendencies and norm formation capabilities under diverse environmental and social conditions.

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

MedTrust-Guided Iterative RAG: introduces a framework for biomedical question answering that enhances factual consistency and mitigates hallucinations by employing an iterative retrieval-verification pipeline and a MedTrust-Align Module for trust alignment.
The iterative pipeline, featuring a verifier agent and a generator agent, refines evidence and generates citation-grounded reasoning or refusal statements, while the MedTrust-Align Module constructs a hallucination-aware dataset and uses Direct Preference Optimization to reinforce reliable reasoning.
This approach systematically addresses hallucination patterns and evidence insufficiency in complex medical queries, leading to more accurate and trustworthy LLM responses in clinical contexts.

Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

YNTP (Your Next Token Prediction): introduces a multilingual benchmark for personalized response generation, utilizing an LLM-driven multi-NPC dialogue system that includes an FSM Engine (governs dialogue flow/state transitions), a Scenario Script (defines dialogue content/branching logic/NPC roles), and LLM Dialogue Generation (linguistic/emotional realization module).
This system collects natural, personalized, and psychologically grounded conversation data from users interacting with MBTI-dimensioned NPCs over five-day dialogue sessions.
The benchmark enables token-level prediction of individualized responses, moving beyond stylistic mimicry to model deeper cognitive regularities in word choice.

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Beyond One World: introduces a benchmark for evaluating LLMs' character-grounded role-playing across multiversal contexts, featuring Canon Events and Moral Dilemmas tasks, an LLM-as-a-judge rubric for thinking/acting, and a Think-Act Matching metric.
The benchmark assesses LLMs' ability to consistently portray version-specific superhero characters by probing factual recall and ethical decision-making across 30 iconic heroes and 90 canon-specific versions from Marvel and DC universes.
The evaluation framework disentangles internal deliberation from outward decisions, using structured prompting and an LLM judge, revealing critical gaps in multiversal consistency and reasoning alignment in current LLMs.

Stop-RAG: Value-Based Retrieval Control for Iterative RAG

Stop-RAG: introduces a value-based controller for adaptive stopping in iterative retrieval-augmented generation (RAG) systems, comprising an Iterative RAG Pipeline, Query Generator, Retriever, Reranker, Answer Generator, Stop-RAG Controller, MDP Formulation, Q-network, Q(λ) Targets, and Decision Rule; it frames iterative RAG as a finite-horizon Markov Decision Process (MDP) and trains a Q-network using Q(λ) targets to provide forward-looking estimates of stopping quality.
The framework adaptively decides when to stop retrieving by estimating and comparing immediate and future gains, enabling more reliable stopping decisions without relying on internal telemetry or fixed iteration counts.
Stop-RAG consistently improves performance on multi-hop question-answering benchmarks, demonstrating its effectiveness as a modular, plug-and-play component compatible with black-box LLMs and existing RAG pipelines.
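
The idea of training on λ-weighted targets and then stopping when the estimated value of stopping is at least the value of continuing can be sketched as follows. This is a generic recursive λ-return under stated assumptions, not necessarily the exact Q(λ) target used in the paper; the function names, the reward layout, and the tie-breaking toward stopping are all hypothetical.

```python
def lambda_returns(rewards, next_values, lam=0.8, gamma=1.0):
    """Recursive lambda-return targets for one finite-horizon trajectory.

    rewards[t] is the reward received after step t, and next_values[t] is a
    bootstrap value estimate for the state reached after step t (use 0.0 for a
    terminal state). With lam=0 this reduces to one-step TD targets; with lam=1
    it reduces to Monte Carlo returns.
    """
    targets = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            bootstrap = next_values[t]
        else:
            # Blend the one-step bootstrap with the longer-horizon target.
            bootstrap = (1 - lam) * next_values[t] + lam * targets[t + 1]
        targets[t] = rewards[t] + gamma * bootstrap
    return targets


def should_stop(q_stop, q_continue):
    """Hypothetical decision rule: stop when stopping is estimated to be no worse."""
    return q_stop >= q_continue
```

In this sketch `q_stop` and `q_continue` would come from the trained Q-network at each retrieval round, so the controller needs only the pipeline's visible state and no access to the generator's internals.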

Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies

TERRARIUM: introduces a modular and configurable framework for studying multi-agent systems (MAS) safety, privacy, and security, comprising Agent (LLM-based entity), Environment (simulator, state, objective), Blackboard (communication proxy), Tools (external capabilities), Communication Protocol (interaction rules), Factor Graph (blackboard initialization), MCP Server (model context protocol), Persistence (logs, configurations), and Infrastructure (LLMs, MCP servers).
The framework repurposes the blackboard design to create a modular, configurable testbed for multi-agent collaboration, enabling systematic study of attack vectors like misalignment, malicious agents, compromised communication, and data poisoning.
Its modular and configurable design facilitates rapid prototyping, evaluation, and iteration on defenses and designs, accelerating progress toward trustworthy multi-agent systems.
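
The blackboard pattern that the framework builds on can be sketched in a few lines: agents never message each other directly, they post to and read from a shared board, which also makes every exchange easy to log and inspect. This is an illustrative sketch only; the class, method names, and the `audit` helper are hypothetical and not taken from the framework's code.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared communication proxy: agents post entries and read by topic."""
    entries: list = field(default_factory=list)

    def post(self, author, topic, content):
        self.entries.append({"author": author, "topic": topic, "content": content})

    def read(self, topic, reader):
        # An agent sees what the other agents wrote on a topic, not its own posts.
        return [e["content"] for e in self.entries
                if e["topic"] == topic and e["author"] != reader]


def audit(board, trusted):
    """Hypothetical check: list entries written by authors outside a trusted set."""
    return [e for e in board.entries if e["author"] not in trusted]
```

Because all traffic passes through one object, an experiment on a malicious agent or a compromised channel reduces to controlling who may post and what readers see.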

PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

PRISM (Precision-Recall Iterative Selection Mechanism): introduces an agentic retrieval framework that leverages LLM-based agents, including a Question Analyzer, Selector, and Adder, within an Iterative Refinement Loop to retrieve relevant evidence for multi-hop question answering.
The framework's Question Analyzer decomposes complex queries into sub-questions, while the Selector and Adder agents iteratively refine the evidence set by balancing precision and recall.
This approach produces compact and comprehensive evidence sets, which are then used by an Answer Generator Agent to provide accurate answers, outperforming strong baselines in multi-hop QA benchmarks.
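
The alternation between a precision-oriented Selector and a recall-oriented Adder can be sketched with keyword overlap standing in for the LLM agents. This is a toy illustration under stated assumptions: the sub-questions are assumed to be already reduced to a keyword set, and `select`, `add`, and `refine` are hypothetical names rather than the paper's interfaces.

```python
def tokens(text):
    return set(text.lower().split())


def select(keywords, evidence):
    """Precision step: drop passages that match none of the keywords."""
    return [p for p in evidence if tokens(p) & keywords]


def add(keywords, evidence, corpus):
    """Recall step: add passages that cover keywords the evidence still misses."""
    covered = set().union(*(tokens(p) for p in evidence))
    missing = keywords - covered
    extra = [p for p in corpus if p not in evidence and tokens(p) & missing]
    return evidence + extra


def refine(keywords, initial, corpus, max_rounds=3):
    """Alternate select and add until the evidence set stops changing."""
    evidence = list(initial)
    for _ in range(max_rounds):
        updated = add(keywords, select(keywords, evidence), corpus)
        if updated == evidence:
            break
        evidence = updated
    return evidence
```

The loop terminates either at a fixed point or after `max_rounds`, which keeps the evidence set compact while still recovering passages the initial retrieval missed.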

GENLARP: Enabling Immersive Live Action Role-Play through LLM-Generated Worlds and Characters

GENLARP: introduces a virtual reality system that transforms personalized stories into immersive LARP experiences, utilizing Narrative Initialization (user input processing/world and story generation), Interactive Role Design (character and interaction logic), and Live-Action Role Play (immersive user experience) modules.
The system leverages generative AI and LLMs to create dynamic virtual worlds and characters, allowing users to act as both creators and players within the narrative.
It addresses traditional LARP limitations by enabling virtual reenactments without extensive physical setup or large groups, fostering deeper engagement through LLM-driven agents and dynamic narrative adaptation.

AlphaQuanter: An End-to-End Tool-Orchestrated Agentic Reinforcement Learning Framework for Stock Trading

AlphaQuanter: introduces a single-agent framework that leverages reinforcement learning (RL) to learn a dynamic policy over a transparent, tool-augmented decision workflow, empowering an agent to autonomously orchestrate tools and proactively acquire information on demand, establishing a transparent and auditable reasoning process.
The framework unifies workflows into a ReAct-like agent, starting with a guided plan, followed by iterative tool use and information seeking, and in-depth analysis, utilizing various financial data sources and a reward function for end-to-end optimization.
AlphaQuanter's design ensures decision consistency and interpretability by enforcing stepwise hypothesis testing and tightly coupling evidence collection with reasoning, leading to state-of-the-art performance on key financial metrics and sophisticated trading strategies.

Towards Agentic Self-Learning LLMs in Search Environment

ASL (Agentic Self-Learning): introduces a multi-role, closed-loop reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone, including a Prompt Generator (generates tasks, adapts difficulty), a Policy Model (generates solutions, improves performance), a Generative Reward Model (assesses correctness, refines evaluation), Tools (retrieves information), and a Meta Prompt (guides task generation).
ASL enables LLMs to autonomously evolve their reasoning, generation, and evaluation capabilities in a continuous closed loop, addressing the need for scalable reward signals and agent task data.
The framework demonstrates superior sample efficiency and robustness, achieving steady performance gains and surpassing strong RLVR baselines even under zero-labeled-data conditions.

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

OHAB (Online Harassment Agentic Benchmark): introduces a framework for systematically studying how multi-turn LLM agents can be coerced into generating abusive content, with a synthetic multi-turn harassment conversation dataset generation pipeline, a multi-agent simulation design, and a mixed-methods evaluation framework.
The framework employs various jailbreak methods, including persona-only priming, toxic memory injection, planning attacks (CoT/ReAct), and jailbreak fine-tuning, to assess vulnerabilities in LLMs like LLaMA-3.1-8B-Instruct and Gemini-2.0-flash-001.
The evaluation combines LLM-based judgment with human annotation, informed by social theories like Dark Triad Traits and Conflict Avoidance, to provide nuanced insights into harassment dynamics and behavioral patterns.

DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

DPRF (Dynamic Persona Refinement Framework): introduces a novel methodology to optimize LLM Role-Playing Agents' behavioral alignment with human ground truth by iteratively identifying cognitive divergences and refining persona profiles.
The framework operates through an iterative feedback loop, comparing agent-generated behaviors against human ground truth using a Behavior Analysis Agent and updating the persona via a Persona Refinement Agent.
DPRF is model-agnostic, domain-agnostic, and data-efficient, enhancing persona fidelity for applications like user simulation and personalized AI.

Agentic Entropy-Balanced Policy Optimization

AEPO (Agentic Entropy-Balanced Policy Optimization): introduces a dynamic entropy-balanced rollout (manages rollout sampling) and entropy-balanced policy optimization (optimizes policy updates), which together balance entropy during rollout and policy updates to enhance multi-turn tool-use capabilities in LLMs.
The dynamic entropy-balanced rollout adaptively allocates sampling budgets via entropy pre-monitoring and penalizes consecutive high-entropy branches to mitigate over-branching issues.
The policy optimization component preserves high-entropy token gradients and prioritizes learning on high-uncertainty tokens through entropy-aware advantage estimation, improving stability and scalability for web agent training.
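
The two rollout ideas described above, monitoring entropy and discouraging runs of consecutive high-entropy branches, can be sketched as follows. The formula in `branch_probability` is an assumption made purely for illustration (a linear score clipped to [0, 1]); it is not the paper's rule, and the parameter names and defaults are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def branch_probability(entropy_change, consecutive_branches,
                       base=0.5, gain=0.5, penalty=0.2):
    """Hypothetical branching rule for a rollout step.

    A larger entropy change after a tool call makes branching more likely,
    while each immediately preceding branch on the same path makes it less
    likely, which limits over-branching.
    """
    raw = base + gain * entropy_change - penalty * consecutive_branches
    return min(1.0, max(0.0, raw))
```

With these defaults, the same entropy change yields a lower branching probability on a path that has already branched twice in a row than on a fresh path.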

Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Helmsman: introduces a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications, including User, Planning Agent, Reflection Agent, Human Approval, Supervisor Agent, Coder Agent, Tester Agent, Evaluator Agent, Debugger Agent, Task Module, Client Module, Strategy Module, Server Module, Sandboxed Federated Simulation, Web Search Tool, RAG Pipeline, and AgentFL-Bench, by emulating a principled research and development workflow through interactive planning, modular code generation, and autonomous evaluation.
The framework structures the complex Federated Learning (FL) design process into three collaborative phases: interactive human-in-the-loop planning, modular code generation by supervised agent teams, and closed-loop autonomous evaluation and refinement in a sandboxed simulation environment.
Helmsman also introduces AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to rigorously assess the system-level generation capabilities of agentic systems in FL, demonstrating competitive and often superior solutions compared to hand-crafted baselines.

Why Instant-Runoff Voting Is So Resilient to Coalitional Manipulation: Phase Transitions in the Perturbed Culture

Phase Transition Analysis of Voting Rules in Perturbed Culture Model: introduces an analysis of Plurality, Two-Round System, and Instant-Runoff Voting (IRV) within the Perturbed Culture Model, revealing phase transitions in their susceptibility to coalitional manipulation (CM).
The study identifies a critical threshold (θc) for each rule, below which the CM rate tends to 1 for large electorates and above which it tends to 0.
The paper introduces the Super Condorcet Winner (SCW) concept, demonstrating its role as a key factor in IRV's exceptional resilience to CM, with IRV's θc being 0.
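
For readers unfamiliar with the rule under study, a minimal Instant-Runoff Voting tally is sketched below, together with a ballot sampler. The sampler assumes the usual reading of a perturbed culture, in which each voter copies a reference ranking with probability θ and otherwise draws a uniformly random ranking; that reading, the tie-breaking rule, and all function names are assumptions of this sketch rather than details taken from the paper.

```python
import random
from collections import Counter


def instant_runoff(ballots):
    """Return the IRV winner; each ballot ranks all candidates, best first."""
    remaining = set(ballots[0])
    while True:
        tally = Counter({c: 0 for c in remaining})
        for ballot in ballots:
            # Each ballot counts for its highest-ranked remaining candidate.
            top = next(c for c in ballot if c in remaining)
            tally[top] += 1
        leader = max(remaining, key=lambda c: (tally[c], c))
        if 2 * tally[leader] > len(ballots) or len(remaining) == 1:
            return leader
        # Hypothetical tie-break: among the lowest tallies, drop the
        # alphabetically first candidate.
        loser = min(remaining, key=lambda c: (tally[c], c))
        remaining.remove(loser)


def perturbed_culture(reference, theta, n_voters, rng):
    """Sample ballots: reference ranking with probability theta, else uniform."""
    ballots = []
    for _ in range(n_voters):
        if rng.random() < theta:
            ballots.append(list(reference))
        else:
            ballots.append(rng.sample(reference, len(reference)))
    return ballots
```

For example, with four ballots A > B > C, three ballots B > C > A, and two ballots C > B > A, plurality elects A, whereas IRV eliminates C first, transfers its two ballots to B, and elects B with five of nine votes.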

Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Hi-Agent (Hierarchical Vision-Language Agents for Mobile Device Control): introduces a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized.
The framework reformulates multi-step decision-making as a sequence of single-step subgoals and employs a foresight advantage function, leveraging execution feedback to guide high-level optimization.
Hi-Agent achieves state-of-the-art performance on mobile control benchmarks by combining structured task decomposition with stable, critic-free joint training.

MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation

MAGPIE (Multi-AGent contextual Privacy Evaluation): introduces a novel benchmark for evaluating privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios, featuring a Dataset Construction Pipeline (generates and validates scenarios), a Simulation Environment (orchestrates multi-agent negotiations), and an Evaluator LLM (assesses privacy leakage and task outcomes).
The benchmark comprises 200 high-stakes, multi-turn tasks where private information is integral to task resolution, forcing LLM agents to balance effective collaboration with strategic information control.
Evaluations reveal that state-of-the-art LLM agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage and struggle with consensus, often resorting to undesirable behaviors like manipulation and power-seeking.

Procedural Game Level Design with Deep Reinforcement Learning

Co-adaptive Procedural Content Generation Framework: introduces a novel method for procedural game level design using deep reinforcement learning (DRL), featuring a Hummingbird Agent (solver), a Floating Island Agent (generator), a Unity Environment (3D simulation), Proximal Policy Optimization (PPO) (training algorithm), Unity ML-Agents Toolkit (platform), a Feedback Loop (interaction mechanism), and Auxiliary Inputs (observation enhancement), integrating DRL agents for both environment generation and task-solving.
This framework employs two PPO-trained agents: a hummingbird agent that learns to collect flowers in a dynamic 3D Unity environment, and an island agent that generates diverse, context-aware flower placements based on environmental cues and performance feedback.
The dynamic feedback loop between the agents enables co-adaptive learning, where the island agent evolves to create effective level configurations, and the hummingbird agent concurrently learns to solve them with greater robustness and generalization.

Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization

Policy Transfer with IPO (Iterative Policy Optimization): introduces a theoretical analysis of policy transfer for continuous-time Linear Quadratic Regulators (LQRs) with entropy regularization, proposing a novel IPO algorithm that achieves global linear and local super-linear convergence.
The framework demonstrates that an optimal policy from a source LQR can serve as a near-optimal initialization for closely related target LQRs, preserving convergence rates.
The analysis also establishes the stability of a class of continuous-time score-based diffusion models by connecting them with LQRs.

HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks

HugAgent (Human-Grounded AGENT Benchmark): introduces a dual-track benchmark for average-to-individual reasoning adaptation, including an interactive semi-structured chatbot, a structured questionnaire, a dynamic question generator, and a Causal Belief Network for representing individual belief systems.
The framework utilizes both a synthetic track for scalable stress tests and a human-grounded track for ecologically valid reasoning data, enabling reproducible evaluation of intra-agent fidelity.
It operationalizes reasoning adaptation into two measurable tasks: Belief-State Inference and Belief Dynamics Update, aiming to predict how specific individuals reason and update beliefs in novel scenarios.

Internalizing World Models via Self-Play Finetuning for Agentic RL

SPA (Self Play Agent): introduces a reinforcement learning framework that equips LLM agents with an internal world model, decomposed into State Estimation and Transition Modeling, learned via a Self-Play Supervised Finetuning stage, to improve performance in out-of-distribution environments.
The framework first cold-starts the policy by enabling the LLM agent to self-play and acquire world knowledge from the environment, then uses this learned world model to simulate future states prior to policy optimization through RL training.
This approach significantly boosts success rates in environments like Sokoban and FrozenLake by grounding LLM reasoning in environmental rules rather than memorized trajectories, leading to more robust generalization.

GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

GUIrilla: introduces a scalable framework for automated desktop UI exploration, systematically exploring macOS applications via native accessibility APIs and simulated user interactions, supported by three LLM-based agents for element ordering, input generation, and task postprocessing.
The framework generates hierarchical GUI graphs from discovered interface elements and crawler actions, addressing data collection challenges in GUI automation and producing the GUIrilla-TASK dataset.
GUIrilla leverages specialized interaction handlers to achieve comprehensive application coverage and constructs function-centric tasks, enabling LLM-based agents to significantly improve performance on downstream UI tasks with less data.

15th Oct 2025

GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

GAPS (Grounding-Adequacy-Perturbation-Safety): introduces a clinically grounded, automated benchmark for evaluating AI clinicians, featuring Grounding (reasoning depth), Adequacy (answer completeness), Perturbation (input robustness), and Safety (harm prevention) axes, operationalized by a pipeline that constructs guideline-centered evaluation items and rubrics.
The framework employs an automated pipeline for evidence neighborhood assembly, knowledge graph and hierarchical tree representations, item generation across G-levels and P-perturbations, and rubric synthesis by a DeepResearch agent using a ReAct-style loop.
Scoring is performed by an ensemble of LLM judges, revealing that current LLMs excel at factual recall but struggle with increased reasoning depth, answer completeness, and robustness to adversarial inputs, guiding future AI clinician development.

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

ReGuard (Recovery Guardrail): introduces a control-theoretic approach to generative AI guardrails that formalizes AI safety as a sequential decision problem, learning predictive guardrails to monitor and proactively correct risky LLM outputs in real-time.
This framework operates in the LLM's latent representation of the world, enabling model-agnostic guardrails that can be trained via safety-critical reinforcement learning to detect and recover from unsafe states.
It moves beyond traditional flag-and-block guardrails by providing a principled dynamic alternative that balances safety and task efficiency, demonstrated in autonomous driving, e-commerce, and AI assistant scenarios.

Training LLM Agents to Empower Humans

Empower: introduces a self-supervised method for fine-tuning LLM agents to better assist humans by maximizing human empowerment, defined as the human's ability to effect desired changes in the environment, using offline text data and a logit threshold mechanism to identify predictable code for completion.
The framework trains an LLM agent to complete predictable text, allowing the human user to focus on important design decisions rather than boilerplate code, thereby increasing their control over future outcomes.
Empower demonstrates that LLM assistants can be aligned without explicit human feedback or verifiable rewards by reasoning about how their actions enable humans to complete tasks more quickly.
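
The thresholding idea, completing only the text the model finds predictable and handing control back at the first real decision point, can be sketched as below. This is a simplification under stated assumptions: it thresholds per-token probabilities rather than reproducing the paper's exact logit-based criterion, and the function name, input layout, and default `min_prob` are hypothetical.

```python
import math


def predictable_prefix(candidates, min_prob=0.9):
    """Return the longest prefix of tokens whose probability stays above min_prob.

    candidates is a list of (token, logprob) pairs for a proposed completion.
    The assistant would insert only this prefix and stop at the first token the
    model is unsure about, leaving that choice to the human.
    """
    threshold = math.log(min_prob)
    accepted = []
    for token, logprob in candidates:
        if logprob < threshold:
            break
        accepted.append(token)
    return accepted
```

A usage example: for a proposed completion `for i in range(` where the model is confident about `for`, ` i`, and ` in` but not about what follows, only those three tokens are suggested.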

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Deflanderization for Game Dialogue: introduces a novel approach for LLM-based NPCs in game dialogue, combining lightweight prompting techniques and fine-tuned large models to balance character authenticity with task execution.
The approach employs a Deflanderization prompting method to prevent excessive role-play and improve task fidelity, alongside Retrieval Augmented Generation and Supervised Finetuning for robust dialogue grounding.
The framework addresses the challenge of maintaining consistent NPC personas and executing tasks in fantasy RPG environments, achieving high rankings in a dialogue challenge.

Steer-MoE: Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module

SteerMoE (Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module): introduces a novel and modular framework for audio-language alignment, utilizing a lightweight steering module with a Mixture-of-Experts router to dynamically transform continuous audio representations for a frozen LLM decoder.
The framework freezes both the audio encoder and the LLM decoder, training only the steering module to preserve the LLM's reasoning capabilities and enable plug-and-play component interchangeability.
SteerMoE achieves strong performance on ASR and audio understanding tasks, demonstrating a parameter-efficient and modular approach to multimodal AI by operating entirely in the continuous embedding space.
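
A mixture-of-experts steering step in continuous embedding space can be sketched without any framework: a router scores the experts for each audio frame, and the gate-weighted mix of expert vectors shifts the frame's representation. The additive form and all names here are assumptions for illustration; the paper's module may combine experts differently.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(frame, router_weights, steering_vectors):
    """Shift one audio frame embedding by a router-weighted mix of expert vectors.

    router_weights holds one weight vector per expert (used to score the frame),
    and steering_vectors holds one trainable offset per expert. Only these two
    would be trained; the encoder producing `frame` and the decoder stay frozen.
    """
    gates = softmax([dot(w, frame) for w in router_weights])
    return [h + sum(g * v[i] for g, v in zip(gates, steering_vectors))
            for i, h in enumerate(frame)]
```

With an all-zero router the gates are uniform, so the frame is shifted by the plain average of the expert vectors; training the router is what makes the shift depend on the input.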

In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers

In-Browser LLM-Guided Fuzzing Framework: introduces an in-browser, LLM-guided fuzzing framework for real-time prompt injection testing in agentic AI browsers, with Fuzzing Controller, LLM Integration Layer, Browser Automation Layer, and Data Collection and Analytics components, designed to automatically discover prompt injection vulnerabilities in real-time by generating and testing malicious webpage content within a live browser environment.
The framework leverages LLMs to generate diverse and evolving attack content, using a real-time feedback loop to refine attack strategies based on the AI agent's observed behavior and actions.
This approach enables high-fidelity testing with full DOM context and action monitoring, demonstrating that static pattern-matching defenses are insufficient against adaptive, LLM-guided prompt injection attacks.

Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Type-Induced Bayesian Persuasion (BP): introduces a framework for implementing Bayesian Persuasion in natural language dialogues without pre-commitment, leveraging a commitment-communication mechanism where the persuader explicitly narrates potential types to guide the persuadee's Bayesian belief update.
The framework integrates a Bayesian setup, a composite signal structure (m_basic, m_type, m_des, m_inf), and a type-induced information schema (Sender Types, Base Policies, Schema Induction) to facilitate the Receiver's inference and decision process, implemented through Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP.
Experimental results show that BP-guided LLMs consistently outperform non-BP baselines, with SFNL excelling in credibility and logical coherence, while FNL demonstrates stronger emotional resonance and robustness, and supervised fine-tuning enables smaller models to achieve comparable performance to larger models.
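
The Receiver's Bayesian belief update over Sender types can be made concrete with a small numeric sketch. The type names, signal names, and probabilities below are invented for illustration and do not come from the paper; only Bayes' rule itself is standard.

```python
def posterior(prior, likelihood, signal):
    """Bayes' rule over sender types after observing one signal.

    prior maps each type to its prior probability, and likelihood[t][signal]
    is the probability that a sender of type t emits that signal.
    """
    unnormalized = {t: prior[t] * likelihood[t][signal] for t in prior}
    total = sum(unnormalized.values())
    return {t: value / total for t, value in unnormalized.items()}


# Hypothetical example: two sender types with different signaling policies.
PRIOR = {"honest": 0.5, "exaggerating": 0.5}
LIKELIHOOD = {
    "honest": {"detailed_evidence": 0.8, "vague_praise": 0.2},
    "exaggerating": {"detailed_evidence": 0.2, "vague_praise": 0.8},
}
```

Under these made-up numbers, observing `detailed_evidence` moves the Receiver's belief that the Sender is honest from 0.5 to roughly 0.8, which is the kind of inference that narrating the possible types is meant to support.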

MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation

MADREC (Multi-Aspect Driven LLM Agent): introduces an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews, performs direct and sequential recommendation, generates explanations, and dynamically adjusts inference criteria via a SELF-FEEDBACK mechanism.
The framework leverages MEMORY to store user and item profiles, TOOLS for aspect extraction, summarization, and re-ranking, and TASKS for various recommendation objectives, all integrated within an active agent architecture.
MADREC enhances explainability and adaptivity by generating structured profiles, re-ranking candidate items based on multi-aspect relevance, and iteratively refining recommendations through self-feedback, outperforming traditional and LLM-based baselines.

Document Intelligence in the Era of Large Language Models: A Survey (http://arxiv.org/abs/2510.13366)

DI-LLM (Document Intelligence in the Era of Large Language Models): introduces a comprehensive survey of Document AI (DAI) advancements, categorizing tasks into understanding and generation, integrating multimodal and multilingual capabilities, and leveraging retrieval-augmented methods, covering Multimodal Document AI (integrating diverse modalities), Multilingual Document AI (handling diverse languages), the Retrieval-Augmented Paradigm (leveraging external knowledge), the DocAgent Framework (agent-based document processing), and the DocAgent Foundation Model (domain-aware, cross-modal models).
The survey explores key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions including agent-based approaches and document-specific foundation models.
It provides a structured analysis of the state-of-the-art in DAI, highlighting the evolution of LLM-based approaches and their implications for both academic and practical applications.
