Research
80 hot items · ranked by score
- 01 The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason
arXiv:2604.15350v1 Announce Type: new Abstract: We discover that large language models exhibit \emph{spectral phase transitions} in their hidden activation spaces when engaging in reasoning versus factual recall. Through systematic spectral analysis across \textbf{11 models} spanning \textbf{5 architecture families} (Qwen, Pythia, Phi, Llama, DeepSeek-R1), we identify \textbf{seven} core phenomen…
◆ 1.8 #cs.lg
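The kind of spectral analysis this abstract gestures at can be made concrete with a toy sketch. Everything below is our own illustration, not the paper's method: we compute the singular-value spectrum of a (tokens × hidden_dim) activation matrix and summarize how concentrated it is with a normalized spectral entropy, the sort of statistic under which a "phase transition" would show up as a sharp shift.

```python
import numpy as np

def spectral_entropy(hidden: np.ndarray) -> float:
    """Normalized entropy of the singular-value spectrum of a
    (tokens x hidden_dim) activation matrix. Values near 1 mean the
    activation energy is spread across many directions; values near 0
    mean it collapses onto a few."""
    s = np.linalg.svd(hidden - hidden.mean(axis=0), compute_uv=False)
    p = s**2 / np.sum(s**2)          # spectral "probability" mass
    p = p[p > 0]                     # drop exact zeros before the log
    return float(-np.sum(p * np.log(p)) / np.log(len(s)))

rng = np.random.default_rng(0)
diffuse = rng.normal(size=(64, 32))                              # energy spread out
low_rank = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 32))   # collapsed onto 2 dims

assert spectral_entropy(diffuse) > spectral_entropy(low_rank)
```

A real analysis would pull `hidden` from a model's residual stream per layer and track how this statistic jumps between reasoning and recall prompts.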
- 02 Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
arXiv:2604.14171v1 Announce Type: new Abstract: Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Ll…
◆ 1.6 #cs.cl #cs.ai
- 03 Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
arXiv:2604.15709v1 Announce Type: new Abstract: Agent \texttt{skills} are structured collections of instructions, tools, and supporting resources that help large language model (LLM) agents perform particular classes of tasks. Empirical evidence shows that the design of \texttt{skills} can materially affect agent task performance, yet systematically optimizing \texttt{skills} remains challenging.…
◆ 1.3 #cs.ai
- 04 The World Leaks the Future: Harness Evolution for Future Prediction Agents
arXiv:2604.15719v1 Announce Type: new Abstract: Many consequential decisions must be made before the relevant outcome is known. Such problems are commonly framed as \emph{future prediction}, where an LLM agent must form a prediction for an unresolved question using only the public information available at the prediction time. The setting is difficult because public evidence evolves while useful s…
◆ 1.3 #cs.ai
- 05 LLM Reasoning Is Latent, Not the Chain of Thought
arXiv:2604.15726v1 Announce Type: new Abstract: This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary obje…
◆ 1.3 #cs.ai
- 06 Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants
arXiv:2604.15727v1 Announce Type: new Abstract: Large language models exhibit systematic limitations in structured logical reasoning: they conflate hypothesis generation with verification, cannot distinguish conjecture from validated knowledge, and allow weak reasoning steps to propagate unchecked through inference chains. We present a symbolic reasoning scaffold that operationalizes Peirce's tri…
◆ 1.3 #cs.ai #cs.lg
- 07 KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
arXiv:2604.15760v1 Announce Type: new Abstract: We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against …
◆ 1.3 #cs.ai #cs.gt
- 08 Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
arXiv:2604.15877v1 Announce Type: new Abstract: As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge -- extracting reusable knowledge from interaction traces -- yet a citation analysis of 1,136 references across 22 primary papers reveals …
◆ 1.3 #cs.ai #cs.cl
- 09 Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval
arXiv:2604.15951v1 Announce Type: new Abstract: Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provid…
◆ 1.3 #cs.ai
- 10 Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
arXiv:2604.15972v1 Announce Type: new Abstract: LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressin…
◆ 1.3 #cs.ai #cs.cl
- 11 ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
arXiv:2604.15994v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. E…
◆ 1.3 #cs.ai
- 12 SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
arXiv:2604.16022v1 Announce Type: new Abstract: As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations revea…
◆ 1.3 #cs.ai #cs.lg
- 13 Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
arXiv:2604.16258v1 Announce Type: new Abstract: Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Gener…
◆ 1.3 #cs.ai
- 14 Learning to Reason with Insight for Informal Theorem Proving
arXiv:2604.16278v1 Announce Type: new Abstract: Although most automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core …
◆ 1.3 #cs.ai #cs.cl
- 15 Anthropomorphism and Trust in Human-Large Language Model interactions
arXiv:2604.15316v1 Announce Type: cross Abstract: With large language models (LLMs) becoming increasingly prevalent in daily life, so too has the tendency to attribute to them human-like minds and emotions, or anthropomorphize them. Here, we investigate dimensions people use to anthropomorphize and attribute trust toward LLMs across more than 2,000 human-LLM interactions. Participants (N=115) eng…
◆ 1.3 #cs.hc #cs.ai
- 16 Explainable Iterative Data Visualisation Refinement via an LLM Agent
arXiv:2604.15319v1 Announce Type: cross Abstract: Exploratory analysis of high-dimensional data relies on embedding the data into a low-dimensional space (typically 2D or 3D), based on which visualization plot is produced to uncover meaningful structures and to communicate geometric and distributional data characteristics. However, finding a suitable algorithm configuration, particularly hyperpar…
◆ 1.3 #cs.hc #cs.ai
- 17 Evaluating LLMs as Human Surrogates in Controlled Experiments
arXiv:2604.15329v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accurac…
◆ 1.3 #cs.hc #cs.ai
- 18 How people use Copilot for Health
arXiv:2604.15331v1 Announce Type: cross Abstract: We analyze over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026 to characterize what people ask conversational AI about health. We develop a hierarchical intent taxonomy of 12 primary categories using privacy-preserving LLM-based classification validated against expert human annotation, and apply LLM-dri…
◆ 1.3 #cs.hc #cs.ai
- 19 Facial-Expression-Aware Prompting for Empathetic LLM Tutoring
arXiv:2604.15336v1 Announce Type: cross Abstract: Large language models (LLMs) enable increasingly capable tutoring-style conversational agents, yet effective tutoring requires sensitivity to learners' affective and cognitive states beyond text alone. Facial expressions provide immediate and practical cues of confusion, frustration, or engagement, but remain underexplored in LLM-driven tutoring. …
◆ 1.3 #cs.hc #cs.ai
- 20 MRGEN: A Conceptual Framework for LLM-Powered Mixed Reality Authoring Tools for Education
arXiv:2604.15341v1 Announce Type: cross Abstract: Mixed Reality (MR) offers immersive and multimodal opportunities for education but remains difficult for teachers to author without technical expertise. We propose MRGEN, a conceptual framework for LLM-powered authoring tools to support teachers in creating MR learning activities that work on mobile devices (tablets and smartphones). MRGEN articul…
◆ 1.3 #cs.hc #cs.ai
- 21 Applied Explainability for Large Language Models: A Comparative Study
arXiv:2604.15371v1 Announce Type: new Abstract: Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques…
◆ 1.3 #cs.cl #cs.ai
- 22 PolicyBank: Evolving Policy Understanding for LLM Agents
arXiv:2604.15505v1 Announce Type: new Abstract: LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve it…
◆ 1.3 #cs.cl #cs.ai
- 23 Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
arXiv:2604.15547v1 Announce Type: new Abstract: The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs' inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern …
◆ 1.3 #cs.cl #cs.ai
- 24 "Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations
arXiv:2604.15588v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabSc…
◆ 1.3 #cs.cl #cs.ai
- 25 LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance
arXiv:2604.15589v1 Announce Type: new Abstract: Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors …
◆ 1.3 #cs.cl #cs.ai
- 26 LLMs Corrupt Your Documents When You Delegate
arXiv:2604.15597v1 Announce Type: new Abstract: Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems i…
◆ 1.3 #cs.cl #cs.hc
- 27 GroupDPO: Memory efficient Group-wise Direct Preference Optimization
arXiv:2604.15602v1 Announce Type: new Abstract: Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work ex…
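The group-wise extension the abstract motivates can be sketched in miniature. This is an illustrative pairwise-over-group formulation we chose for the example, not necessarily the paper's actual objective: given several candidate responses ordered by preference, the standard DPO logistic loss is applied to every (preferred, dispreferred) pair instead of a single pair.

```python
import math
from itertools import combinations

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def group_dpo_loss(logratios, beta: float = 0.1) -> float:
    """Pairwise DPO loss averaged over every (preferred, dispreferred)
    pair in a group of candidate responses.

    `logratios` lists log pi_theta(y|x) - log pi_ref(y|x) for each
    candidate, ordered best-first by the preference annotation."""
    losses = []
    for i, j in combinations(range(len(logratios)), 2):  # i is preferred over j
        margin = beta * (logratios[i] - logratios[j])
        losses.append(-math.log(sigmoid(margin)))
    return sum(losses) / len(losses)

# A policy that already ranks the candidates correctly incurs lower loss
# than one that ranks them in reverse.
aligned   = group_dpo_loss([2.0, 0.5, -1.0])
misranked = group_dpo_loss([-1.0, 0.5, 2.0])
assert aligned < misranked
```

The memory question the title raises comes from holding all candidates' policy and reference log-probabilities in one batch; any real implementation would have to manage that, which this toy omits.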
◆ 1.3 #cs.cl
- 28 FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
arXiv:2604.15646v1 Announce Type: new Abstract: Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decompos…
◆ 1.3 #cs.cl
- 29 C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
arXiv:2604.15675v1 Announce Type: new Abstract: Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating…
◆ 1.3 #cs.cl
- 30 Preference Estimation via Opponent Modeling in Multi-Agent Negotiation
arXiv:2604.15687v1 Announce Type: new Abstract: Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLM…
◆ 1.3 #cs.cl
- 31 The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
arXiv:2604.15702v1 Announce Type: new Abstract: We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive…
◆ 1.3 #cs.cl #cs.lg
- 32 Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
arXiv:2604.15741v1 Announce Type: new Abstract: Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens…
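A minimal sketch of what "sequential internal dispersion" could look like as a statistic, using our own illustrative choice (scatter of per-token hidden states around their mean) rather than the paper's exact estimator: the score aggregates over the whole token sequence instead of reading only the last or mean token.

```python
import numpy as np

def sequential_dispersion(hidden: np.ndarray) -> float:
    """Toy uncertainty score: how much the per-token hidden states of a
    generated answer scatter around their mean, aggregated over the
    whole sequence. `hidden` is (tokens x hidden_dim)."""
    deltas = hidden - hidden.mean(axis=0)
    return float(np.mean(np.linalg.norm(deltas, axis=1)))

rng = np.random.default_rng(1)
stable  = rng.normal(scale=0.1, size=(32, 16))  # tightly clustered states
erratic = rng.normal(scale=2.0, size=(32, 16))  # widely scattered states
assert sequential_dispersion(erratic) > sequential_dispersion(stable)
```

The appeal of this family of scores is that, unlike layer-evolution assumptions, it needs nothing beyond the hidden states of a single forward pass.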
◆ 1.3 #cs.cl #cs.ai
- 33 MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
arXiv:2604.15774v1 Announce Type: new Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevoluti…
◆ 1.3 #cs.cl
- 34 A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
arXiv:2604.15789v1 Announce Type: new Abstract: As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-fre…
◆ 1.3 #cs.cl
- 35 CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents
arXiv:2604.15802v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstru…
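The iterative relevance loop the abstract describes can be sketched as follows. All names are ours, and the stand-in scorer is deliberately crude token overlap; the paper would presumably have an LLM grade each chunk's relevance in context.

```python
def iterative_chunk_filter(question, chunks, score_fn, threshold=0.2, rounds=3):
    """Keep only chunks whose relevance to the question stays above a
    threshold, re-scoring the surviving set each round so that chunks
    that only looked relevant next to noisy neighbours can be pruned."""
    kept = list(chunks)
    for _ in range(rounds):
        scored = [(score_fn(question, c, kept), c) for c in kept]
        survivors = [c for s, c in scored if s >= threshold]
        if len(survivors) == len(kept):
            break  # no chunk was pruned this round: converged
        kept = survivors
    return kept

# Stand-in scorer: token overlap with the question (a real system would
# ask an LLM to grade relevance here; the context argument is unused).
def overlap_score(question, chunk, _context):
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

chunks = ["solar panels convert sunlight", "the cat sat on the mat",
          "sunlight hits solar cells"]
kept = iterative_chunk_filter("how do solar panels work", chunks, overlap_score)
assert "the cat sat on the mat" not in kept
```

Passing the surviving set (`kept`) back into the scorer is what makes the loop "context-preserving" in spirit: relevance can be judged relative to what else remains, not in isolation.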
◆ 1.3 #cs.cl
- 36 Qwen3.5-Omni Technical Report
arXiv:2604.15804v1 Announce Type: new Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of …
◆ 1.3 #cs.cl #eess.as
- 37 CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
arXiv:2604.15840v1 Announce Type: new Abstract: Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-…
◆ 1.3 #cs.cl
- 38 Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language
arXiv:2604.15841v1 Announce Type: new Abstract: While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP task…
◆ 1.3 #cs.cl
- 39 Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
arXiv:2604.15842v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task exe…
◆ 1.3 #cs.cl
- 40 DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition
arXiv:2604.15866v1 Announce Type: new Abstract: Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inc…
◆ 1.3 #cs.cl #cs.ai
- 41 How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
arXiv:2604.15873v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role al…
◆ 1.3 #cs.cl
- 42 RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
arXiv:2604.15945v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typical…
◆ 1.3 #cs.cl #cs.lg
- 43 AgentV-RL: Scaling Reward Modeling with Agentic Verifier
arXiv:2604.16004v1 Announce Type: new Abstract: Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge…
◆ 1.3 #cs.cl #cs.ai
- 44 Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
arXiv:2604.15400v1 Announce Type: new Abstract: We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B acros…
◆ 1.2 #cs.lg #cs.ai
- 45 PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
arXiv:2604.15411v1 Announce Type: new Abstract: The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, …
◆ 1.2 #cs.lg #cs.ai
- 46 Evaluating LLM Simulators as Differentially Private Data Generators
arXiv:2604.15461v1 Announce Type: new Abstract: LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with D…
◆ 1.2 #cs.lg #cs.cl
- 47 Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
arXiv:2604.15482v1 Announce Type: new Abstract: Large Language Models (LLMs) unlearning is crucial for removing hazardous or privacy-leaking information from the model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robust…
◆ 1.2 #cs.lg #cs.ai
- 48 FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
arXiv:2604.15488v1 Announce Type: new Abstract: Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigi…
◆ 1.2 #cs.lg #cs.ai
- 49 Faster LLM Inference via Sequential Monte Carlo
arXiv:2604.15672v1 Announce Type: new Abstract: Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we p…
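For context, the rejection-sampling verification step that this abstract says SMC would replace is the standard speculative-decoding rule: accept a drafted token with probability min(1, p_target/p_draft), which preserves the target model's output distribution exactly. A minimal sketch with toy distributions (all names ours):

```python
import random

def accept_draft_token(token, p_target, p_draft, rng=random):
    """Standard speculative-decoding verification: accept the drafted
    token with probability min(1, p_target/p_draft)."""
    ratio = p_target.get(token, 0.0) / p_draft[token]
    return rng.random() < min(1.0, ratio)

# Toy next-token distributions over a 3-token vocabulary.
p_draft  = {"a": 0.6, "b": 0.3, "c": 0.1}
p_target = {"a": 0.3, "b": 0.6, "c": 0.1}

rng = random.Random(0)
accepts = sum(accept_draft_token("a", p_target, p_draft, rng) for _ in range(10_000))
# Empirical acceptance rate should sit near min(1, 0.3/0.6) = 0.5.
assert 0.45 < accepts / 10_000 < 0.55
```

The throughput problem the abstract points at is visible here: a rejection truncates the whole remaining draft block, so low acceptance rates waste most of the drafted tokens.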
◆ 1.2 #cs.lg #cs.cl
- 50 Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
arXiv:2604.15705v1 Announce Type: new Abstract: Reinforcement Fine-Tuning (RFT) has established itself as a critical paradigm for the alignment of Multi-modal Large Language Models (MLLMs) with complex human values and domain-specific requirements. Nevertheless, current research primarily focuses on mitigating exogenous distribution shifts arising from data-centric factors, the non-stationarity i…
◆ 1.2 #cs.lg
- 51 EVE: A Domain-Specific LLM Framework for Earth Intelligence
arXiv:2604.13071v1 Announce Type: cross Abstract: We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sc…
◆ 1.1 #cs.cl #cs.ai
- 52 LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text
arXiv:2604.14321v1 Announce Type: new Abstract: We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximate…
◆ 1.1 #cs.cl
- 53 DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI
arXiv:2604.15456v1 Announce Type: new Abstract: Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems l…
◆ 1.0 #cs.ai
- 54 GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
arXiv:2604.15495v1 Announce Type: new Abstract: Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While…
◆ 1.0 #cs.ai #cs.cv
- 55 Bureaucratic Silences: What the Canadian AI Register Reveals, Omits, and Obscures
arXiv:2604.15514v1 Announce Type: new Abstract: In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register's comp…
◆ 1.0 #cs.ai #cs.cy
- 56 LACE: Lattice Attention for Cross-thread Exploration
arXiv:2604.15529v1 Announce Type: new Abstract: Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing t…
◆ 1.0 #cs.ai
- 57 Preregistered Belief Revision Contracts
arXiv:2604.15558v1 Announce Type: new Abstract: Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions…
◆ 1.0 #cs.ai #cs.cl
- 58 Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
arXiv:2604.15559v1 Announce Type: new Abstract: Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first …
◆ 1.0 #cs.ai
- 59 Stein Variational Black-Box Combinatorial Optimization
arXiv:2604.15837v1 Announce Type: new Abstract: Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single re…
◆ 1.0 #cs.ai
- 60 Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
arXiv:2604.15839v1 Announce Type: new Abstract: Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer b…
◆ 1.0 #cs.ai #cs.cl
- 61 Towards Rigorous Explainability by Feature Attribution
arXiv:2604.15898v1 Announce Type: new Abstract: For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley val…
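The Shapley values at issue here can be computed exactly for tiny games by enumerating all coalitions, which makes the rigor question concrete: exact computation is exponential in the number of features, so practical attribution tools approximate, and the abstract's complaint is about what those approximations (and the Shapley framing itself) can get wrong. Illustrative sketch, all names ours:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by full enumeration of coalitions --
    feasible only for a handful of features."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(rest, k):
                # Weight |S|! (n-|S|-1)! / n! for coalition S of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(set(coalition) | {p}) - value(set(coalition)))
    return phi

# Additive toy model: each feature contributes its own weight, so the
# Shapley value of each feature must equal exactly that weight.
weights = {"x1": 2.0, "x2": -1.0, "x3": 0.5}
phi = shapley_values(list(weights), lambda s: sum(weights[f] for f in s))
assert all(abs(phi[f] - weights[f]) < 1e-9 for f in weights)
```

On this additive game the attribution is unambiguous; the disputes start when the value function over feature subsets is itself an ill-defined proxy for the model, which is where the "provable lack of rigor" argument bites.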
◆ 1.0 #cs.ai
- 62 MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
arXiv:2604.16009v1 Announce Type: new Abstract: Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models…
◆ 1.0 #cs.ai
- 63 MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
arXiv:2604.16175v1 Announce Type: new Abstract: Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To…
◆ 1.0 #cs.ai #cs.cv
- 64 Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing
arXiv:2604.16280v1 Announce Type: new Abstract: Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explan…
◆ 1.0#cs.ai - 65ASMR-Bench: Auditing for Sabotage in ML Research
arXiv:2604.16286v1 Announce Type: new Abstract: As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consi…
◆ 1.0#cs.ai - 66Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories
arXiv:2308.10562v2 Announce Type: cross Abstract: The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic …
◆ 1.0#cs.cv#cs.ai - 67Modeling of ASD/TD Children's Behaviors in Interaction with a Virtual Social Robot During a Music Education Program Using Deep Neural Networks
arXiv:2604.15314v1 Announce Type: cross Abstract: This research aimed to develop an intelligent system to evaluate performance and extract behavioral models for children with ASD and neurotypical (TD) children by interacting with a virtual social robot in a music education program using deep neural networks. The system has two main features: 1) it distinguishes between neurotypical children and t…
◆ 1.0#cs.hc#cs.ai - 68Struggle Premium: How Human Effort and Imperfection Drive Perceived Value in the Age of AI
arXiv:2604.15324v1 Announce Type: cross Abstract: As AI enters creative practice, audiences face growing uncertainty in judging authenticity and value. This study examines the Struggle Premium, the added value attributed to perceived human effort, by analyzing how visible effort cues influence evaluations of human- and AI-generated creative works. We surveyed 70 university students, focusing on p…
◆ 1.0#cs.hc#cs.ai - 69Eco-Bee: A Personalised Multi-Modal Agent for Advancing Student Climate Awareness and Sustainable Behaviour in Campus Ecosystems
arXiv:2604.15327v1 Announce Type: cross Abstract: Universities are microcosms of urban ecosystems, with concentrated consumption patterns in food, transport, energy, and product usage. These environments not only contribute substantially to sustainability pressures but also provide a unique opportunity to advance sustainability education and behavioural change at scale. As in most sectors, digita…
◆ 1.0#cs.hc#cs.ai - 70Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts
arXiv:2604.15332v1 Announce Type: cross Abstract: Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A th…
◆ 1.0#cs.hc#cs.ai - 71Technically Love: The Evolution of Human-AI Romance Discourse on Reddit
arXiv:2604.15333v1 Announce Type: cross Abstract: Human-AI romantic relationships are increasingly common, yet little is understood about how public discourse around them emerges and shifts over time. Prior research has examined user experiences and ethical concerns, but lacks longitudinal analyses of user-initiated public discussions. We address this gap by analyzing a high-precision dataset of …
◆ 1.0#cs.hc#cs.ai - 72Beyond Passive Viewing: A Pilot Study of a Hybrid Learning Platform Augmenting Video Lectures with Conversational AI
arXiv:2604.15334v1 Announce Type: cross Abstract: The exponential growth of AI education has brought millions of learners to online platforms, yet this massive scale has simultaneously exposed critical pedagogical shortcomings. Traditional video-based instruction, while cost-effective and scalable, demonstrates systematic failures in both sustaining learner engagement and facilitating the deep co…
◆ 1.0#cs.hc#cs.ai - 73A Comparative Study on the Impact of Traditional Learning and Interactive Learning on Students' Academic Performance and Emotional Well-Being
arXiv:2604.15335v1 Announce Type: cross Abstract: The growing adoption of interactive learning tools in higher education offers new opportunities to enhance student performance and well-being. This study compares the effects of traditional and interactive learning methods on academic performance, engagement, motivation, and emotional well-being among 100 university students enrolled in a computer…
◆ 1.0#cs.hc#cs.ai - 74Uncertainty, Vagueness, and Ambiguity in Human-Robot Interaction: Why Conceptualization Matters
arXiv:2604.15339v1 Announce Type: cross Abstract: Uncertainty, vagueness, and ambiguity are closely related and often confused concepts in human-robot interaction (HRI). In earlier studies, these concepts have been defined in contradictory ways and described using inconsistent terminology. This conceptual confusion and lack of terminological consistency undermine empirical comparability, thereby …
◆ 1.0#cs.hc#cs.ai - 75Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch
arXiv:2604.15490v1 Announce Type: new Abstract: Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-…
◆ 1.0#cs.cl - 76Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
arXiv:2604.15503v1 Announce Type: new Abstract: Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To underst…
◆ 1.0#cs.cl - 77Why Fine-Tuning Encourages Hallucinations and How to Fix It
arXiv:2604.15574v1 Announce Type: new Abstract: Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using …
◆ 1.0#cs.cl#cs.ai - 78DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
arXiv:2604.15593v1 Announce Type: new Abstract: Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it fir…
◆ 1.0#cs.cl#cs.ai - 79Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
arXiv:2604.15607v1 Announce Type: new Abstract: AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 s…
◆ 1.0#cs.cl#cs.ai - 80CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
arXiv:2604.15647v1 Announce Type: new Abstract: Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize C…
◆ 1.0#cs.cl