arXiv cs.CL
↗ arxiv.org/list/cs.CL/recent · research · en · weight 1.2
- IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
arXiv:2604.16654v1 Announce Type: new Abstract: Reclaimed slur usage is a common and meaningful practice online for many marginalized communities. It serves as a source of solidarity, identity, and shared experience. However, contemporary automated and AI-based moderation tools for online content largely fail to distinguish between reclaimed and hateful uses of slurs, resulting in the suppression…
◆ 1.1#cs.cl - Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning
arXiv:2604.16378v1 Announce Type: new Abstract: Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentia…
◆ 1.4#cs.cl#cs.lg - Data Mixing for Large Language Models Pretraining: A Survey and Outlook
arXiv:2604.16380v1 Announce Type: new Abstract: Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limi…
◆ 1.4#cs.cl#cs.lg - Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
arXiv:2604.16368v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM frame…
◆ 1.4#cs.cl - AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
arXiv:2604.16625v1 Announce Type: new Abstract: Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages s…
◆ 1.4#cs.cl#cs.ai - Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
arXiv:2604.16622v1 Announce Type: new Abstract: Backchannels (e.g., 'yeah', 'mhm', and 'right') are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage …
◆ 1.4#cs.cl#cs.ai - Spotlights and Blindspots: Evaluating Machine-Generated Text Detection
arXiv:2604.16607v1 Announce Type: new Abstract: With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure comparisons of model effectiveness. To address this, we evaluate 15 different detection models from six distinct systems…
◆ 1.1#cs.cl#cs.ai - Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
arXiv:2604.16593v1 Announce Type: new Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiom…
◆ 1.1#cs.cl - SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future
arXiv:2604.16451v1 Announce Type: new Abstract: Recent advances in visual-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial…
◆ 1.1#cs.cl#cs.cv - HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders
arXiv:2604.16430v1 Announce Type: new Abstract: Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find most of them overlook its dynamic nature and underlying mechanisms. To address this gap, we propose HalluSAE, a phase tr…
◆ 1.4#cs.cl#cs.ai - Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG
arXiv:2604.16422v1 Announce Type: new Abstract: The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds kn…
◆ 1.1#cs.cl#cs.ai - Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis
arXiv:2604.16376v1 Announce Type: new Abstract: This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and comp…
◆ 1.1#cs.cl#cs.cr - CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark
arXiv:2604.16372v1 Announce Type: new Abstract: Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social …
◆ 1.1#cs.cl#cs.ai - Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction
arXiv:2604.16370v1 Announce Type: new Abstract: Decoding natural language from non-invasive electroencephalography (EEG) remains fundamentally limited by low signal-to-noise ratio and restricted information bandwidth. This raises a fundamental question regarding whether sentence-level linguistic structure can be reliably recovered from such signals. In this work, we suggest that this assumption m…
◆ 1.1#cs.cl#cs.ai - Multimodal Claim Extraction for Fact-Checking
arXiv:2604.16311v1 Announce Type: new Abstract: Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today's misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studie…
◆ 1.1#cs.cl#cs.ai - A Community-Based Approach for Stance Distribution and Argument Organization
arXiv:2604.16852v1 Announce Type: new Abstract: The proliferation of online debate platforms and social media has led to an unprecedented volume of argumentative content on controversial topics from multiple perspectives. While this wealth of perspectives offers opportunities for developing critical thinking and breaking filter bubbles (Pariser 2011), the sheer volume and complexity of arguments …
◆ 1.1#cs.cl - DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
arXiv:2604.16845v1 Announce Type: new Abstract: Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatm…
◆ 1.4#cs.cl - Detecting Alarming Student Verbal Responses using Text and Audio Classifier
arXiv:2604.16717v1 Announce Type: new Abstract: This paper addresses a critical safety gap in the use of Automated Verbal Response Scoring (AVRS). We present a novel hybrid framework for troubled student detection that combines a text classifier, trained to detect responses based on their content, and an audio classifier, trained to detect responses using prosodic markers. This approach overcomes ke…
◆ 1.1#cs.cl#cs.ir - x1: Learning to Think Adaptively Across Languages and Cultures
arXiv:2604.16917v1 Announce Type: new Abstract: Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language …
◆ 1.4#cs.cl - When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints
arXiv:2604.16916v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a s…
◆ 1.4#cs.cl - Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
arXiv:2604.16881v1 Announce Type: new Abstract: Cross-cultural entity translation remains challenging for large language models (LLMs) as literal or phonetic renderings are usually yielded instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric …
◆ 1.4#cs.cl#cs.ai - PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
arXiv:2604.16909v1 Announce Type: new Abstract: As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior evaluation, output-level scoring, which quantifies hallucination severity but offers limited insight into wh…
◆ 1.4#cs.cl#cs.ai - Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
arXiv:2604.16889v1 Announce Type: new Abstract: Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior, with the rest resulting in expensive feature explaining and evaluating costs. We introduce the first CLT-native end-to-end framework, PIE, connecting Pruning, automat…
◆ 1.1#cs.cl - HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents
arXiv:2604.16839v1 Announce Type: new Abstract: Long-term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversation history as unstructured embedding vectors, retrieving information through semantic similarity. This paradigm fails to capture the associative structu…
◆ 1.4#cs.cl - Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
arXiv:2604.16826v1 Announce Type: new Abstract: Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usually treat the LoRA update $\Delta W = BA$ as a single object and do not distinguish the two LoRA matrices. We show that the main source of LoRA merge interference comes from the output-side matrix $B$…
◆ 1.1#cs.cl - When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations
arXiv:2604.16787v1 Announce Type: new Abstract: We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M) and RoBERTa-large (355M) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., "going to" -> "gonna", "friend" ->…
◆ 1.1#cs.cl#cs.ai - StageMem: Lifecycle-Managed Memory for Language Models
arXiv:2604.16774v1 Announce Type: new Abstract: Long-horizon language model systems increasingly rely on persistent memory, yet many current designs still treat memory primarily as a static store: write an item, place it into memory, and retrieve it later if needed. We argue that this framing does not adequately capture the practical memory-control problem in deployed LLM systems. In realistic se…
◆ 1.4#cs.cl#cs.ai - When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms
arXiv:2604.16767v1 Announce Type: new Abstract: Audio platforms have evolved beyond entertainment. They have become central to public discourse, from podcasts and radio to WhatsApp voice notes and live streams. With millions of shows and hundreds of millions of listeners, audio platforms are now a major channel for misinformation. Yet existing fact-checking pipelines are mostly designed for writt…
◆ 1.1#cs.cl#cs.cy - Expressing Social Emotions: Misalignment Between LLMs and Human Cultural Emotion Norms
arXiv:2604.16757v1 Announce Type: new Abstract: The expression of emotions that serve social purposes, such as asserting independence or fostering interdependence, is central to human interactions and varies systematically across cultures. As LLMs are increasingly used to simulate human behavior in culturally nuanced interactions, it is important to understand whether they faithfully capture huma…
◆ 1.4#cs.cl#cs.cy - Evaluating Adaptive Personalization of Educational Readings with Simulated Learners
arXiv:2604.16744v1 Announce Type: new Abstract: We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned readin…
◆ 1.1#cs.cl#cs.ai - No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
arXiv:2604.16686v1 Announce Type: new Abstract: Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative. We formalize neutral regression as a do-no-harm requirement and q…
◆ 1.4#cs.cl#cs.ai - CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering for Multi-Platform Social Streams
arXiv:2604.16665v1 Announce Type: new Abstract: Urgent blood donation seeking posts and messages on social media often go unnoticed due to the overwhelming volume of daily communications. Traditional app-based systems, reliant on manual input, struggle to reach users in low-resource settings, delaying critical responses. To address this, we introduce the Cognitive Blood Request System (CBRS), a m…
◆ 1.1#cs.cl - Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
arXiv:2604.16656v1 Announce Type: new Abstract: All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictate the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts observe a poor exchange rate: LLMs take several multiples of tokens to encode the same information in many langu…
◆ 1.4#cs.cl - Migrant Voices, Local News: Insights on Bridging Community Needs with Media Content
arXiv:2604.16651v1 Announce Type: new Abstract: Research shows news consumption differs across demographics, yet little is known about non-mainstream audiences, especially in relation to local media. Our study addresses this gap by examining how French-speaking migrants in a mid-size European city engage with local news, and whether their needs are reflected in coverage. Eight community members p…
◆ 1.1#cs.cl - EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
arXiv:2604.16456v1 Announce Type: new Abstract: Real-time voice assistants must revise task state when users interrupt mid-response, but existing spoken-dialog benchmarks largely evaluate turn-based interaction and miss this failure mode. We introduce EchoChain, a controlled benchmark for evaluating full-duplex state-update reasoning under mid-speech interruptions. EchoChain identifies three recu…
◆ 1.1#cs.cl#cs.ai - Measuring Representation Robustness in Large Language Models for Geometry
arXiv:2604.16421v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representati…
◆ 1.4#cs.cl#cs.ai - QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning
arXiv:2604.16396v1 Announce Type: new Abstract: Islamic inheritance law (ilm al-mawarith) presents a challenging domain for evaluating large language models' structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasonin…
◆ 1.1#cs.cl - The impact of postediting on AI generative translation in Yemeni context: Translating literary prose by ChatGPT
arXiv:2604.16704v1 Announce Type: new Abstract: This study examines the role of artificial intelligence in translation, focusing on ChatGPT, specifically ChatGPT-4, and the extent to which human postediting is required in literary translation. A mixed-method approach was adopted, involving 30 professional translators who evaluated and postedited AI-generated translations of selected Arabic and En…
◆ 1.4#cs.cl#cs.ai - LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?
arXiv:2604.16382v1 Announce Type: new Abstract: Longitudinal NLP tasks require reasoning over temporally ordered text to detect persistence and change in human behavior and opinions. However, in-context learning with large language models struggles on tasks where models must integrate historical context, track evolving interactions, and handle rare change events. We introduce LiFT, a longitudinal…
◆ 1.1#cs.cl - GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution
arXiv:2604.16377v1 Announce Type: new Abstract: Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: 'Who (or which LLM) wrote this piece of code?' We present…
◆ 1.4#cs.cl#cs.cy - CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents
arXiv:2604.15802v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstru…
◆ 0.9#cs.cl - Why Fine-Tuning Encourages Hallucinations and How to Fix It
arXiv:2604.15574v1 Announce Type: new Abstract: Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using …
◆ 0.7#cs.cl#cs.ai - C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
arXiv:2604.15675v1 Announce Type: new Abstract: Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating…
◆ 0.9#cs.cl - HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
arXiv:2604.15648v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understan…
◆ 0.7#cs.cl#cs.cv - Applied Explainability for Large Language Models: A Comparative Study
arXiv:2604.15371v1 Announce Type: new Abstract: Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques…
◆ 0.9#cs.cl#cs.ai - Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch
arXiv:2604.15490v1 Announce Type: new Abstract: Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-…
◆ 0.7#cs.cl - PolicyBank: Evolving Policy Understanding for LLM Agents
arXiv:2604.15505v1 Announce Type: new Abstract: LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve it…
◆ 0.9#cs.cl#cs.ai - LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance
arXiv:2604.15589v1 Announce Type: new Abstract: Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors …
◆ 0.9#cs.cl#cs.ai - DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
arXiv:2604.15593v1 Announce Type: new Abstract: Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it fir…
◆ 0.7#cs.cl#cs.ai - LLMs Corrupt Your Documents When You Delegate
arXiv:2604.15597v1 Announce Type: new Abstract: Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems i…
◆ 0.9#cs.cl#cs.hc - FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
arXiv:2604.15646v1 Announce Type: new Abstract: Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decompos…
◆ 0.9#cs.cl - CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
arXiv:2604.15647v1 Announce Type: new Abstract: Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize C…
◆ 0.7#cs.cl - Preference Estimation via Opponent Modeling in Multi-Agent Negotiation
arXiv:2604.15687v1 Announce Type: new Abstract: Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLM…
◆ 0.9#cs.cl - Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
arXiv:2604.15701v1 Announce Type: new Abstract: The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explo…
◆ 0.7#cs.cl - Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
arXiv:2604.15503v1 Announce Type: new Abstract: Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To underst…
◆ 0.7#cs.cl - GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
arXiv:2604.15715v1 Announce Type: new Abstract: The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose G…
◆ 0.7#cs.cl#cs.ai - Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
arXiv:2604.15741v1 Announce Type: new Abstract: Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens…
◆ 0.9#cs.cl#cs.ai - Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand
arXiv:2604.15744v1 Announce Type: new Abstract: This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on u…
◆ 0.7#cs.cl - TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
arXiv:2604.15756v1 Announce Type: new Abstract: Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic sp…
◆ 0.7#cs.cl#cs.cv - Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
arXiv:2604.15771v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence…
◆ 0.7#cs.cl - MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
arXiv:2604.15774v1 Announce Type: new Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevoluti…
◆ 0.9#cs.cl - PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
arXiv:2604.15776v1 Announce Type: new Abstract: We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly availabl…
◆ 0.7#cs.cl#cs.ai - Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
arXiv:2604.15547v1 Announce Type: new Abstract: The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs' inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern …
◆ 0.9#cs.cl#cs.ai - Qwen3.5-Omni Technical Report
arXiv:2604.15804v1 Announce Type: new Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of …
◆ 0.9#cs.cl#eess.as - CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
arXiv:2604.15840v1 Announce Type: new Abstract: Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-…
◆ 0.9#cs.cl - Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language
arXiv:2604.15841v1 Announce Type: new Abstract: While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP task…
◆ 0.9#cs.cl - Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
arXiv:2604.15842v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task exe…
◆ 0.9#cs.cl - CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
arXiv:2604.15847v1 Announce Type: new Abstract: Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to …
◆ 0.7#cs.cl - DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition
arXiv:2604.15866v1 Announce Type: new Abstract: Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inc…
◆ 0.9#cs.cl#cs.ai - How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
arXiv:2604.15873v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role al…
◆ 0.9#cs.cl - MUSCAT: MUltilingual, SCientific ConversATion Benchmark
arXiv:2604.15929v1 Announce Type: new Abstract: The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: Handling mixed multilingual input, specific vocabulary, and code…
◆ 0.7#cs.cl - RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
arXiv:2604.15945v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typical…
◆ 0.9#cs.cl#cs.lg - SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification
arXiv:2604.15998v1 Announce Type: new Abstract: Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck, the…
◆ 0.7#cs.cl - AgentV-RL: Scaling Reward Modeling with Agentic Verifier
arXiv:2604.16004v1 Announce Type: new Abstract: Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge…
◆ 0.9#cs.cl#cs.ai - The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
arXiv:2604.15702v1 Announce Type: new Abstract: We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive…
◆ 0.9#cs.cl#cs.lg - A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
arXiv:2604.15789v1 Announce Type: new Abstract: As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-fre…
◆ 0.9#cs.cl - "Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations
arXiv:2604.15588v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabSc…
◆ 0.9#cs.cl#cs.ai - GroupDPO: Memory efficient Group-wise Direct Preference Optimization
arXiv:2604.15602v1 Announce Type: new Abstract: Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work ex…
◆ 0.9#cs.cl - Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
arXiv:2604.15706v1 Announce Type: new Abstract: Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using b…
◆ 0.7#cs.cl - Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
arXiv:2604.15607v1 Announce Type: new Abstract: AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 s…
◆ 0.7#cs.cl#cs.ai