research · en · weight 1.2

arXiv cs.CL

When Debiasing Backfires: Counterintuitive Side Effects of Preprocessing-Based Stereotype Mitigation

AI·The research finds that preprocessing-based stereotype mitigation in NLP can backfire by increasing stereotyping or counter-stereotyping for some groups relative to neutral baselines.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·debiasing stereotype nlp

ICDAR 2026 HIPE-OCRepair Competition on LLM-Assisted OCR Post-Correction for Historical Documents

AI·The ICDAR 2026 HIPE-OCRepair competition evaluated LLM-assisted post-correction of noisy OCR from 17th-20th century multilingual (EN/FR/DE) historical newspapers and books. Four teams used approaches ranging from zero-shot prompting to fine-tuning; results show significant error reduction but recurring over-correction on low-noise inputs. A public dataset and evaluation framework were released.

arXiv cs.CL·arxiv.org·22h ago·1.7·Research·llm ocr post-correction

Tool-Making and Self-Evolving LLM Agents in Low-Latency Systems

AI·The paper proposes an agentic tool-making pipeline that compiles repeated SOP steps into validated versioned tools to reduce latency and improve reliability in production LLM agents.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·agents llm tools

Structured Pruning of Large Language Models via Power Transformation and Sign-Preserving Score Aggregation with Adaptive Feature Retention

AI·The paper proposes structured pruning for LLMs using power transformation and sign-preserving score aggregation with adaptive feature retention to address distribution mismatch issues.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·pruning llm structured

PLURAL: A Global Dataset for Value Alignment

AI·The paper introduces PLURAL, a large-scale value-focused preference dataset grounded in the Integrated Values Survey across 92 countries to improve LLM representation of diverse non-Western value systems.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·value-alignment llm dataset

LEXIC: Lightweight Eye-tracking eXtension via Injected Complexity

AI·LEXIC advances gaze-only reading-comprehension prediction on EyeBench using lightweight, language-model-free conditioning and injected complexity.

arXiv cs.CL·arxiv.org·22h ago·1.3·Research·eye-tracking comprehension neural

SQuaD-SQL: Efficient Text-to-SQL with Small Language Models via LLM-Guided Knowledge Distillation

AI·SQuaD-SQL uses LLM-guided synthetic data and LoRA fine-tuning to train 1.5B-parameter SLMs to reach 86.9% execution accuracy on WikiSQL, matching large models while requiring only one consumer GPU and delivering faster, lower-memory inference.

arXiv cs.CL·arxiv.org·22h ago·1.6·Research·text-to-sql small-language-models knowledge-distillation

Grounded Event Extraction from SEC 8-K Filings with a Fine-Grained Taxonomy

AI·The paper grounds event extraction from SEC 8-K filings in a fine-grained taxonomy, overcoming the coarseness of SEC item codes for market-moving disclosures.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·event-extraction fintech taxonomy

Echoes Across Vietnam's Highlands, Delta, and Coast: A Multilingual Corpus for Cham, Khmer, and Tay-Nung

AI·The CKTN multilingual corpus covers Cham, Khmer, and Tay-Nung, spoken across Vietnam's highlands, delta, and coast, to support NLP for under-resourced minority languages.

arXiv cs.CL·arxiv.org·22h ago·1.3·Release·multilingual corpus minority-languages

Prompt Compression via Activation Aggregation

AI·Prompt Compression via Activation Aggregation compresses task-relevant prompt information into a single activation vector for re-injection into the model.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·prompt-compression activation llm

When Synthetic Speech Is All You Have: Better Call GRPO

AI·LLM-based ASR in regulated domains like banking is limited by privacy constraints and the cost of collecting real speech; synthetic TTS data is a cost-effective substitute, but acoustic mismatch hinders supervised fine-tuning (SFT). Group Relative Policy Optimization (GRPO) applied solely to synthetic speech reduces WER by 40% relative to SFT (36.71% to 22.09%) and by 45% in the SFT-then-GRPO sequence. The gains come from improved behavioral calibration and audio attention rather than from improved representations.

arXiv cs.CL·arxiv.org·22h ago·1.8·Research·speech-recognition asr llm
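
The "40% relative" figure in the GRPO summary above can be checked with a short calculation. This is only a sketch: it assumes the two quoted WER values (36.71% and 22.09%) are the SFT and GRPO results respectively, and the helper name `relative_reduction` is hypothetical, not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# WER values quoted in the summary (assumed: SFT first, GRPO second).
sft_wer = 36.71
grpo_wer = 22.09

reduction = relative_reduction(sft_wer, grpo_wer)
print(f"{reduction:.1%}")  # prints 39.8%
```

The absolute drop is 14.62 WER points; dividing by the SFT baseline gives roughly 39.8%, which the summary rounds to 40%.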

It Takes a MAESTRO To Prune Bad Experts

AI·The paper presents a structured pruning method for sparsely activated MoE models that removes bad experts, addressing the memory bottleneck of the full expert bank while preserving inference efficiency.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·moe pruning llm

When the Judge Changes, So Does the Measurement: Auditing LLM-as-Judge Reliability

AI·LLM-as-judge scores fluctuate across evaluators even with fixed candidate responses; the paper treats this as a measurement-validity issue. Across four datasets, scaling Qwen3 from 1.7B to 32B parameters yields only limited gains between adjacent sizes, while MiniMax M2-to-M2.7 API upgrades show none. Stronger judges reduce but do not eliminate position and verbosity biases, and repeated juries offer little benefit under correlated errors.

arXiv cs.CL·arxiv.org·22h ago·1.9·Research·llm-as-judge evaluation-reliability bias

When Implausible Tokens Get Reinforced: Tail-Aware Credit Calibration for LLM Reinforcement Learning

AI·The paper identifies a failure mode in critic-free RL methods for LLMs where implausible tokens receive uniform credit, and proposes tail-aware credit calibration to improve reinforcement learning.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·llm reinforcement-learning calibration

A Reliability Assessment of LALM Audio Judges for Full-Duplex Voice Agents

AI·The paper empirically assesses the reliability of Gemini models as audio judges for full-duplex voice agents by scoring stereo waveforms, validated against human calibration.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·audio llm agents

Can We Trust LLM's Logic? Quantifying Uncertainty, Coherence, and Robustness via a Graph-Based Framework

AI·The paper introduces a graph-based framework to quantify uncertainty, coherence, and robustness in LLM reasoning, addressing gaps in decoding strategies like Self-Consistency that only check final-answer agreement.

arXiv cs.CL·arxiv.org·22h ago·1.6·Research·llm reasoning graph

Best-of-$N$ TTS Evaluation is Confounded by ASR Family Alignment

AI·Best-of-N TTS evaluation is confounded by ASR family alignment, where verifier quality depends on the judging ASR model family.

arXiv cs.CL·arxiv.org·22h ago·1.3·Research·tts evaluation asr

Hidden Decoding at Scale: Latent Computation Scaling for Large Language Models

AI·Hidden Decoding expands each token into n independent streams during continued pretraining of a fixed Transformer backbone, using separate embedding tables and retaining intermediate KV caches for latent computation; Stream-Factorized Attention limits cross-stream mixing to reduce costs. Experiments show frontier 80B and 617B MoE models improve on all benchmarks over matched baselines.

arXiv cs.CL·arxiv.org·22h ago·1.8·Research·llm scaling continued-pretraining

Diarization-Guided Qwen-ASR Adaptation for Multilingual Two-Speaker Conversational Speech

AI·Researchers developed SQZ-Qwen-ASR-1.7B for MLC-SLM 2026 Task 1. The system combines a modular speaker diarization front-end (VAD, CAMPPlus embeddings, spectral clustering, RTTM segmentation) with Qwen3-ASR-1.7B adapted via full supervised fine-tuning, LoRA on TTS-generated synthetic speech, and GRPO reinforcement learning to lower tcpMER (23.70 on the dev set, 17.97 on the eval set).

arXiv cs.CL·arxiv.org·22h ago·1.6·Research·speech-recognition speaker-diarization multilingual

From Solvers to Research: Large Language Model-Driven Formal Mathematics at the Research Frontier

AI·This position paper reviews LLM-driven formal theorem provers and argues that current systems function mainly as solvers for well-defined problems, not as research agents capable of discovering new theorems or resolving open conjectures at the frontier. It identifies key limitations in datasets, exploration, tools, and collaboration and proposes a roadmap for AI4Math systems to support genuine mathematical research.

arXiv cs.CL·arxiv.org·22h ago·1.8·Research·llm formal-methods ai4math

DeepSearch-World: Self-Distillation for Deep Search Agents in a Verifiable Environment

AI·DeepSearch-World is a self-distillation framework for web search agents that uses self-generated experience for training in verifiable environments.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·agents llm training

How Do I Know What to Say Next? Barenholtz's Autogenerative Theory as an Enrichment of Harrisean Integrationism

AI·The paper enriches Roy Harris's Integrationism theory with Barenholtz's Autogenerative Theory to address gaps in computational language approaches.

arXiv cs.CL·arxiv.org·22h ago·1.2·Research·linguistics theory integrationism

Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration

AI·The paper presents a cost-efficient human-LLM collaborative annotation framework to construct the EspanSt stereotype dataset for non-English languages and underrepresented cultures.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·stereotype llm annotation

A Multi-cluster Boundary Learning Method for Out-of-Scope Intent Detection via MiniLM Embedding

AI·The paper introduces a multi-cluster boundary learning method using MiniLM embeddings for detecting out-of-scope intents in human-machine interaction systems.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·intent-detection oos minilm

Hallucination Self-Play: Bootstrapping Reinforced Detector via Evolved Generator

AI·The paper introduces Hallucination Self-Play, a bootstrapping approach that uses an evolved generator to reinforce a detector for identifying faithfulness hallucinations in LLMs.

arXiv cs.CL·arxiv.org·22h ago·1.6·Research·hallucination llm self-play

From Execution to Education: A Bloom-Aligned Framework for Measuring Educational Control in LLMs

AI·The paper introduces a Bloom-aligned framework to measure educational control in LLMs, focusing on preserving instructional intent while aligning cognitive demand with learning objectives in programming tasks.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·llm education bloom

MASTE: A Multi-Agent Pipeline for Zero-Shot Aspect Sentiment Triplet Extraction

AI·MASTE is a multi-agent pipeline for zero-shot aspect sentiment triplet extraction.

arXiv cs.CL·arxiv.org·22h ago·1.7·Research·aste multi-agent llm

XALPHA: A Memory-Driven AI Quant Researcher for Hypothesis-to-Code Alpha Discovery

AI·XALPHA is a memory-driven AI quant researcher that uses multi-source research memory integrating external financial reports and prior discovery feedback, with Macro Brain for theme planning and archetype selection, Micro Brain for hypothesis-to-code translation and tri-alignment verification, and Cross Brain for feedback consolidation, enabling closed-loop continuous alpha discovery that outperforms baselines on CSI300.

arXiv cs.CL·arxiv.org·22h ago·1.7·Research·llm quant ai-research

Two Axes of LLM Abstention: Answer Correctness and Question Answerability

AI·Across five instruction-tuned LLMs, ordinary answer confidence separates correct from wrong answers on answerable questions but fails to distinguish unanswerable questions (e.g., false-premise ones in CREPE), while a linear probe on hidden states shows the opposite pattern. This reveals two distinct abstention axes that remain separable even at 14B scale.

arXiv cs.CL·arxiv.org·22h ago·1.8·Research·llm abstention llm-safety

DominoTree: Conditional Tree-Structured Drafting with Domino for Speculative Decoding

AI·DominoTree is a training-free best-first tree draft builder for speculative decoding that scores each candidate node by re-applying Domino's GRU-based causal correction along its specific root-to-node path. It restricts per-node correction to top-M candidates for efficiency and delivers up to 6.6x speedup with the highest mean accept length of any tested method.

arXiv cs.CL·arxiv.org·22h ago·1.8·Research·speculative-decoding drafting llm-inference

Unveiling Public Opinion: A Study of Sentiment Analysis Using LSTM and Traditional Models

AI·The study applies LSTM and traditional models to analyze public sentiment on social media platforms like Twitter regarding real-time events and issues.

arXiv cs.CL·arxiv.org·22h ago·1.3·Research·sentiment-analysis lstm nlp

What LLM Forecasters Know but Don't Say: Probing Internal Representations for Calibration and Faithfulness

AI·The paper probes internal representations of Eternis-Forecaster 8B for forecasting, training a representation-pooling method to assess calibration and faithfulness of internal CoT reasoning.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·llm forecasting calibration

UltraX: Refining Pre-Training Data at Scale with Adaptive Programmatic Editing

AI·The paper proposes UltraX, a method for refining large-scale pre-training data using adaptive programmatic editing to improve LLM quality when scaling data yields diminishing returns.

arXiv cs.CL·arxiv.org·22h ago·1.6·Research·llm data pre-training

Holographic Neural PCFG for Unsupervised Parsing

AI·Holographic Neural PCFG induces latent constituency trees from raw text using holographic memory and neural rule scoring for unsupervised parsing.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·parsing unsupervised pcfg

COBART: Controlled, Optimized, Bidirectional and Auto-Regressive Transformer for Ad Headline Generation

AI·COBART uses a bidirectional auto-regressive transformer for optimized ad headline generation with multi-objective control over quality and CTR.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·nlp generation advertising

COALA: Robust Contextualized Speech-augmented Language Modeling for ASR via Contrastive Regularizer and Biasing Score Estimation

AI·COALA enhances contextualized SLMs for multi-entity ASR via contrastive regularizer and biasing score estimation to handle domain-specific entities robustly.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·asr nlp language-model

TypeProbe: Recovering Type Representations from Hidden States of Pre-trained Code Models

AI·TypeProbe recovers type representations from hidden states of pretrained code models using parallel Java and Python datasets.

arXiv cs.CL·arxiv.org·22h ago·1.4·Research·code-model type probe

Large-Language-Models-as-a-Judge in Theory-Agnostic Adaptive Metric-Alignment for Prototypical Networks in Personality Recognition

AI·Large-Language-Models-as-a-Judge enable theory-agnostic adaptive metric-alignment for prototypical networks in personality recognition.

arXiv cs.CL·arxiv.org·22h ago·1.3·Research·personality prototype alignment

Detecting Ladder Logic Bombs in IEC 61131-3 PLC Programs using ESBMC-PLC+: A Formal Verification Approach with Trigger Synthesis

AI·A formal verification method using ESBMC-PLC+ detects Ladder Logic Bombs in IEC 61131-3 PLC programs by focusing on function-block bodies, where malicious code hides.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·formal-verification plc security

Cross-seed explainability using Procrustes-conditioned Joint End-to-end Top-K Sparse Autoencoders

AI·Procrustes-conditioned Joint End-to-end Top-K Sparse Autoencoders extract cross-seed universal features from independently trained BERT models to address dictionary learning misalignment in mechanistic interpretability.

arXiv cs.CL·arxiv.org·22h ago·1.5·Research·interpretability saes bert