LC-ATSR: A Transformer-Based Approach for Semantic Retrieval in Lakehouse Data Platforms

Abstract

This study proposes the Lakehouse Collaborative Adaptive Transformer Semantic Representation algorithm (LC-ATSR) to address the core challenge of semantic retrieval over unstructured data in integrated Lakehouse data platforms. The algorithm uses a transformer as its backbone and incorporates a two-dimensional adaptive mechanism, covering storage attributes and data types, to construct an integrated "preprocessing-representation-retrieval" framework. The core designs are: a storage attribute-enhanced attention mechanism that integrates Lake storage tags into the self-attention computation to adapt to heterogeneous storage characteristics; a lightweight semantic compression and multi-modal alignment module that reduces the vector dimension from 768 to 256 while constructing a unified semantic space; and a Lakehouse incremental semantic indexing mechanism that dynamically updates the index using LSH hashing. Experimental results show that LC-ATSR achieves a P@10 of 89.7% and an F1 score of 88.2%, significantly outperforming mainstream algorithms such as BERT, DPR, and RoBERTa. Single-retrieval latency is 18.3 ms, incremental index construction time is 68.4% lower than with the full-data approach, and accuracy fluctuates by only 3.2% in heterogeneous data scenarios. The retrieval system built on this algorithm achieves a 99.8% functional pass rate, a throughput of 286 QPS with 500 concurrent connections, and a response time of 21.3 ms, meeting the engineering requirements of enterprise data platforms and providing technical support for mining value from unstructured data.
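
The abstract states only that the index is updated dynamically using LSH hashing and gives no further detail. As a rough illustration of why such an index can absorb new or changed records without a full rebuild, the following is a minimal sketch of a random-hyperplane LSH index. It is not the paper's implementation: the class name `IncrementalLSHIndex`, the helper `cosine`, the 16-bit signature length, and the single-table, single-bucket lookup are all assumptions made for illustration. Only the 256-dimensional vector size is taken from the abstract.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IncrementalLSHIndex:
    """Hypothetical random-hyperplane LSH index that accepts vectors one at a time."""

    def __init__(self, dim=256, num_bits=16, seed=0):
        rng = random.Random(seed)
        # One random Gaussian hyperplane per signature bit (illustrative choice).
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(set)   # signature -> ids stored in that bucket
        self.vectors = {}                 # id -> stored vector

    def _signature(self, vec):
        """Pack the sign of each hyperplane projection into one integer."""
        bits = 0
        for plane in self.planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    def add(self, doc_id, vec):
        """Insert or update one vector; only the affected buckets are touched."""
        if doc_id in self.vectors:
            self.remove(doc_id)
        self.vectors[doc_id] = list(vec)
        self.buckets[self._signature(vec)].add(doc_id)

    def remove(self, doc_id):
        """Delete one vector and drop its bucket if it becomes empty."""
        vec = self.vectors.pop(doc_id)
        sig = self._signature(vec)
        self.buckets[sig].discard(doc_id)
        if not self.buckets[sig]:
            del self.buckets[sig]

    def query(self, vec, k=10):
        """Return up to k ids from the query's bucket, ranked by cosine similarity."""
        candidates = self.buckets.get(self._signature(vec), set())
        ranked = sorted(candidates,
                        key=lambda d: cosine(vec, self.vectors[d]),
                        reverse=True)
        return ranked[:k]
```

Each `add` or `remove` call costs one signature computation, so the work for an update scales with the number of changed records rather than with the size of the whole collection. A practical system would normally use several hash tables or multi-probe lookup to improve recall; the sketch omits both for brevity.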

Authors

  • Youfang Xu, School of Information Technology and Media, Hexi University, Zhangye, Gansu, 734000, China

DOI:

https://doi.org/10.31449/inf.v50i11.13673

Published

04/23/2026

How to Cite

Xu, Y. (2026). LC-ATSR: A Transformer-Based Approach for Semantic Retrieval in Lakehouse Data Platforms. Informatica, 50(11). https://doi.org/10.31449/inf.v50i11.13673