The Computer Society
Large Language Models as Pretrained Data Engineers
Pages: 20
Time to read: 50 mins
Language: English
This technical report discusses the integration of large language models (LLMs) into data engineering workflows, highlighting their potential to enhance data management tasks. The paper outlines an architectural framework that incorporates LLMs across three critical stages: data wrangling, analytical querying, and table augmentation for machine learning. It details how LLMs can simplify and optimize data preparation and transformation processes, extend querying capabilities, and improve the performance of data-centric tasks. The report also reviews current research and presents three developed systems: UniDM for data wrangling, DAIL-SQL for Text2SQL solutions, and SMARTFEAT for automating feature engineering. Furthermore, it addresses the challenges of integrating LLMs into complex data workflows and envisions future directions for LLM-assisted data engineering applications, including automated systems for efficient data preparation and enhanced querying interfaces for unstructured data.
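To make the analytical querying stage more concrete, the following is a minimal sketch of a Text2SQL loop: a prompt is assembled from a table schema and a natural-language question, a model returns SQL, and the query runs against a database. It is an illustration only and does not reproduce DAIL-SQL or any other system named in the report. The names `build_text2sql_prompt`, `answer_question`, `fake_llm`, and the `orders` table are hypothetical, and `fake_llm` is a stub that returns a fixed query in place of a real model call.

```python
import sqlite3
from typing import Callable, List, Tuple


def build_text2sql_prompt(schema: str, question: str) -> str:
    """Assemble a plain-text prompt from a table schema and a user question."""
    return (
        "Given the SQLite schema below, write one SQL query that answers the question.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "SQL:"
    )


def answer_question(
    conn: sqlite3.Connection,
    schema: str,
    question: str,
    complete: Callable[[str], str],
) -> List[Tuple]:
    """Ask the model for SQL, then execute it and return the rows."""
    prompt = build_text2sql_prompt(schema, question)
    sql = complete(prompt).strip()
    return conn.execute(sql).fetchall()


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (assumption): always
    # returns the same query, regardless of the prompt.
    return "SELECT COUNT(*) FROM orders WHERE amount > 100"


SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)"


def demo() -> List[Tuple]:
    # Build a small in-memory database so the sketch is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)",
        [(50.0,), (150.0,), (300.0,)],
    )
    try:
        return answer_question(conn, SCHEMA, "How many orders exceed 100?", fake_llm)
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```

In practice the `complete` callable would wrap a real model client, and generated SQL should be validated or run with read-only permissions before execution, since the model's output is untrusted input.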