CS 1: Unlocking our Collective Knowledge: LLMs for Data Extraction from Long-Form Documents

Patrick Bjornstad is a Systems Engineer / Data Scientist at Jet Propulsion Laboratory in the Systems Modeling, Analysis & Architectures group. With expertise in a range of topics including statistical modeling, machine/deep learning, software development, and data engineering, Patrick has been involved with a variety of projects, primarily supporting formulation work at JPL. Patrick earned a B.S. in Applied & Computational Mathematics and an M.S. in Applied Data Science at the University of Southern California (USC).
As the primary mode of communication between humans, natural language (often found in the form of text) is one of the most prevalent sources of information across all domains. From scholarly articles to industry reports, textual documentation pervades every facet of knowledge dissemination. This is especially true in the world of aerospace. While structured data formats may struggle to capture complex relationships, natural language excels by allowing for detailed explanations that a human can understand. However, the flexible, human-centered nature of text has traditionally made it difficult to incorporate into quantitative analyses, leaving potentially valuable insights and features hidden within the troves of documents collecting dust in various repositories.
Large Language Models (LLMs) are an emerging technology that can bridge the gap between the expressiveness of unstructured text and the practicality of structured data. Trained to predict the next most likely word following a sequence of text, LLMs built on large and diverse datasets must implicitly learn knowledge related to a variety of fields in order to perform prediction effectively. As a result, modern LLMs have the capability to interpret the underlying semantics of language in many different contexts, allowing them to digest long-form, domain-specific textual information in a fraction of the time that a human could. Among other things, this opens up the possibility of knowledge extraction: the transformation of unstructured textual knowledge to a structured format that is consistent, queryable, and amenable to being incorporated in future statistical or machine learning analyses.
Specifically, this work begins by highlighting the use of GPT-4 for categorizing NASA work contracts based on JPL’s organizational structure using textual descriptions of each contract’s work, allowing the lab to better understand how different divisions will be impacted by the increasingly outsourced work environment. Despite its simplicity, the task demonstrates the capability of LLMs to ingest unstructured text and produce structured results (categorical features for each contract indicating the JPL organization that the work would involve) useful for statistical analysis. Potential extensions to this proof of concept are then highlighted, such as the generation of knowledge graphs/ontologies to encode domain- and mission-specific information. Access to a consistent, structured graphical knowledge base would not only improve data-driven decision making in engineering contexts by exposing previously out-of-reach data artifacts to traditional analyses (e.g., numerical data extracted from text, or even graph embeddings which encode entities/nodes as vectors in a way that captures the entity’s relation to the overall structure of the graph), but could also accelerate the development of specialized capabilities like the mission Digital Twin (DT) by enabling access to a reliable, machine-readable database of mission and domain expertise.
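The categorization task described above can be illustrated with a minimal sketch. This is not the pipeline used in the talk: the labels in `CATEGORIES` are placeholders rather than JPL's actual organizations, the function names are hypothetical, and `fake_llm` is an offline stand-in for whatever call sends a prompt to a model such as GPT-4. The sketch only shows the general shape of the idea: constrain the model to a fixed label set, then validate its free-text reply so that every contract ends up with one categorical feature.

```python
from typing import Callable, Sequence

# Hypothetical organization labels; placeholders, not JPL's actual structure.
CATEGORIES = ("Division A", "Division B", "Division C")
UNKNOWN = "Unknown"


def build_prompt(description: str, categories: Sequence[str]) -> str:
    """Ask the model to answer with exactly one label from a fixed list."""
    options = "\n".join(f"- {c}" for c in categories)
    return (
        "Classify the contract below into exactly one organization.\n"
        f"Allowed labels:\n{options}\n"
        "Reply with the label only.\n\n"
        f"Contract description:\n{description}"
    )


def parse_label(reply: str, categories: Sequence[str]) -> str:
    """Map a free-text reply onto an allowed label, or UNKNOWN if none matches."""
    cleaned = reply.strip().strip(".\"'").casefold()
    for category in categories:
        if cleaned == category.casefold():
            return category
    return UNKNOWN


def classify_contracts(
    descriptions: Sequence[str],
    llm: Callable[[str], str],
    categories: Sequence[str] = CATEGORIES,
) -> list[dict[str, str]]:
    """Return one structured row per contract: its text plus a categorical label."""
    rows = []
    for text in descriptions:
        reply = llm(build_prompt(text, categories))
        rows.append({"description": text, "organization": parse_label(reply, categories)})
    return rows


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline (assumption)."""
    description = prompt.rsplit("Contract description:\n", 1)[-1].lower()
    if "thermal" in description:
        return "Division A."
    if "software" in description:
        return ' "division b" '
    return "I am not sure."


if __name__ == "__main__":
    contracts = ["Thermal vacuum testing services", "Flight software verification"]
    for row in classify_contracts(contracts, fake_llm):
        print(row)
```

Validating the reply against the allowed labels, rather than trusting it verbatim, keeps the resulting column consistent enough for downstream statistical analysis; replies that do not match are flagged as `Unknown` instead of silently introducing new categories.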
Session Materials: dataworks.test...