Sparse Machine Learning Methods for Understanding Large Text Corpora

Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data that offer a high degree of interpretability at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; and (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors of runway incursions and other drivers of aviation safety incidents.
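To make ingredient (a) concrete, the following is a minimal sketch of summarization by sparse regression. It is not the authors' implementation: the six toy documents, the labels, the penalty value `lam`, and all function names are assumptions made for illustration, and the solver is a plain coordinate-descent LASSO on raw term counts. Documents about the topic of interest are labeled +1 and the rest -1; the few terms that keep a positive weight under the L1 penalty serve as the summary.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_matrix(docs):
    """Return a dense term-count matrix (list of rows) and the sorted vocabulary."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: j for j, t in enumerate(vocab)}
    X = [[0.0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for term, count in Counter(tokenize(d)).items():
            X[i][index[term]] = float(count)
    return X, vocab


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X w - b||^2 + lam*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = sum(y) / n
    resid = [y[i] - b for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Unpenalized intercept: absorb the mean of the current residual.
        shift = sum(resid) / n
        b += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes term j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w, b


def summarize(vocab, w, tol=1e-6):
    """Terms with a positive weight, strongest first."""
    kept = [(wj, t) for t, wj in zip(vocab, w) if wj > tol]
    return [t for wj, t in sorted(kept, reverse=True)]


# Hypothetical toy corpus (NOT real ASRS reports).
docs = [
    "aircraft crossed runway without clearance",
    "entered active runway past hold short line",
    "taxi clearance misheard crossed runway",
    "turbulence during cruise altitude deviation",
    "altitude deviation after autopilot disconnect",
    "cruise checklist interrupted by turbulence",
]
labels = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # +1 = about the topic of interest

X, vocab = build_matrix(docs)
weights, bias = lasso(X, labels, lam=0.35)
summary = summarize(vocab, weights)
print(summary)
```

With this toy corpus and penalty, the only term that survives is `runway`; lowering `lam` admits more terms into the summary. Swapping the labels so that they mark membership in one of two corpora gives a rough analogue of the comparative summarization in ingredient (b).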

Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.

Additional Info

Field Value
Maintainer Ashok Srivastava
Last Updated February 19, 2025, 00:44 (UTC)
Created February 19, 2025, 00:44 (UTC)
accessLevel public
accrualPeriodicity irregular
bureauCode {026:00}
catalog_@context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
catalog_@id https://data.nasa.gov/data.json
catalog_conformsTo https://project-open-data.cio.gov/v1.1/schema
catalog_describedBy https://project-open-data.cio.gov/v1.1/schema/catalog.json
harvest_object_id 8aeca056-85ad-4c28-91dd-474db5602c69
harvest_source_id b37e5849-07d2-41cd-8bb6-c6e83fc98f2d
harvest_source_title DNG Legacy Data
identifier DASHLINK_513
issued 2012-01-27
landingPage https://c3.nasa.gov/dashlink/resources/513/
modified 2020-01-29
programCode {026:029}
publisher Dashlink
resource-type Dataset
source_datajson_identifier true
source_hash c248811aa04273959675180d256b856c623ee96d5a16b202c50f62494df01cf7
source_schema_version 1.1