Mining Distance-Based Outliers in Near Linear Time

Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
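Illustrative sketch (not taken from the paper): the abstract names three ingredients, a nested loop, data in random order, and a simple pruning rule. The Python code below shows one way they could fit together. The scoring function (distance to the k-th nearest neighbor), the choice to report the top n outliers, and the names top_outliers, k, n, and seed are assumptions made for this example only; the abstract does not specify them.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return up to n (score, point) pairs, highest score first.

    Assumed score: distance from a point to its k-th nearest neighbor.
    """
    data = list(points)
    # Random order means a typical point meets close neighbors early,
    # so the pruning test below tends to fire after only a few comparisons.
    random.Random(seed).shuffle(data)

    top = []      # min-heap of (score, point) holding the best n outliers so far
    cutoff = 0.0  # score of the weakest outlier in `top` once it holds n entries

    for i, x in enumerate(data):
        nearest = []  # max-heap (negated values) of the k smallest distances so far
        pruned = False
        for j, y in enumerate(data):
            if i == j:
                continue
            d = math.dist(x, y)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # -nearest[0] is an upper bound on the final score of x.
            # Pruning rule: once it drops below the cutoff, x cannot be
            # a top-n outlier, so the inner loop stops early.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, x))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, x))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
result = top_outliers(sample, k=2, n=1)
```

In this sketch the inner loop for a non-outlier usually ends as soon as k close neighbors have been seen, which matches the abstract's point that the time to process non-outliers need not grow with the size of the data set. The worst case remains quadratic.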

Data and Resources

BaySchwabacherKDD2003.pdf (PDF)

Additional Info

Field	Value
Maintainer	MARK SCHWABACHER
Last Updated	February 19, 2025, 08:16 (UTC)
Created	February 19, 2025, 08:16 (UTC)
accessLevel	public
accrualPeriodicity	irregular
bureauCode	{026:00}
catalog_@context	https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
catalog_@id	https://data.nasa.gov/data.json
catalog_conformsTo	https://project-open-data.cio.gov/v1.1/schema
catalog_describedBy	https://project-open-data.cio.gov/v1.1/schema/catalog.json
harvest_object_id	3aa0315f-4e25-4e55-b4f4-7e158fafb7dd
harvest_source_id	b37e5849-07d2-41cd-8bb6-c6e83fc98f2d
harvest_source_title	DNG Legacy Data
identifier	DASHLINK_191
issued	2010-09-22
landingPage	https://c3.nasa.gov/dashlink/resources/191/
modified	2020-01-29
programCode	{026:029}
publisher	Dashlink
resource-type	Dataset
source_datajson_identifier	true
source_hash	ecb0916407f4ce87b29fe420e0e72789579bd1bb45b11b16ab58fa335dc389a0
source_schema_version	1.1