Collector Overview

The Collector is the on-premises component of InsightIDR: a machine on your network running Rapid7 software that either polls Event Sources for data or receives data from them, and makes that data available for InsightIDR analysis. An Event Source represents a single device that sends logs to the Collector.

For example, if you have three firewalls, you will have one Event Source for each firewall in the Collector.

It is usually more efficient to deploy multiple Collectors throughout an environment than to break firewall rules or overload a single Collector.

You may need to distribute the bandwidth across your network if you have very high logging levels or if your network is geographically dispersed.

Advantages of the Collector

The Collector workflow has two main advantages over sending logs to InsightIDR directly: normalization and user attribution.

Normalization

Normalization transforms log data from multiple diverse sources into a common JSON format and extracts standard information such as hostnames, timestamps, and error levels. Normalization allows you to run more advanced queries on your endpoint logs and enhance your data visualization.
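As a rough illustration of the idea, the sketch below maps two differently shaped log lines onto one common JSON-style record. The line formats, the function names (normalize_syslog, normalize_keyvalue), and the output field names are all hypothetical, chosen for this example only; they are not InsightIDR's actual parsers or schema.

```python
import re
import shlex
from datetime import datetime, timezone

# Hypothetical syslog-style layout: "<ISO timestamp> <host> <LEVEL>: <message>".
SYSLOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<hostname>\S+) (?P<level>[A-Z]+): (?P<message>.*)$"
)


def _common_record(timestamp, hostname, level, message):
    """Build the shared record shape (field names are assumptions)."""
    # Convert every timestamp to UTC so records from different sources compare cleanly.
    utc_time = datetime.fromisoformat(timestamp).astimezone(timezone.utc)
    return {
        "timestamp": utc_time.isoformat(),
        "hostname": hostname.lower(),
        "level": level.lower(),
        "message": message,
    }


def normalize_syslog(line):
    """Parse the hypothetical syslog-style layout into the common record."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return _common_record(
        match["timestamp"], match["hostname"], match["level"], match["message"]
    )


def normalize_keyvalue(line):
    """Parse a hypothetical key=value layout into the common record."""
    # shlex.split keeps quoted values such as msg="link down" together.
    fields = dict(token.split("=", 1) for token in shlex.split(line))
    return _common_record(
        fields["time"], fields["host"], fields["severity"], fields["msg"]
    )


syslog_record = normalize_syslog("2024-05-01T14:00:00+02:00 FW01 WARN: link down")
keyvalue_record = normalize_keyvalue(
    'time=2024-05-01T12:00:00+00:00 host=fw01 severity=WARN msg="link down"'
)
```

Both inputs describe the same moment and device, so after normalization the two records are identical and can be queried with the same keys.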

User Attribution

User attribution correlates endpoint activity with the individual users who use that endpoint while logged into applications. Attribution provides a fuller picture of your security posture because user accounts are the most common targets for sophisticated attacks.

If you decide to use the Collector, endpoint information can take up to 5 minutes to appear in InsightIDR. Consider Custom Logs if real-time visibility of logs is a critical priority.

For InsightIDR to apply user attribution, the event source must be supported. InsightIDR must also have reliable data to recognize the asset by IP address and the user by the user field in the log data. These requirements are often met by the Insight Agent and a DHCP event source.
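To see why both pieces of data matter, consider the minimal sketch below. It assumes a simplified DHCP lease table that records which asset held an IP address during a time window, and an event that carries a source IP and an optional user field. The Lease class, the attribute function, and all field names are hypothetical; this is not how InsightIDR implements attribution internally.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Lease:
    """One hypothetical DHCP lease: which asset held an IP, and when."""
    ip: str
    asset: str
    start: datetime
    end: datetime


def asset_for_ip(leases, ip, when):
    """Return the asset that held `ip` at time `when`, or None if unknown."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when < lease.end:
            return lease.asset
    return None


def attribute(event, leases):
    """Attach asset and user to an event; flag whether both were resolved."""
    asset = asset_for_ip(leases, event["source_ip"], event["timestamp"])
    user = event.get("user") or None  # a missing or empty user field counts as unknown
    return {
        **event,
        "asset": asset,
        "user": user,
        "attributed": asset is not None and user is not None,
    }


# The same IP address belongs to different assets at different times.
leases = [
    Lease("10.0.0.5", "laptop-a", datetime(2024, 5, 1, 8), datetime(2024, 5, 1, 12)),
    Lease("10.0.0.5", "laptop-b", datetime(2024, 5, 1, 12), datetime(2024, 5, 1, 18)),
]
with_user = attribute(
    {"source_ip": "10.0.0.5", "timestamp": datetime(2024, 5, 1, 13), "user": "jdoe"},
    leases,
)
without_user = attribute(
    {"source_ip": "10.0.0.5", "timestamp": datetime(2024, 5, 1, 9)},
    leases,
)
```

In this sketch, the event without a user field still resolves to an asset but cannot be attributed, which mirrors the requirement that both the IP-to-asset mapping and the user field be reliable.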

Data Deduplication

When data is ingested, repetitive activity is processed and combined into a single entry. These combined entries contain information about the number of occurrences of the activity that was observed, and accumulate values like the total bytes transferred across the original events. This approach results in significant data storage savings and improves the overall search, dashboard, and reporting experience by making search queries execute faster. Detection capability remains unchanged as InsightIDR still retains unique firewall activity, DNS queries, and web proxy activity.

After deduplication has been applied to the DNS Query, Firewall Activity, and Web Proxy Activity event types, three new keys appear in the schemas:

  • observation_count - Shows how many times the same activity was found when processing the received data. InsightIDR's data collection techniques group multiple lines or events from a single event source together for processing, and deduplication occurs on these groups of events.
  • first_observed_time - The date and time of the first duplicated log record contained within the group of events.
  • last_observed_time - The date and time of the last duplicated log record contained within the group of events.

These fields are present only on events where the same activity occurred multiple times within the group of events. When InsightIDR performs detections and search queries on these deduplicated events, the observation_count field value is used to ensure activity counts are accurately represented in the results. To view the schemas that contain deduplicated data, navigate to the DNS Query, Firewall Activity, and Web Proxy Activity sections of the Keys to Use in Your Queries topic.
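The behavior described above can be approximated with the following sketch. It assumes that events arrive as dictionaries in one processing group, that "the same activity" means matching values for a chosen set of key fields, and that bytes is the value to accumulate. The deduplicate and total_observations functions and the event field names (other than observation_count, first_observed_time, and last_observed_time) are hypothetical, and the real matching rules may differ.

```python
from datetime import datetime


def deduplicate(events, key_fields=("source_ip", "destination_ip", "destination_port")):
    """Combine events in one group that share the same key field values."""
    groups = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        groups.setdefault(key, []).append(event)

    combined = []
    for group in groups.values():
        entry = dict(group[0])
        # Accumulate values such as bytes across the original events.
        entry["bytes"] = sum(event["bytes"] for event in group)
        # The three keys appear only when the activity repeated within the group.
        if len(group) > 1:
            times = [event["timestamp"] for event in group]
            entry["observation_count"] = len(group)
            entry["first_observed_time"] = min(times)
            entry["last_observed_time"] = max(times)
        combined.append(entry)
    return combined


def total_observations(entries):
    """Count original events; entries without observation_count count once."""
    return sum(entry.get("observation_count", 1) for entry in entries)


events = [
    {"source_ip": "10.0.0.5", "destination_ip": "8.8.8.8", "destination_port": 53,
     "bytes": 100, "timestamp": datetime(2024, 5, 1, 12, 0, 0)},
    {"source_ip": "10.0.0.5", "destination_ip": "8.8.8.8", "destination_port": 53,
     "bytes": 200, "timestamp": datetime(2024, 5, 1, 12, 0, 30)},
    {"source_ip": "10.0.0.7", "destination_ip": "1.1.1.1", "destination_port": 443,
     "bytes": 50, "timestamp": datetime(2024, 5, 1, 12, 0, 10)},
]
entries = deduplicate(events)
```

Summing observation_count (and defaulting to 1 where the key is absent) recovers the original event count, which is why counts in queries stay accurate even though fewer entries are stored.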

Account Requirements

When setting up the Collector, you should be aware that:

  • InsightIDR ingests data from existing sources in your environment. To pull data from these sources or push data to log aggregators, InsightIDR needs administrator access, from a Domain Admin account if possible.
  • You should treat your Collectors as you would any other valuable asset, because they store credentials for your event sources.
  • InsightIDR normalizes and attributes data on AWS but does not store credentials. The Collector strips raw, unnecessary logs in your environment to prevent storage of sensitive data, such as personally identifiable information, medical records, and employee, organization, or asset names.