Troubleshoot an orchestrator

You may run into orchestrator issues during installation, activation, or general use. This article contains common troubleshooting steps you can take to resolve those issues.

General orchestrator troubleshooting

Once the orchestrator is installed, the first step in all orchestrator troubleshooting scenarios should be running the orch-diagnostics command. This command runs a script that executes a series of tests on the orchestrator and its environment. In many situations, this command will identify the issue.

⚠️

Traceroute failures are not fatal

The orch-diagnostics command attempts to perform a traceroute to help isolate connectivity-related problems. However, many environments block traceroute. As a result, if the orch-diagnostics script passes every test except traceroute, treat the run as fully successful.

Installation

If you’re having issues installing an orchestrator, first verify that the system the orchestrator is installed on meets the minimum system requirements.

You should also be using the latest version of one of the following virtualization solutions:

Hyper-V
VirtualBox
VMware

If your system meets all of the requirements, and you still can’t install the orchestrator, send the logs from /var/log/rapid7-orchestrator.log to our support team for assistance.

Activation

After successfully installing an orchestrator, you receive a key that you must use to activate the orchestrator. If you’re having issues with your key, here are the likely scenarios and how to resolve them.

I didn’t get an activation key

If your orchestrator installation wasn’t successful or the installed orchestrator can’t communicate with the Command Platform (Insight Platform), you won’t get an activation key, so check that your installation meets these requirements:

The orchestrator has a unique name. You can’t have more than one orchestrator with the same name.
Your environment meets the network requirements. If not, the orchestrator can’t communicate with the Command Platform (Insight Platform) to generate your activation key.

If a key still does not generate for you after verifying that your orchestrator meets these requirements, contact support and provide your orchestrator details.

I need to retrieve my activation key

If the orchestrator installation was successful and you received an activation key, but you weren’t able to copy it for some reason, you can retrieve your key using the secure shell (SSH) protocol to access your orchestrator’s virtual machine (VM) and print the activation key.

Once you have access to your orchestrator with SSH, use the orch-print-activation command to get your key.

I can’t copy or paste my activation key

Some VM solutions make copying and pasting difficult, so if you can’t copy the key, you can download the activation key as a .txt file and copy the key from the file instead:

Run orch-print-activation > ~/activation.txt in your VM’s terminal window.
Copy the activation.txt file to your desktop or local machine.
Open activation.txt on your desktop.
Copy the key.
Try to activate the orchestrator again.
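
For step 2, one option is to pull the file over SSH from your local machine. The following is a minimal sketch, assuming the VM accepts SSH connections and that scp is available locally. The helper name fetch_activation_file, the user name, and the host are hypothetical placeholders and are not part of the orchestrator tooling.

```shell
#!/bin/sh
# fetch_activation_file USER@HOST [DEST_DIR]
# Hypothetical helper: copies activation.txt from the home directory of the
# SSH user on the orchestrator VM to DEST_DIR (default: current directory).
fetch_activation_file() {
    remote=$1
    dest=${2:-.}
    # A relative remote path is resolved against the remote user's home directory.
    scp "${remote}:activation.txt" "$dest"
}
```

For example, fetch_activation_file admin@192.0.2.10 ~/Desktop would copy the file to your desktop, where admin and 192.0.2.10 stand in for your own SSH user and VM address.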

My activation key doesn’t work

If you successfully receive an activation key but find that submitting it to your orchestrator in Automation (InsightConnect) fails, you may have one of these issues:

Copy and paste failure: Sometimes, additional or non-printing terminal characters end up in the clipboard when you copy your activation key. Check that you have captured the entire key and that no extra characters are present. The activation key should not include space characters or escape characters such as \n for newline. Most text editors have an option to show non-printing characters to help you with this.
Activation key reuse: Activation keys are single-use only, so you can’t reuse one after you’ve successfully used it to activate an orchestrator. Deleting the orchestrator that the key is associated with does not free the key; it only makes the key invalid. Instead, start a fresh installation using a new key.
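
If you saved the key to a file, a quick check for stray characters can be sketched in shell. This is an illustrative helper, not an official tool: key_is_clean is a hypothetical name, and the sketch assumes the key consists only of visible ASCII characters on a single line.

```shell
#!/bin/sh
# key_is_clean FILE
# Succeeds only if FILE holds exactly one line made of visible ASCII
# characters (no spaces, tabs, carriage returns, or control characters).
key_is_clean() {
    file=$1
    # Anything other than exactly one line means the key is empty or a
    # newline slipped into it.
    lines=$(grep -c '' "$file")
    [ "$lines" -eq 1 ] || return 1
    # In the C locale, [:graph:] covers visible ASCII only, so any other
    # byte (space, tab, \r, escape sequences) is flagged.
    if LC_ALL=C grep -q '[^[:graph:]]' "$file"; then
        return 1
    fi
    return 0
}
```

For example, key_is_clean ~/activation.txt && echo "key looks clean" prints the message only when no extra characters were found.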

If all else fails and you still can’t activate your orchestrator, you may need to reset it. Resetting an orchestrator allows you to reuse an existing key, but there can be adverse consequences to this course of action, such as credential loss, so we don’t recommend doing this unless you’ve been advised to by a support representative.

Orchestrator

In this section we cover common orchestrator issues and solutions.

You can go to Settings > Orchestrators to see if any of your orchestrators have warnings or errors, or have stopped running. Even healthy orchestrators may have problems due to CPU, memory, or storage usage. To keep orchestrators running smoothly, routinely check orchestrator health.

No orchestrator connection due to disabled DHCP

DHCP should be enabled in Ubuntu unless your organization specifically disabled it. If DHCP is disabled, your machine will not have an IPv4 address and the orchestrator will not be able to communicate as needed.

To remediate DHCP issues, first check if DHCP is enabled on your network:

In a terminal window, run ifconfig.
In the output, look for a line that starts with inet for the ens32 interface. If this line is missing and you only see lines beginning with inet6, your network likely has DHCP disabled.
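
The check in step 2 can be sketched as a small shell helper. has_ipv4 is a hypothetical name, and ens32 is the interface name used in this article; yours may differ.

```shell
#!/bin/sh
# has_ipv4
# Reads ifconfig output on stdin and succeeds if it contains an IPv4
# ("inet") line. Requiring whitespace after "inet" excludes "inet6" lines.
has_ipv4() {
    grep -Eq '^[[:space:]]*inet[[:space:]]'
}
```

For example, ifconfig ens32 | has_ipv4 || echo "no IPv4 address on ens32" prints the message when only inet6 lines are present.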

If your organization has intentionally disabled DHCP, you will need to configure a static IP address for the orchestrator.

My orchestrator is running slowly

If it seems like your orchestrator is running slowly, it may be low on memory or disk space. Examine the resource utilization of the orchestrator and ensure it’s healthy. You should also ensure you’re running the latest version of the Orchestrator. To do this, open a terminal window and run:

apt-get install --only-upgrade rapid7-orchestrator on an Ubuntu machine
yum update rapid7-orchestrator on a Red Hat Enterprise Linux (RHEL) machine

My orchestrator is running out of disk space

Your orchestrator may be heavily trafficked and running short on disk space. There are a few easy ways to fix this.

Stop the orchestrator process before taking any of these steps:

Inspect your containers with docker ps -a. If you see containers that map to plugins you’re certain you no longer use, find the container ID of each one and stop the container. You can then run a prune to remove those containers and reclaim some space.
Run a docker prune by following the instructions: https://docs.docker.com/engine/reference/commandline/system_prune/
Ensure that log rotation and syslog settings are managing log size correctly, and tune them if needed. You can find the location of the rsyslog and logrotate settings in our orchestrator files reference information.
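
To judge how urgent the cleanup is and to preview what a prune would remove, something like the following sketch may help. disk_used_percent and list_exited_containers are hypothetical helper names. The sketch assumes a POSIX df and that Docker is run through sudo, as elsewhere in this article.

```shell
#!/bin/sh
# disk_used_percent PATH
# Prints the used percentage (digits only) of the filesystem holding PATH.
disk_used_percent() {
    # df -P keeps each filesystem on one line; column 5 is the capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# list_exited_containers
# Lists containers that have already exited; a container prune removes
# stopped containers such as these.
list_exited_containers() {
    sudo docker ps -a --filter "status=exited"
}
```

For example, disk_used_percent / prints the usage of the root filesystem as a plain number, which you can compare before and after the cleanup.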

I’ve been instructed to reset my orchestrator

Resetting an orchestrator is a fairly simple process, but it will impact any existing workflows you’re using, and invalidate any credentials you’ve entered into the system. That’s why we don’t recommend you reset your orchestrator unless a support representative advises you to do so. For more information about how credentials work, take a look at the Orchestrator credentials section of our Rapid7 Orchestrator (Insight Orchestrator) overview article.

If a support representative instructs you to reset your orchestrator, you will be required to run the following script:

sudo /opt/rapid7/orchestrator/bin/reset-orchestrator.sh

Here’s what will happen after you’ve run the reset script:

Your orchestrator installation is effectively deactivated, though it will still appear in Automation (InsightConnect).
Any in-flight jobs that were not done processing will be in a hung, incomplete state. You must cancel these jobs manually to clear them.
Your existing workflows will continue to run, and will continue to generate jobs (for those using API, SIEM (InsightIDR), or Vulnerability Management (InsightVM) trigger types). If these workflows are configured to use your reset orchestrator, these jobs will also eventually enter a hung state once they hit an Action Step using your orchestrator.
Any credentials you entered will potentially be invalidated and you’ll need to reenter them when you update your workflows.
Your orchestrator will generate a new activation key, as well as a fresh set of public and private keys for managing credential encryption and for signing requests. For more information on these processes, take a look at the Orchestrator-to-cloud communication section in the Rapid7 Orchestrator (Insight Orchestrator) overview.

My Orchestrator log shows repeated TLS handshake timeout errors

If you use AWS Network Firewall or certain versions of Suricata 7.x, you may be observing an issue that causes communication between the Orchestrator and the cloud to fail. This can result in failures in network operations, such as docker pull.

To resolve this issue, update the AWS Network Firewall rules. For more information, read the AWS Network Firewall documentation: docs.aws.amazon.com/network-firewall/latest/developerguide/rule-group-stateful-creating.html

In your AWS Network Firewall configuration, create a stateful rule group that includes these 2 rules:

A rule that allows Docker to properly pull plugins:


pass http $HOME_NET any -> $EXTERNAL_NET any (http.host; dotprefix; content:".<REGION>.plugins.connect.insight.rapid7.com"; nocase; endswith; msg:"matching HTTP allowlisted FQDNs"; flow:established; sid:3;)
pass tls $HOME_NET any -> $EXTERNAL_NET any (ssl_state:client_hello; tls.sni; bsize:>0; dotprefix; content:".<REGION>.plugins.connect.insight.rapid7.com"; nocase; endswith; msg:"matching TLS allowlisted FQDNs"; flow:to_server; sid:4;)

A rule that allows the Orchestrator to communicate with the Command Platform:


pass http $HOME_NET any -> $EXTERNAL_NET any (http.host; dotprefix; content:".<REGION>.api.connect.insight.rapid7.com"; nocase; endswith; msg:"matching HTTP allowlisted FQDNs"; flow:established; sid:5;)
pass tls $HOME_NET any -> $EXTERNAL_NET any (ssl_state:client_hello; tls.sni; bsize:>0; dotprefix; content:".<REGION>.api.connect.insight.rapid7.com"; nocase; endswith; msg:"matching TLS allowlisted FQDNs"; flow:to_server; sid:6;)

Adding these firewall rules allows requests to the Command Platform to carry the larger ClientHello message required by the SSL cipher suite that the Orchestrator uses, which resolves the handshake timeout errors.
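
The rules above contain a <REGION> placeholder, which presumably needs to be replaced with the region that applies to your account before you load the rules. As a minimal sketch, assuming you saved the rules to a local template file, the following hypothetical helper performs that substitution. The name fill_region, the file names, and the sample value us are illustrative assumptions only.

```shell
#!/bin/sh
# fill_region REGION
# Reads rule text on stdin and replaces every literal "<REGION>"
# placeholder with REGION, writing the result to stdout.
fill_region() {
    region=$1
    # Accept only letters, digits, and hyphens so the value cannot break
    # the sed expression or the rule syntax.
    case $region in
        ''|*[!A-Za-z0-9-]*)
            echo "fill_region: unexpected region value: $region" >&2
            return 1
            ;;
    esac
    sed "s/<REGION>/${region}/g"
}
```

For example, fill_region us < rules.template > rules.txt writes a copy of the rules with the placeholder filled in, where us, rules.template, and rules.txt are stand-ins for your own values.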

If you use Suricata, consult the Suricata documentation to learn how to add the stateful rules above.

Automated workflow

In this section we outline some common orchestrator problems that cause issues with automated workflows or jobs and provide steps for how to resolve them.

My workflows aren’t completing, but I see no errors

If your automated workflows or jobs aren’t completing, but you don’t see any errors, you may have an issue with a trigger. Take a look at troubleshooting information regarding triggers not creating any automation to see if that solves your issue. If not, here are some other things to consider:

Have your credentials or permissions changed? If so, the connection between your third party service providers and your orchestrator may have been broken.
Have any of your workflows been disabled?
Is there a logical issue in the makeup of your workflow?

If you’re seeing “hung jobs,” that is, automations that are created but never finish, it’s helpful to establish a baseline of information:

When did the issues start?
How long had the automation been working prior to this?
Are all automations hanging, or do some complete and some not?
Are automations hanging in the same spot, or do they hang at different points of time?

With that information compiled, you’re well prepared to reach out to support for advice and assistance.

ℹ️

Hung jobs due to request timeouts

A very common cause of hung jobs is a Python plugin that makes a RESTful call with the requests library without specifying a timeout value. If there’s no timeout value, the request can hang for a long time, or even indefinitely, because of the underlying nature of the Python and OS networking stacks. We recommend you specify both a connection and a request timeout to prevent hung jobs. See our Python 2 or 3 Script documentation for more details.

I set up a new trigger, but it’s not creating any automations

First, find the orchestrator container id associated with the trigger.

ℹ️

Find the container ID for a trigger

If you’re having a hard time locating the orchestrator container for the problematic trigger, you can narrow down your search by getting the trigger ID from the workflow you’re working on, then use grep commands to isolate it and identify its container.

Once you have the container ID, use the following command to grab associated logs for further troubleshooting: sudo docker logs -f <Table.Trigger container id>

These logs may tell you enough about the issue to understand what’s going wrong. For example, it is common for a trigger to fail at startup because of incorrect credentials. If you’re still unable to determine the issue, you can provide the logs to a support representative for troubleshooting assistance.
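
If you need to send the logs to support, saving them to a file is usually easier than copying from the terminal. The following is a sketch under the assumption that Docker is run through sudo, as in the command above. save_container_logs is a hypothetical helper name, and it omits -f so that the command exits after printing the existing logs.

```shell
#!/bin/sh
# save_container_logs CONTAINER_ID OUTFILE
# Writes the container's existing logs to OUTFILE.
save_container_logs() {
    container_id=$1
    outfile=$2
    # docker logs replays the container's stdout and stderr, so capture both.
    sudo docker logs "$container_id" > "$outfile" 2>&1
}
```

For example, save_container_logs <container id> trigger.log writes the logs to trigger.log in the current directory.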

A trigger isn’t working or an action is failing

If a trigger isn’t working or an action is failing, the best way to debug the issue is to get the docker container ID of the orchestrator to get logs and manage or stop problematic processes.

To find an orchestrator’s docker container ID:

Determine the plugin of the container you’re looking for. For example, rapid7/jira/1.0.0.
Run this sample command to list all matching containers, replacing X with the plugin’s name: sudo docker ps -a | grep X (for example, sudo docker ps -a | grep jira).

The container ID is the left-most column in the output of the command. You can continue to use grep to further isolate specific containers, according to your level of comfort with grep.

If you’re specifically looking for a trigger, it can be helpful to run this command: sudo docker ps -a | grep X | grep trigger

This will further scope the lookup to only triggers, which is a common debugging process.
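
The lookup described above can be wrapped in a small helper that prints only the container IDs. This is a sketch, not an official tool: container_ids is a hypothetical name, and it relies only on the container ID being the left-most column of the docker ps -a output.

```shell
#!/bin/sh
# container_ids PLUGIN [EXTRA]
# Reads `docker ps -a` output on stdin and prints the first column
# (the container ID) of every line that mentions PLUGIN and, if given,
# EXTRA (for example "trigger"). Matching is case-insensitive.
container_ids() {
    plugin=$1
    extra=${2:-}
    if [ -n "$extra" ]; then
        grep -F -i -- "$plugin" | grep -F -i -- "$extra" | awk '{ print $1 }'
    else
        grep -F -i -- "$plugin" | awk '{ print $1 }'
    fi
}
```

For example, sudo docker ps -a | container_ids jira trigger prints the IDs of the Jira trigger containers, one per line.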

My automations were fine, now they’re erroring a lot

If you suddenly see failing automations on the platform, consider these potential issues and troubleshooting steps. If you still can’t solve the problem, reach out to our support team to help you diagnose the problem.

Third party downtime

Sometimes a third party is partially or fully down. Unfortunately, we can’t guarantee the availability of third party systems. But if you start seeing connection failures, timeouts, or other “failed to talk to” style issues, you can reach out to the third party service to inquire about the health of their product and potential next steps.

Continue on failure

Automation (InsightConnect) provides a Continue on Failure feature so that you can continue to execute a workflow even if part of it fails. This allows you to build robust processes that anticipate expected failures and provide workarounds.

However, sometimes workflows are built to rely on information provided by Continue on Failure steps, and the workflow still fails. For example, if step A allows you to continue on failure, but step B requires the output of step A, step B is very likely to fail, or to return an incorrect result.

The simplest way to solve this issue is to make use of Decision Steps to check if a Continue on Failure step succeeded, and design the workflow around those possibilities.

Changes to workflows

Even small changes can break existing workflows. If you made any changes to the workflow prior to the failures, examine those changes closely to see if they could have led to an issue. Some examples include changing the inputs to certain steps, adding new steps between 2 steps that were previously working, or making changes to automated decision logic.

Changes to incoming data

It’s possible to set up a workflow, have it ingest data for weeks or months, and only then see an entry in the data that causes problems. While rare, data that you haven’t previously accounted for can always appear, and your workflow needs to handle it.

Compare old successful jobs to the failures, and determine if any changes to the trigger inputs could be the problem.

Permissions issues

You might think that once you get your credentials and permissions correct, things should remain stable, but that’s not always the case. Your underlying credentials may not have the right permissions for every situation that arises.

For example, credentials may be owned by someone else in your organization. Reach out to them to find out if any permissions or scopes have changed, or if the credentials themselves are still valid at all.

There may also be issues with data you don’t have permission to access. We usually see this with email-based systems, where permissions can be complex and interrelated. For example, you may have access to a mailbox, but not to every item that enters it. Work closely with the administrators of these systems to determine what permissions may be required and if any emergent data is outside of those permission scopes.

Contact support

If you continue to have problems with an orchestrator, reach out to Rapid7 Support with the error and orchestrator information and we’ll help you investigate.