Version: 6.4

Flow Collector Queues 90% Full

The Unified Flow Collector’s log reports that the flow processor to output writer or UDP Server to Flow Decoder queue is 90% full.

SYMPTOM

The flow collector’s log reports one or both of the following messages:

{"level":"info","ts":"2023-08-07T08:08:14.301Z","logger":"flowcoll","caller":"flowprocessor/metrics.go:118","msg":"flow processor to output writer is 90% full. This is normal when the collector is starting. If it persists for hours, it may indicate that you are at your license threshold or your system is under-resourced."}
{"level":"info","ts":"2023-08-07T08:08:34.264Z","logger":"flowcoll","caller":"server/metrics.go:125","msg":"UDP Server to Flow Decoder is 90% full. This is normal when the collector is starting. If it persists for hours, it may indicate that you are at your license threshold or your system is under-resourced."}

These logs might also be accompanied by throttler logs:

2023-06-28T21:20:21.821Z warn throttle/restricted_throttle.go:105 [throttler]: start burst
2023-06-28T21:20:41.822Z warn throttle/restricted_throttle.go:111 [throttler]: stop burst
2023-06-28T21:20:41.822Z warn throttle/restricted_throttle.go:117 [throttler]: start recovery
2023-06-28T21:50:42.142Z warn throttle/restricted_throttle.go:123 [throttler]: stop recovery

PROBLEM

It is typical for these messages to occur when the collector first starts, as various internal processes may not yet be fully initialized. However, if the messages persist after the first few minutes, one of the following issues may exist:

  • ONLY flow processor to output writer - This indicates that the system to which data is being output lacks sufficient performance to ingest records at the rate being sent by the collector. This may be due to insufficient CPU, memory, or disk space, or to excessive disk latency. Insufficient network bandwidth between the collector and the target system might also cause the problem. (also see the NOTE below)
  • BOTH UDP Server to Flow Decoder and flow processor to output writer - This is a further progression of the previous condition. The resulting back pressure from the slow downstream system is now likely causing data to be lost.
  • ONLY UDP Server to Flow Decoder - The internal decoder/processor workers cannot keep up with the rate of records being received. This can be caused by one of the following conditions:
    • More records are being received than are allowed by the license. If so, throttler messages will also appear in the log.
    • The collector has insufficient resources, primarily CPU cores, to process the rate of records being received (a quick check is sketched after this list).
    • The collector has just been started and the caches (for IPs, interfaces, etc.) have yet to be "warmed up", so the related high-latency enrichment tasks are limiting throughput.
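
To quickly check whether CPU is the limiting factor, compare the cores available on the collector host with the collector's current CPU usage. The commands below are a minimal sketch for a Linux host and assume the collector runs as a process named flowcoll; adjust the process name to match your deployment.

nproc                        # cores available on the collector host
top -b -n 1 | grep flowcoll  # CPU usage of the collector process
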
note

6.x versions prior to 6.3.4 had an issue with automatically scaling the output pool size for OpenSearch and Splunk based on the Licensed Units. Increasing the output pool size manually, via EF_OUTPUT_OPENSEARCH_POOL_SIZE or EF_OUTPUT_SPLUNK_HEC_POOL_SIZE respectively, often resolved the issue. Upgrading to 6.3.4 or later also fixes the issue.
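
If upgrading is not immediately possible, the pool size can be raised manually. The lines below are a minimal sketch assuming the collector is configured through environment variables (for example in a systemd drop-in or container environment file); the value 16 is illustrative only and should be sized to your Licensed Units and output target.

# set whichever variable matches your configured output
EF_OUTPUT_OPENSEARCH_POOL_SIZE=16
EF_OUTPUT_SPLUNK_HEC_POOL_SIZE=16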

SOLUTION

The solution varies depending on the indicated issue, as described in the problem section above.

  • ONLY flow processor to output writer - Increase the performance of the system to which records are being sent.
  • BOTH UDP Server to Flow Decoder and flow processor to output writer - Increase the performance of the system to which records are being sent.
  • ONLY UDP Server to Flow Decoder
    • If throttler messages also appear in the log, contact sales@elastiflow.com to learn about subscription options that will allow you to collect more flow records.
    • Increase the CPU cores available to the collector.
    • If the collector has sufficient CPU resources, try increasing the processor pool size by setting EF_PROCESSOR_POOL_SIZE (see the sketch after this list). This allows greater concurrency of high-latency enrichment tasks.
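
A minimal sketch of that setting, assuming the collector is configured through environment variables as in the note above; the value shown is illustrative only.

EF_PROCESSOR_POOL_SIZE=8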

REFERENCE