Conversation

@prodion23
Collaborator

Description

Originally I tried implementing this by storing a single global exporter, with each newly registered service tracer getting its own in-memory exporter. This performed essentially the same (a marginal benefit of ~1% in benchmarks).

As an alternative, otel exposes a WithGRPCConn option we can pass to each exporter. This keeps our implementation simpler and achieves essentially the same result.

Since we need to keep a reference to the shared connection, the flow would be:

// Create the shared gRPC connection before calling one of the .Init functions
grpcConn, err := hypertrace.CreateGrpcConn(agentCfg)

// We need to handle closing the conn ourselves; the underlying TracerProviders will not
// shut it down since they don't "own" the connection.
defer grpcConn.Close()

flusher, err := hypertrace.InitWithSpanProcessorWrapperAndZap(...<all the args>, opentelemetry.WithGrpcConn(grpcConn))
...
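For context, here is a minimal sketch of the same pattern using the stock opentelemetry-go APIs rather than the hypertrace wrapper; the collector address, service names, and insecure credentials are assumptions for illustration only:

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx := context.Background()

	// One shared connection, dialed once and reused by every exporter.
	// "localhost:4317" is an assumed collector endpoint.
	conn, err := grpc.NewClient("localhost:4317",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	// The TracerProviders don't own the connection, so we close it ourselves.
	defer conn.Close()

	for _, svc := range []string{"service-a", "service-b"} {
		// Each service gets its own exporter, but WithGRPCConn hands it the
		// existing connection instead of letting it dial a new one.
		exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithGRPCConn(conn))
		if err != nil {
			log.Fatal(err)
		}
		res, err := resource.New(ctx,
			resource.WithAttributes(semconv.ServiceNameKey.String(svc)))
		if err != nil {
			log.Fatal(err)
		}
		tp := sdktrace.NewTracerProvider(
			sdktrace.WithBatcher(exp),
			sdktrace.WithResource(res),
		)
		defer func(tp *sdktrace.TracerProvider) { _ = tp.Shutdown(ctx) }(tp)
	}
}

The wrapper's opentelemetry.WithGrpcConn option plays the same role here as otlptracegrpc.WithGRPCConn: the exporters reuse the caller-owned connection instead of each dialing their own.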

In my benchmarks using the WithGrpcConn option with 10 services and sending 10k spans per service:

| Metric                   | Without Shared Connection | With Shared Connection |
|--------------------------|---------------------------|------------------------|
| **Performance**          |                           |                        |
| Total Time (ns/op)       | 21,856,828                | 19,969,818             |
| Span Creation (ns/op)    | 21,855,983                | 19,968,738             |
| Shutdown (ns/op)         | 6,009,575,057             | 6,007,935,632          |
| Connection Creation (ns) | N/A                       | 184.0                  |
| **Memory**               |                           |                        |
| Allocation (MB)          | 7.416                     | 11.51                  |
| Total Allocation (MB)    | 31.09                     | 32.61                  |
| System (MB)              | 38.83                     | 39.08                  |
| GC Count                 | 7.000                     | 6.000                  |
| **Network**              |                           |                        |
| Max Connection Count     | 0                         | 0                      |
| Network Received (MB)    | 10.49                     | 15.28                  |
| Network Sent (MB)        | 10.44                     | 15.20                  |
| **Scale**                |                           |                        |
| Total Services           | 10.00                     | 10.00                  |
| Total Spans              | 10,000                    | 10,000                 |

(I used the gopsutil pkg for the network measurements and was reporting to a local collector, so I think that may be skewing the network stats specifically.)

@codecov

codecov bot commented Jul 14, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 47.70%. Comparing base (95e853c) to head (cbf4056).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
instrumentation/opentelemetry/init.go 50.00% 19 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #252      +/-   ##
==========================================
- Coverage   47.79%   47.70%   -0.09%     
==========================================
  Files          62       62              
  Lines        3013     3050      +37     
==========================================
+ Hits         1440     1455      +15     
- Misses       1499     1518      +19     
- Partials       74       77       +3     

☔ View full report in Codecov by Sentry.

@t-santoshsahu

I think creating the connection explicitly will be better. This is good.
@prodion23 Does it have a timeout config? We do timeouts on TPA, so this may not be needed for load balancing.

@prodion23
Collaborator Author

@t-santoshsahu
The OTLP gRPC exporter does have a timeout option, though we weren't exposing it before (default of 10s).
However, I don't think it is respected anyway, because we are using the BatchSpanProcessor, which wraps each export in its own ExportTimeout (default of 30s):

func (bsp *batchSpanProcessor) exportSpans(ctx context.Context) error {
	bsp.timer.Reset(bsp.o.BatchTimeout)

	bsp.batchMutex.Lock()
	defer bsp.batchMutex.Unlock()

	if bsp.o.ExportTimeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, bsp.o.ExportTimeout)
		defer cancel()
	}

	if l := len(bsp.batch); l > 0 {
		Debug("exporting spans", "count", len(bsp.batch), "total_dropped", atomic.LoadUint32(&bsp.dropped))
		defer func() {
			if r := recover(); r != nil {
				Error(fmt.Errorf("panic value: %v.\n\n[stacktrace]:\n%s", r, string(debug.Stack())), "recovering from a panic")
				// Reset the batch if len is greater than 0
				if len(bsp.batch) > 0 {
					clear(bsp.batch) // Erase elements to let GC collect objects
					bsp.batch = bsp.batch[:0]
				}
			}
		}()
		err := bsp.e.ExportSpans(ctx, bsp.batch)
		// A new batch is always created after exporting, even if the batch failed to be exported.
		//
		// It is up to the exporter to implement any type of retry logic if a batch is failing
		// to be exported, since it is specific to the protocol and backend being sent to.
		clear(bsp.batch) // Erase elements to let GC collect objects
		bsp.batch = bsp.batch[:0]
		if err != nil {
			return err
		}
	}
	return nil
}
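For reference, a minimal sketch (against the stock opentelemetry-go APIs, not the hypertrace wrapper) of where each of the two timeouts could be set if we did want to expose them; the newTracerProvider helper and the 5s/15s durations are hypothetical and for illustration only:

package tracing

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"google.golang.org/grpc"
)

// newTracerProvider is a hypothetical helper wiring an exporter onto a shared conn.
func newTracerProvider(ctx context.Context, conn *grpc.ClientConn) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithGRPCConn(conn),
		// Per-export RPC timeout on the exporter itself (defaults to 10s).
		otlptracegrpc.WithTimeout(5*time.Second),
	)
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp,
			// The BatchSpanProcessor wraps each ExportSpans call in its own
			// context.WithTimeout, as in exportSpans above (defaults to 30s).
			sdktrace.WithExportTimeout(15*time.Second),
		),
	), nil
}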

@prodion23 prodion23 merged commit e15303b into main Jul 16, 2025
6 of 7 checks passed
@tim-mwangi tim-mwangi deleted the ENGTAI-64543-shared-exporter-connection branch July 21, 2025 16:57
