feat: support APM Server intake API version 2#465
Conversation
|
This PR should technically be ready for testing with the POC. The checklist above is meant for making it production ready |
|
would be great to get some benchmarks to see if this implementation is better than the current intake protocol, for example from a memory usage perspective. |
lib/agent.js
Outdated
| agent._apmServer.flush(cb) | ||
| }) | ||
| } else { | ||
| process.nextTick(cb) |
There was a problem hiding this comment.
Can you log something here so we can tell when someone tries to send something without _apmServer set?
| if (this._apmServer) { | ||
| this._apmServer.flush(cb) | ||
| } else { | ||
| process.nextTick(cb) |
There was a problem hiding this comment.
Again, log something so we know when _apmServer is not set.
lib/instrumentation/transaction.js
Outdated
| }) | ||
|
|
||
| Object.defineProperty(this, 'timestamp', { | ||
| get: function () { |
There was a problem hiding this comment.
Could use shorthand here. get () {
|
|
||
| if (this.ended) { | ||
| this._agent.logger.debug('transaction already ended - cannot build new span %o', {id: this.id}) | ||
| this._agent.logger.debug('transaction already ended - cannot build new span %o', {id: this.id}) // TODO: Should this be supported in the new API? |
There was a problem hiding this comment.
When everything is fully streaming, I don't think there's any reason to prevent further spans. It's only useful without streaming to ensure the transaction doesn't stay open forever.
There was a problem hiding this comment.
The reason why I'm hesitant with this is that we might have a transaction that kicks off some form of background job that gets contextually attached to the transaction that started it. Any spans that's created as a result of this would get associated with the transaction. This will make the span waterfall almost useless as it can suddenly cover a period of days or weeks. Am I being too paranoid?
There was a problem hiding this comment.
That'd be user mode queuing though. We won't actually form that link unless we explicitly bind the callback. The logical continuation to the next job would be out-of-band from the request, it'd be triggered by the completion of another job in the queue.
I could see maybe time-based jobs naively using setTimeout for everything, and that maybe being an issue. But that's just not great design and we probably should be drawing attention to it in some way. 🤔
There was a problem hiding this comment.
Yeah setTimeout and friends are probably the most common way to trigger this.
Question is however if the spans happening after the transaction ends are relevant to the user at all?
@elastic/apm-agent-devs @roncohen @makwarth In the new intake API, we get the ability to record spans that start after the parent transaction ends. Is this something we want to do or should those just be ignored? I.e. are they adding any value to the user or are they just noise?
There was a problem hiding this comment.
I.e. are they adding any value to the user or are they just noise?
I think that depends on how the application is making use of that. I'd rather default to not ignore than to ignore these spans.
There was a problem hiding this comment.
If we don't ignore them by default, would it be a good idea to have a config option to ignore them?
It could even have 3 settings:
- Ignore spans started after the transaction ends
- Ignore spans started before the transaction ends, but ended after
- Allow all spans
There was a problem hiding this comment.
i understand you're worried about the case where people would set up a recurring periodic job from inside a transaction, because it would mean spans continue to happen in an endless stream. Two ideas:
- We could experiment in making the assumption that there are two categories of delayed execution
process.nextTick()/setImmediate()andsetTimeout()/setInterval(). The first one could indicate that the developer intended this to be part of the transaction, while the second category will start a new transaction. - We could set a numerical limit to the number of spans that we will consider, after a transaction has ended - or a wall clock timeout which would be the limit to how much time we'll continue to "monitor" a transaction after is has ended.
There was a problem hiding this comment.
We could also include a special escape hatch that specifically calls to setTimeout or setInterval after the transaction ends would not continue the context. That way, anything else in the code path will continue, but anything that could potentially be significantly delayed will not.
|
To make the scope of this PR as small as possible, I've made it go to a different branch than |
|
To help this PR getting merge, I've created the meta issue #471 and moved all the non-essential todo items from this PR to that |
|
jenkins, run tav tests please |
| this._apmServer.flush(cb) | ||
| } else { | ||
| this.logger.warn(new Error('cannot flush agent before it is started')) | ||
| process.nextTick(cb) |
There was a problem hiding this comment.
Would it be worthwhile for the flush callback be able to receive that error?
There was a problem hiding this comment.
I left it out as this only occurs if the agent isn't started. So from the outside I want the user to use the module in the same way no matter if the agent is started or not. And normally errors coming from the _apmServer is just logged, so I think it's best to maintain that behavior here as well.
There was a problem hiding this comment.
Actually I just realized that the callback given to captureError will pass a similar error along in the callback: https://github.com/elastic/apm-agent-nodejs/pull/465/files#diff-7403da6ce2244c65d641fdfcc095be93R331
But I think maybe that should be avoided in captureError. If so the captureError callback will never be called with an error either.
lib/config.js
Outdated
| active: true, | ||
| logLevel: 'info', | ||
| hostname: os.hostname(), | ||
| hostname: os.hostname(), // TODO: Should we just let the http client default to this? |
There was a problem hiding this comment.
Defaulting in the http client sounds reasonable to me.
lib/instrumentation/index.js
Outdated
| if (!payload) return agent.logger.debug('transaction ignored by filter %o', {id: transaction.id}) | ||
| truncate.transaction(payload) | ||
| agent.logger.debug('sending transaction %o', {id: transaction.id}) | ||
| if (agent._apmServer) agent._apmServer.sendTransaction(payload) |
There was a problem hiding this comment.
Is there any way for agent._apmServer to be null when this._started is true? 🤔
There was a problem hiding this comment.
No you're right - if this._started === true, then we'll also have an agent._apmServer. Would that be a better check? The current check is guaranteed to always work, whereas the conditions of the other might change in the future right?
There was a problem hiding this comment.
I just made that comment, because this check is already in an outer if block checking this._started, so it seemed redundant. Github cuts off the preview just after that line though. 😅
There was a problem hiding this comment.
Oh yes... I'll remove it
| architecture: process.arch, | ||
| platform: process.platform | ||
| } | ||
| } |
There was a problem hiding this comment.
It seems like process, system and runtime stuff no longer exists? I don't see it in the http client PR either. Is that correct?
There was a problem hiding this comment.
Yeah, I can see it's a bit confusing given the current state of the all the in-progress dependencies.
As you can see here, this PR depends on a custom branch of the http client called v2-1 that currently lives on my fork: https://github.com/elastic/apm-agent-nodejs/pull/465/files#diff-b9cfc7f2cdf78a7f4b91a753d10865a2R74
In that branch the metadata responsibilities have been moved to the http client: https://github.com/watson/apm-nodejs-http-client/blob/4f02d7eda173b738a309fdca2e3464a1bbae5652/index.js#L293-L332
But there's no PR for this branch yet as I don't want to make the current http client PR too complicated, so once the the current PR is merged, I'll open a new one for the v2-1 branch 😅
There was a problem hiding this comment.
Just merged the first http client PR and created the next based on my v2-1 branch: elastic/apm-nodejs-http-client#6 - soon it should hopefully be less confusing
There was a problem hiding this comment.
Ah. Gotcha. I was trying to match up the two PRs to make sure the functionality lined up between the two. 😅
test/_apm_server.js
Outdated
| client.request('transactions', {}, body, () => {}) | ||
| req.pipe(zlib.createGunzip()).pipe(ndjson.parse()).on('data', function (data) { | ||
| if (req.method !== 'POST') throw new Error(`Unexpected HTTP method: ${req.method}`) | ||
| if (req.url !== '/v2/intake') throw new Error(`Unexpected HTTP url: ${req.url}`) |
There was a problem hiding this comment.
You could do the req field validation outside the pipe sequence and data handler. Also, the assert module might be cleaner here.
| t.equal(body.errors[0].exception.message, 'with callback') | ||
| .on('data-error', function (data) { | ||
| t.equal(data.exception.message, 'with callback') | ||
| t.end() |
There was a problem hiding this comment.
Does tape work correctly when you explicitly t.end() while using a t.plan(...)?
There was a problem hiding this comment.
Yes, all it does when reaching the t.end is just fail if the number of assertions doesn't match what's planned. This is a way to validate our expectations of the async execution order. I.e. when we reach this point, we should have made the planned number of assertions.
test/instrumentation/_agent.js
Outdated
| agent._instrumentation = sharedInstrumentation | ||
| agent._instrumentation.currentTransaction = null | ||
| agent._instrumentation._queue._clear() | ||
| agent._apmServer = mockClient(expected, cb || noop) // TODO: Expected will not work here |
There was a problem hiding this comment.
Is this comment still accurate? What's the issue?
There was a problem hiding this comment.
Hmm I think you're right. I think this line is completely irrelevant and can just be removed. I'll update it
| t.ok(payload.start > 0) | ||
| t.ok(payload.duration > 0) | ||
| assert.stacktrace(t, 'myTest3', __filename, payload.stacktrace, agent) | ||
| t.deepEqual(Object.keys(payload), ['transactionId', 'timestamp', 'name', 'type', 'start', 'duration']) |
There was a problem hiding this comment.
You don't think we need to validate the contents of the object here?
test/integration/no-sampling.js
Outdated
| var data = JSON.parse(Buffer.concat(buffers)) | ||
| t.equal(data.transactions.length, 20, 'expect 20 transactions to be sent') | ||
| t.end() | ||
| process.exit() |
There was a problem hiding this comment.
Does this process.exit() need to be here? I'd rather not have these hiding in the tests as they could cause confusion when adding more tests to the file in the future, if you don't notice them.
|
Finally managed to get a complete review it. Looks good, other than a couple more comments. |
|
Approved for v2 branch. Any remaining issues can be fixed in another PR, before merging v2 to master. |
Closes #356
Checklist