Open telemetry usage (Proof of Concept) #379

Draft
AcquaDiGiorgio wants to merge 2 commits into DIRACGrid:main from AcquaDiGiorgio:issue-257-open-telemetry-usage

@AcquaDiGiorgio

See #257.
This pull request showcases one possible way of implementing OpenTelemetry (OTel) traces and metrics. The code is NOT final and is not intended to be merged.

Important notes:

  1. With OTel we can not only record selected information about variables, but also see, for example, how long each function takes. This might be useful for locating unusual agent executions or slow response times.

     First access to the Auth API:
     [Screenshot from 2024-12-11 14-41-26]
     Second access (after caching):
     [Screenshot from 2024-12-11 14-43-53]
     Compare the execution times of the 3rd and 4th spans (diracx.routers.auth.utils.initiate_authorization_flow_with_iam and diracx.routers.auth.utils.get_server_metadata) between the two accesses.

  2. Currently the OTel code lives in the "routers" package, but we could move it to the core of DiracX and trace everything, from the headers used at the API level to the specific variables involved in the execution of a database query.

  3. I implemented tracing using decorators. This is the most elegant solution I was able to come up with, but decorating every function could be a pain and is also rather over the top. Instead, we could decorate only strategic functions or methods that are problematic or simply interesting to look at, or we could offer a secondary option of tracing inside a function body using a context manager.

  4. Because we are using custom FastAPI routers with already decorated entry points, those entry points must NOT be decorated. Tracing with OTel clashes with an already traced function, so these decorators should only be used on internal functions.

  5. Lastly, we should consider the use of Propagators, which implement a way of distributing traces. For example, we could start a trace in a Python script on the client side, serialize and inject the span context into the request headers, and continue its processing on the server side, giving a full trace from the very first operation a user submitted to the last response that DiracX returned.
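To make the decorator and context-manager options from the notes above concrete, here is a minimal sketch. It is not the code in this PR. `RecordingTracer` is a hypothetical stand-in that only mimics the `start_as_current_span(name)` call of an OTel tracer so the example runs without the SDK; in real use one would pass the object returned by `opentelemetry.trace.get_tracer(__name__)` instead. The function names and the URL are also made up for illustration.

```python
import functools
from contextlib import contextmanager


class RecordingTracer:
    """Hypothetical stand-in exposing the same start_as_current_span() call
    as an OpenTelemetry tracer, so the sketch runs without the SDK."""

    def __init__(self):
        self.finished = []  # span names, in the order the spans ended

    @contextmanager
    def start_as_current_span(self, name):
        try:
            yield None  # a real tracer yields a Span object here
        finally:
            self.finished.append(name)


def traced(tracer, name=None):
    """Decorator option: run the wrapped function inside its own span."""

    def decorator(func):
        span_name = name or f"{func.__module__}.{func.__qualname__}"

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name):
                return func(*args, **kwargs)

        return wrapper

    return decorator


tracer = RecordingTracer()


@traced(tracer, name="auth.get_server_metadata")
def get_server_metadata(url):
    # Hypothetical internal helper, not the DiracX implementation.
    return {"issuer": url}


def initiate_flow(url):
    # Context-manager option: trace one step without decorating the function.
    with tracer.start_as_current_span("auth.initiate_flow"):
        return get_server_metadata(url)


result = initiate_flow("https://iam.example.invalid")
print(tracer.finished)
```

The inner span ends before the outer one, which is what produces the nested view in a trace UI. Note that this wrapper is synchronous: an `async def` function would need an async wrapper that awaits the call inside the `with` block, otherwise the span would end before the coroutine runs.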
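The "serialize and inject the span context into the headers" step from the Propagators note can be sketched by hand. The example below writes and parses the W3C Trace Context `traceparent` header (version, trace id, parent span id, flags), which is the format used by OTel's trace-context propagator. This is an illustration of the mechanism only: the helper names are my own, and in practice the `inject` and `extract` functions from `opentelemetry.propagate` would do this work.

```python
import re
import secrets

# W3C Trace Context "traceparent" header: version-traceid-parentid-flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers, trace_id, span_id, sampled=True):
    """Client side: serialize the span context into the outgoing headers."""
    flags = 1 if sampled else 0
    headers["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"
    return headers


def extract(headers):
    """Server side: return (trace_id, span_id, sampled), or None if the
    header is missing or malformed."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id = int(match.group(1), 16)
    span_id = int(match.group(2), 16)
    if trace_id == 0 or span_id == 0:  # all-zero ids are invalid
        return None
    return trace_id, span_id, bool(int(match.group(3), 16) & 1)


# Client script: create ids and attach them to the request headers.
client_trace_id = secrets.randbits(128) or 1
client_span_id = secrets.randbits(64) or 1
headers = inject({}, client_trace_id, client_span_id)

# Server: recover the context and continue the same trace.
context = extract(headers)
print(headers["traceparent"], context is not None)
```

Because the server reuses the trace id it received, spans created on both sides end up in one trace, which is what would give the end-to-end view described above.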

For the metrics part, the only interesting use I was able to find (there are surely a hundred more) is measuring the availability of CE job slots: counters go up when a job is submitted and down when it completes or fails. We could also compare these counters with the values reported by the CE status, for example to see the percentage of jobs submitted by the different research groups.
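A rough sketch of the slot-counting idea follows. `UpDownCounter` here is a hypothetical stand-in that mimics the `add(amount, attributes)` call of an OTel up-down counter (the kind created with `meter.create_up_down_counter(...)`), so the example runs without the SDK. The CE and group names, the `SlotUsage` class, and `share_by_group` are all assumptions made for illustration.

```python
from collections import Counter


class UpDownCounter:
    """Hypothetical stand-in with the same add(amount, attributes) call as an
    OpenTelemetry up-down counter, so the sketch runs without the SDK."""

    def __init__(self):
        self.values = Counter()

    def add(self, amount, attributes=None):
        key = tuple(sorted((attributes or {}).items()))
        self.values[key] += amount


class SlotUsage:
    """Tracks occupied job slots per CE and per group."""

    def __init__(self, counter):
        self.counter = counter

    def job_submitted(self, ce, group):
        self.counter.add(1, {"ce": ce, "group": group})

    def job_finished(self, ce, group):
        # Called for completed and failed jobs alike: the slot is free again.
        self.counter.add(-1, {"ce": ce, "group": group})


def share_by_group(counter, ce):
    """Percentage of the occupied slots on one CE held by each group."""
    per_group = {}
    for key, count in counter.values.items():
        attributes = dict(key)
        if attributes.get("ce") == ce and count > 0:
            per_group[attributes["group"]] = count
    total = sum(per_group.values())
    return {group: 100 * n / total for group, n in per_group.items()}


counter = UpDownCounter()
usage = SlotUsage(counter)
for _ in range(4):
    usage.job_submitted("ce01", "group_a")
usage.job_submitted("ce01", "group_b")
usage.job_finished("ce01", "group_a")  # one group_a job completed or failed

shares = share_by_group(counter, "ce01")
print(shares)
```

With a real meter the per-group breakdown would normally be computed by the metrics backend from the `group` attribute rather than in application code; it is done inline here only to keep the sketch self-contained.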

@chaen
Contributor

chaen commented Jan 27, 2025

Hi @AcquaDiGiorgio, thanks a lot for this PR!
I'll reach out privately to see how we can follow up from there.
