Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jaeger mangles (?) Trace IDs sent with Thrift #1951

Closed
haljin opened this issue Nov 29, 2019 · 12 comments · May be fixed by opencensus-beam/opencensus-jaeger#3
Closed

Jaeger mangles (?) Trace IDs sent with Thrift #1951

haljin opened this issue Nov 29, 2019 · 12 comments · May be fixed by opencensus-beam/opencensus-jaeger#3

Comments

@haljin
Copy link

haljin commented Nov 29, 2019

Requirement - what kind of business use case are you trying to solve?

Match Trace IDs output to logs to those in Jaeger

Problem - what in Jaeger blocks you from solving the requirement?

Jaeger interprets Trace IDs in a mysterious way that makes the hexed ID completely different

I have an application that uses OpenCensus to generate spans and traces and then sends those to Jaeger using the Thrift interface. At the same time it logs the current Trace ID with the log entries so that we can match the traces from Jaeger to application logs. Except these IDs don't seem to match. Sometimes they do however.

A couple examples (The application is written in Elixir and I used that to parse so I'll use the byte syntax to show binaries)

Trace ID generated by the application:
0e7b69838055876fdf8b7ee0ecfc276a

Trace ID in Jaeger:
e7b69838055876fa074811f1303d895

If we look at it in binary format:
Application:
00001110011110110110100110000011100000000101010110000111011011111101111110001011011111101110000011101100111111000010011101101010
Jaeger:
11100111101101101001100000111000000001010101100001110110111110100000011101001000000100011111000100110000001111011000100100000101

If you look closely it seems that leading zeros have been removed (I found #1578 but it seems only somewhat related to this problem) while the lower part is inverted? Upper part has leading zeroes while lower part has a leading 1.
Application upper:
0000111001111011011010011000001110000000010101011000011101101111
Jaeger upper:
1110011110110110100110000011100000000101010110000111011011111010

Application lower:
1101111110001011011111101110000011101100111111000010011101101010
Jaeger lower:
0000011101001000000100011111000100110000001111011000100100000101

Trace ID generated by the application:
3b4b7e0eb55df3283a352c86fa0df6af

Trace ID in Jaeger:
3b4b7e0eb55df3283a352c86fa0df6af

Binary:
00111011010010110111111000001110101101010101110111110011001010000011101000110101001011001000011011111010000011011111011010101111

This one seems to match perfectly? I noticed that both lower and upper half have leading zeroes.

Upper:
0011101101001011011111100000111010110101010111011111001100101000
Lower:
0011101000110101001011001000011011111010000011011111011010101111

Trace ID generated by the application:
9fd1aaa32ff34ca5ad7869260f54dc08

Trace ID in Jaeger:
e02e555cd00cb35ad28796d9f0ab23f7

Binary:
Application:
10011111110100011010101010100011001011111111001101001100101001011010110101111000011010010010011000001111010101001101110000001000
Jaeger:
11100000001011100101010101011100110100000000110010110011010110101101001010000111100101101101100111110000101010110010001111110111

Now both lower and upper have leading 1s causing weird things to happen most notable bits invert:

Application upper:
1001111111010001101010101010001100101111111100110100110010100101
Jaeger upper:
1110000000101110010101010101110011010000000011001011001101011010

Application lower:
1010110101111000011010010010011000001111010101001101110000001000
Jaeger lower:
1101001010000111100101101101100111110000101010110010001111110111

Is this some unsigned/signed magic that is happening that causes Jaeger to interpret these numbers differently? Is this some 1s complement magic with a sign bit? I am very confused and don't know if it is something I can change in my application (or rather the reporter library) or is it something in Jaeger? I am running jaeger with jaegertracing/all-in-one:1.14 Docker image, but have seen the same problem happen on the latest image.

@yurishkuro
Copy link
Member

It is more likely that the library you're using to generate the messages is doing something different. I think Jaeger internally is pretty consistent, and we have integration tests in different languages that validate string representation (aside from the 0 padding issue, which is not addressed in all clients, but that doesn't seem to be the main problem in your case).

Internally Jaeger always represents trace ID as two uint64s, in the clients, in the server, and in Thrift. The algorithm for generating a string hex representation is here: https://github.com/jaegertracing/jaeger-client-go/blob/3585cc566102e0ea2225177423e3fcc3d2e5fd7a/span_context.go#L214. The raw bytes are interpreted using big-endian ordering.

@haljin
Copy link
Author

haljin commented Nov 30, 2019

I was suspecting that but I still cannot figure out what is wrong. In example 1 above the trace id generated by the application was 19249984526203758347956588965281933162.

The exporter deals with it using https://github.com/opencensus-beam/opencensus-jaeger/blob/master/src/oc_reporter_jaeger.erl :

 TraceId = Span#span.trace_id,
  <<High:64, Low:64>> = <<TraceId:128/integer>>,

  ParentSpanId = case Span#span.parent_span_id of
                   undefined -> 0;
                   PSI -> PSI
                 end,

  [#'Jaeger.Thrift.Span'{'traceIdLow' = Low,
                         'traceIdHigh' = High,

Running that gives me:

19> <<High:64, Low:64>> = <<TraceId:128/integer>>.
<<14,123,105,131,128,85,135,111,223,139,126,224,236,252, 39,106>>
20> High.
1043543752181122927
21> Low.
16108107996734433130
17> High bsl 64 + Low.   (bit shift left)
19249984526203758347956588965281933162

Everything looks fine here - the first byte is 0e and that is what I see in the logs, but in Jaeger it is e7. That is also interpreting those numbers as big endian.

@yurishkuro
Copy link
Member

$ python
>>> x=19249984526203758347956588965281933162L
>>> x
19249984526203758347956588965281933162L
>>> "{0:0{1}x}".format(x,32)
'0e7b69838055876fdf8b7ee0ecfc276a'

the first byte is 0e and that is what I see in the logs, but in Jaeger it is e7

When you say "in Jaeger", what do you mean exactly? In the UI?

It seems we never fixed #1578 on the backend, let's see if #1956 works.

@haljin
Copy link
Author

haljin commented Dec 2, 2019

Yes, I mean in the UI. :)

@yurishkuro
Copy link
Member

@haljin can you try with latest master?

@haljin
Copy link
Author

haljin commented Dec 5, 2019

I assume the jaegertracing/all-in-one:latest docker image contain the latest master?

It seems the problems still persist, although there it seems it matches more often now.

E.g;
af02d2d55b9ade8536a7750ab74511ac in Jaeger UI
d0fd2d2aa465217a36a7750ab74511ac in the log

Note that the lower half matches

7adc1410ac412c95b2e1f3f88477f8dd in the Jaeger UI
7adc1410ac412c95cd1e0c077b880722 in the log

Now the upper half matches?

@yurishkuro
Copy link
Member

Can you post your Elixir marshalling/unmarshalling code?

@haljin
Copy link
Author

haljin commented Dec 6, 2019

The Thrift code is what I posted above, it's an Erlang library: https://github.com/opencensus-beam/opencensus-jaeger/blob/master/src/oc_reporter_jaeger.erl

make_spans(Span) ->

  TraceId = Span#span.trace_id,
  <<High:64, Low:64>> = <<TraceId:128/integer>>,

  ParentSpanId = case Span#span.parent_span_id of
                   undefined -> 0;
                   PSI -> PSI
                 end,

  [#'Jaeger.Thrift.Span'{'traceIdLow' = Low,
                         'traceIdHigh' = High,
                         'spanId' = Span#span.span_id,
                         'parentSpanId' = ParentSpanId,
                         'operationName' = Span#span.name,
                         'references' = to_references(Span#span.links),
                         'flags' = Span#span.trace_options,
                         'startTime' = wts:to_absolute(Span#span.start_time),
                         'duration' = wts:duration(Span#span.start_time, Span#span.end_time),
                         'tags' = to_tags(Span#span.attributes),
                         'logs' = to_logs(Span#span.time_events)}].

The code for the logger is here: https://github.com/opencensus-beam/opencensus_elixir/blob/master/lib/opencensus/span_context.ex

def hex_span_id(n) when is_integer(n) and n > 0,
    do: :io_lib.format("~16.16.0b", [n]) |> to_string()

@yurishkuro
Copy link
Member

What does this mean?

TraceId = Span#span.trace_id,
  <<High:64, Low:64>> = <<TraceId:128/integer>>,

Is the trace_id field represented as 128bit number? How does it get split into two 64bit parts?

@haljin
Copy link
Author

haljin commented Dec 9, 2019

It encodes the number on 128 bits and then takes the first 64 bits and puts them as an integer on the first half and the second 64 bits as an integer as the second half.

3> TraceId = 19249984526203758347956588965281933162.
19249984526203758347956588965281933162
4> << TraceId:128/integer>>.                        
<<14,123,105,131,128,85,135,111,223,139,126,224,236,252,39,106>> %Binary syntax - full Id byte by byte
9> << High:64>>.
<<14,123,105,131,128,85,135,111>> %First half of the binary
10> << Low:64>>. 
<<223,139,126,224,236,252,39,106>> %Second half of the binary

6> High.
1043543752181122927
7> Low.
16108107996734433130
8> High bsl 64 + Low.
19249984526203758347956588965281933162

@yurishkuro
Copy link
Member

These numbers look ok to me:

>>> n=19249984526203758347956588965281933162L
>>> "{0:0{1}x}".format(n,32)
'0e7b69838055876fdf8b7ee0ecfc276a'
>>> "{0:0{1}x}".format(1043543752181122927,16)
'0e7b69838055876f'
>>> "{0:0{1}x}".format(16108107996734433130,16)
'df8b7ee0ecfc276a'

However, you cannot assign 16108107996734433130 to the field in Thrift as is, because it's too large for int64, you need to convert it to a negative complement. See how this is done in jaeger-client-python:

https://github.com/jaegertracing/jaeger-client-python/blob/51d9027c3e0b87f6721dd14e7bc2b3303ce682c2/jaeger_client/thrift.py#L50

def id_to_int(big_id):
    if big_id is None:
        return None
    # thrift defines ID fields as i64, which is signed,
    # therefore we convert large IDs (> 2^63) to negative longs
    if big_id > _max_signed_id:
        big_id -= _max_unsigned_id
    return big_id

KronicDeth added a commit to KronicDeth/opencensus-jaeger that referenced this issue Oct 21, 2020
Fixes jaegertracing/jaeger#1951

Thrift IDs are i64, which is a _signed_ 64-bit integer, but the Jaeger protocol says IDs are unsigned, so in our records we store them unsigned.  When sending the IDs through Thrift, we need to convert the bits directly from unsigned to signed.  This matches the behavior in the Python client (https://github.com/jaegertracing/jaeger-client-python/blob/51d9027c3e0b87f6721dd14e7bc2b3303ce682c2/jaeger_client/thrift.py#L32-L47) as pointed out here (jaegertracing/jaeger#1951 (comment)), but we get to use type specifiers to make it clearer (:tada: Erlang!) instead of bit shifts and masking that Python needs to do.
@yurishkuro
Copy link
Member

I assume this has been answered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants