Odysseus Benchmarking


Odysseus Benchmarking

Postby stefan » Sun Jan 08, 2017 8:33 pm

Hello,

my first prototype is nearly ready, and I want to do some benchmarking to check whether my development meets certain performance requirements. The main steps of my development are:
1. Read data from csv files in Odysseus
2. Do some preprocessing
3. Send the data to a message queue
4. Read this data through a python script, create and execute a Neo4j statement and send the results to another message queue
5. Read this data in Odysseus
6. Do some processing in Odysseus
7. Write the results to csv files.

I want to measure the processing performance of the Odysseus script. Of course there is also some processing external to Odysseus, so I want to measure the processing time per tuple and/or the throughput at specific sections of the script.

I found the Benchmarker (old Odysseus Wiki), the benchmark operators, and the evaluation feature in the Odysseus Wiki. The documentation is not very extensive, and unfortunately I don't fully understand how to do such a benchmarking. Which feature and which operators should I use for this? How can I use them? What should be the first steps?

I would really appreciate any help!

Thanks in advance,
Stefan


Re: Odysseus Benchmarking

Postby Marco Grawunder » Mon Jan 09, 2017 10:03 am

Hi Stefan,

you can measure latency and throughput with Odysseus very easily.
To calculate latency, you need to add the latency metadata in the source definition part.

https://wiki.odysseus.offis.uni-oldenburg.de/display/ODYSSEUS/Latency

Code: Select all

#METADATA TimeInterval
#METADATA Latency


Important: in this case you need to add TimeInterval explicitly.

If you want to calculate throughput (which should be done directly after the sources), you can use the data rate operator

https://wiki.odysseus.offis.uni-oldenburg.de/display/ODYSSEUS/Data+rate

and add the Datarate metadata:

Code: Select all

#METADATA TimeInterval
#METADATA Latency
#METADATA Datarate


Here is a feature that allows you to automate some of the processing:
https://wiki.odysseus.offis.uni-oldenburg.de/display/ODYSSEUS/Evaluation+Feature

Greetings,

Marco


Re: Odysseus Benchmarking

Postby stefan » Fri Jan 13, 2017 2:25 pm

Hello Marco,

thank you very much for this information!
My idea was to measure different performance indicators (throughput etc.) of the query at different steps, to see the overall performance as well as the performance of specific code segments.

Odysseus reads user requests, which are preprocessed and sent to the message queue. After some processing in Python, the results are read by Odysseus again. One user request results in several new tuples after the Python processing. Is it possible to match these new tuples with the corresponding user request again?

Regarding the evaluation feature:
As far as I understand, the evaluation feature allows making multiple runs with different values. In my case the Neo4j database is changed during the processing, so the first run could not be compared with the following ones. Furthermore, I do not think I will have a value that should change during processing; I just want to make a performance measurement. Should I do this with operators only?
Otherwise some plots would be interesting. :)

Furthermore, I am using two query files with approx. 4 queries. Is it possible to measure the values across different queries, or should I create a new query that combines all processing steps into just one query?

Thank you very much!
Stefan


Re: Odysseus Benchmarking

Postby Marco Grawunder » Fri Jan 13, 2017 3:54 pm

Odysseus reads user requests, which are preprocessed and sent to the message queue. After some processing in Python, the results are read by Odysseus again. One user request results in several new tuples after the Python processing. Is it possible to match these new tuples with the corresponding user request again?


I am not sure what you mean ...

Regarding the evaluation feature:
As far as I understand, the evaluation feature allows making multiple runs with different values. In my case the Neo4j database is changed during the processing, so the first run could not be compared with the following ones. Furthermore, I do not think I will have a value that should change during processing; I just want to make a performance measurement. Should I do this with operators only?
Otherwise some plots would be interesting. :)


As far as I know, there are always multiple runs. If you change the state in between, this will not work, I guess.

If you just use the operators, it should not be too difficult to ask Excel for a diagram ;-)

Furthermore, I am using two query files with approx. 4 queries. Is it possible to measure the values across different queries, or should I create a new query that combines all processing steps into just one query?

If you do this by hand, of course. When using the evaluation feature this is not possible (but you can use #INCLUDE or copy text from other files into the current query file).


Re: Odysseus Benchmarking

Postby stefan » Fri Jan 13, 2017 4:30 pm

Hi,

Let me try to explain:
The user requests are preprocessed and sent to a message queue. After Python has done some processing, it sends multiple tuples back to Odysseus:

Odysseus --> Python --> Odysseus
1 Request - - - - - - - - - - - > n Responses

The n responses can be matched with the 1 request by their uuid.
If I want to know how long it takes from receiving the user request in Odysseus until these n responses reach the sink, I could measure the latency. But for that I have to know that the requests and the responses are the same "thing". For Odysseus they look like something different, because the processing is interrupted by the message queue/Python processing.
If that is hard to do, I will find a different way to get good results. :)

Regarding the multiple runs and Excel:
Yes, that is what I thought, too. I will play around a bit, but I guess I will not be able to use the evaluation feature here.
...and yes, Excel would do it, too... :D

Regarding the last point:
Ok, that answers my question. I think it's a good idea to use #INCLUDE or only one file, even if I will not need it for the evaluation.


Re: Odysseus Benchmarking

Postby Marco Grawunder » Fri Jan 13, 2017 5:37 pm

I think there is no way to check this automatically. You could add the latency with a MAP operator to the input of the Python processing ... and then again when the data comes back to Odysseus?
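
If it helps, here is a minimal Python sketch of that idea (illustration only, not Odysseus code; all names are made up): the Python worker copies the request's uuid and send timestamp into every response, so the round trip per request can be computed when the responses arrive.

Code: Select all

import time
import uuid

def make_request(payload):
    # what the MAP operator before the sender would attach:
    # a correlation id and a send timestamp in nanoseconds
    return {"uuid": str(uuid.uuid4()), "sent_ns": time.time_ns(), "payload": payload}

def run_neo4j_statement(payload):
    # placeholder for the real Neo4j call; returns n results per request
    return [payload + "/a", payload + "/b"]

def process(request):
    # the Python worker: every response carries the request's uuid and timestamp
    for result in run_neo4j_statement(request["payload"]):
        yield {"uuid": request["uuid"], "sent_ns": request["sent_ns"], "result": result}

# back on the Odysseus side: round trip from request creation to each response
req = make_request("MATCH (n) RETURN n")
for resp in process(req):
    print(resp["uuid"], (time.time_ns() - resp["sent_ns"]) / 1e6, "ms")

The same idea should work with the latency metadata: write the start timestamp into a payload attribute with MAP before sending, and let the Python side echo it back untouched.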


Re: Odysseus Benchmarking

Postby stefan » Fri Jan 13, 2017 5:58 pm

Hmm, ok, good to know. I think it will also work if I calculate the latency for the different parts separately. I have to think about that. Thanks. :)

Ok, there are some smaller questions I have to clarify:

Let's say I use the data rate (I assume it's quite similar for the latency). If my regular query looks like this:

Code: Select all

#PARSER PQL

#DROPALLQUERIES
#DROPALLSINKS
#DROPALLSOURCES

#RUNQUERY

name1 = ACCESS(...)

name2 = OPERATOR(..., name1)

name3 = OPERATOR(..., name2)

name4 = SENDER(..., name3)



... and if I want to measure each operator, my new query would look like this:

Code: Select all

#PARSER PQL

#DROPALLQUERIES
#DROPALLSINKS
#DROPALLSOURCES

#METADATA TimeInterval
#METADATA Latency
#METADATA Datarate

#RUNQUERY

name1 = ACCESS(...)

name1_drate = DATARATE({UPDATERATE = 3, KEY='name1'}, name1)

name2 = OPERATOR(..., name1_drate)

name2_drate = DATARATE({UPDATERATE = 3, KEY='name2'}, name2)

name3 = OPERATOR(..., name2_drate)

name3_drate = DATARATE({UPDATERATE = 3, KEY='name3'}, name3)

name4 = SENDER(..., name3_drate)



Is that correct? This means I have to use the data rate operator's output as input for the following operators. But I don't have to care about that metadata in the operators (e.g. for PROJECT etc.) because the metadata is not touched.

If I use an updaterate of 3 (that's small of course, just for this example) and I write that to a csv file, I get this output (these are values from my output):

attributes of tuple 1, [name1|0.0, name2|0.0, name3|0.0]
attributes of tuple 2, [name1|0.0, name2|0.0, name3|0.0]
attributes of tuple 3, [name1|141.51524810288194, name2|141.7870288877838, name3|141.25019321849348]
attributes of tuple 4, [name1|141.51524810288194, name2|141.7870288877838, name3|141.25019321849348]
attributes of tuple 5, [name1|141.51524810288194, name2|141.7870288877838, name3|141.25019321849348]
attributes of tuple 6, [name1|15725.660608792741, name2|15585.872964744756, name3|15796.457381157985]

I am not quite sure what this is telling me. I know the Datarate operator calculates the data rate in tuples per second.
But does this mean that at this point in time, at tuple 3, the name1 operator was able to process 141.52 tuples per second? And name2 141.79? And why is it much higher at tuple 6?

If I did the same for latency, I would see the timestamp for each operator and the calculated latency for this operation. Correct?

Sorry, this could be a very stupid question, but especially in the data rate case I am not really sure what the values are telling me... :)

greetings,
Stefan


Re: Odysseus Benchmarking

Postby Marco Grawunder » Fri Jan 13, 2017 6:07 pm

Is that correct? This means I have to use the data rate operator's output as input for the following operators. But I don't have to care about that metadata in the operators (e.g. for PROJECT etc.) because the metadata is not touched.


Yes.

I am not quite sure what this is telling me. I know the Datarate operator calculates the data rate in tuples per second.
But does this mean that at this point in time, at tuple 3, the name1 operator was able to process 141.52 tuples per second? And name2 141.79? And why is it much higher at tuple 6?


This is the algorithm for the datarate calculation:

Code: Select all

            if (elementsRead == updateRate) {
               long now = System.nanoTime();
               // time needed for the last updateRate elements
               long lastPeriodNano = now - lastTimestamp;

               // tuples per nanosecond ...
               double lastDataRateNano = updateRate / (double) lastPeriodNano;

               // ... normalized to tuples per second
               currentDatarate = lastDataRateNano * 1000000000.0;
               lastTimestamp = now;
               elementsRead = 0;
            }


So the rate is always normalized to one second. The value only becomes meaningful once the first update period is over; until then, the initial value (0.0) is reported.
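
For illustration, here is the same update rule in Python (just a sketch, not the actual operator code; the sleep times are made up to mimic your output). With an update rate of 3, tuples 1 and 2 still report the initial 0.0, tuple 3 reports the rate of the first window, and a window that is processed much faster yields a much higher rate, which is exactly the 141.5 vs. 15725 pattern in your csv.

Code: Select all

import time

class Datarate:
    # mirrors the Java snippet above: recompute tuples/second every update_rate tuples
    def __init__(self, update_rate):
        self.update_rate = update_rate
        self.elements_read = 0
        self.last_timestamp = time.monotonic_ns()
        self.current_datarate = 0.0  # reported until the first window is complete

    def on_tuple(self):
        self.elements_read += 1
        if self.elements_read == self.update_rate:
            now = time.monotonic_ns()
            last_period_nano = now - self.last_timestamp
            self.current_datarate = self.update_rate / last_period_nano * 1e9
            self.last_timestamp = now
            self.elements_read = 0
        return self.current_datarate

rate = Datarate(update_rate=3)
# first window: ~7 ms per tuple -> ~140 tuples/s; second window: much faster
for delay in (0.007, 0.007, 0.007, 0.0001, 0.0001, 0.0001):
    time.sleep(delay)
    print(rate.on_tuple())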


Re: Odysseus Benchmarking

Postby stefan » Fri Jan 27, 2017 12:08 am

Hi,

in one of my cases I read data from a csv file, transform it from Tuple to KeyValue, and send it to RabbitMQ.
Up to now I check the latency and the data rate before converting to KeyValue. Adding new attributes based on the measured data for a KeyValue object did not work for me:

Code: Select all

mrPreparedResult = MAP({
    expressions = [
        ['Latency.minlstart', 'Latency_minlstart'],
        ['Latency.maxlstart', 'Latency_maxlstart'],
        ['Latency.lend', 'Latency_lend'],
        ['Latency.latency', 'Latency_latency'],
        ['toDouble(toString(elementAt(Measurements[1],1)))', 'mrCsvSource'],
        ['toDouble(toString(elementAt(Measurements[0],1)))', 'mrRmqRequestKvo']
    ]///,
    ///KEEPINPUT = true
    },
    mrRmqRequestKvo_latency
)

Do I have to use an expression like $.? I read in the Wiki that I also have to replace KEEPINPUT with $, but what is the exact syntax? I tried e.g. ['$', 'payload'].

Another topic: Is there a way to measure how fast the sender operator is, or what the latency for the full processing (source --> sink) is?
As far as I understood, the data rate before the sender is (nearly) the data rate the sender is able to process. If the data rate of the previous operators were higher, the data would be buffered internally.
Without going too deep, is that correct?

greetings,
Stefan


Re: Odysseus Benchmarking

Postby Marco Grawunder » Fri Jan 27, 2017 10:42 am

Up to now I check the latency and the data rate before converting to KeyValue. Adding new attributes based on the measured data for a KeyValue object did not work for me:

mrPreparedResult = MAP({
    expressions = [
        ['Latency.minlstart', 'Latency_minlstart'],
        ['Latency.maxlstart', 'Latency_maxlstart'],
        ['Latency.lend', 'Latency_lend'],
        ['Latency.latency', 'Latency_latency'],
        ['toDouble(toString(elementAt(Measurements[1],1)))', 'mrCsvSource'],
        ['toDouble(toString(elementAt(Measurements[0],1)))', 'mrRmqRequestKvo']
    ]///,
    ///KEEPINPUT = true
    },
    mrRmqRequestKvo_latency
)

Do I have to use an expression like $.? I read in the Wiki that I also have to replace KEEPINPUT with $, but what is the exact syntax? I tried e.g. ['$', 'payload'].


The reading of metadata currently only works for Tuple, not for KeyValue.

Another topic: Is there a way to measure how fast the sender operator is, or what the latency for the full processing (source --> sink) is?


No, there is no way to measure the sender as this is a sink operator.

As far as I understood, the data rate before the sender is (nearly) the data rate the sender is able to process. If the data rate of the previous operators were higher, the data would be buffered internally.
Without going too deep, is that correct?


Yes, you can assume that the data rate directly before the sender is the rate at which the sender can process the data. Of course, if the sender itself or the transport protocol has a buffer, the processing will appear faster for a while, e.g. until a TCP queue is filled.
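
As a toy illustration (plain Python, nothing Odysseus-specific; the buffer size and per-tuple cost are made up): once the bounded buffer in front of a slow consumer is full, the producer is throttled, and the sustained rate converges to the consumer's processing rate.

Code: Select all

import queue
import threading
import time

buf = queue.Queue(maxsize=100)  # bounded buffer, like a filled tcp queue

def consumer():
    # the "sender": can handle roughly 1000 tuples per second
    while buf.get() is not None:
        time.sleep(0.001)

t = threading.Thread(target=consumer)
t.start()

start = time.monotonic()
n = 2000
for i in range(n):
    buf.put(i)  # blocks as soon as the buffer is full -> producer is throttled
buf.put(None)   # poison pill to stop the consumer
t.join()

print(n / (time.monotonic() - start), "tuples/s")  # ~ the consumer's rate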

