"What I should have done": Reflection on engineering in research

2024. 12. 21. 15:29 | US PhD Study Abroad

In this post, I will reflect on myself and my research process from a specific perspective: "what I should have done". This reflection should be useful. If I can avoid the same mistakes or better optimize the research trajectory in my next projects, I can be more productive and less exhausted, and eventually make more consistent progress and work more sustainably.

In this post, I will focus on the engineering part.

 

Engineering can be, both surprisingly and unsurprisingly, the most time-consuming and demotivating part of the research process. Unsurprisingly, because engineering always takes time; surprisingly, because engineering itself is often not the goal of research (though it is necessary). Hence, if you can find a faster way to implement the idea, you should take that path, unless it is incorrect or it hurts the design or future productivity.

This is (again) surprisingly hard, even though it sounds straightforward. Often, it is not clear whether the current implementation plan is good or not, and it is hard to judge. Sometimes you don't even know that you have to judge. To judge properly, you would have to evaluate every engineering decision before you execute it, which is not very feasible. Remember that the implementation itself is not the goal of your research project. Even if you make a rational judgement based on your knowledge and take the path that you 'believe' is the fastest, it might turn out to be a very bad decision that consumes a lot more time or produces something completely useless.

 

Example: workload generator (the least important, the most time-consuming)

Surprisingly, I spent far too much time on the workload generator. I used four different workload generators: three open-source tools and my own implementation.

 

wrk2: It does not give you per-request results; it only gives you detailed latency percentiles. Without per-request results, it is hard to debug and impossible to plot a fancy timeline graph.
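To illustrate why per-request results matter, here is a minimal sketch. It assumes each request is logged as a hypothetical (send time in seconds, latency in milliseconds) pair, which is not the output format of any particular tool. With such records, a latency timeline is a small grouping step; with only aggregate percentiles, it cannot be reconstructed.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_values, pct):
    """Return the pct-th percentile of a non-empty sorted list (nearest-rank method)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_timeline(records, bucket_s=1.0, pct=99):
    """Group (send_time_s, latency_ms) records into time buckets.

    Returns {bucket_start_s: percentile_latency_ms}, ordered by time.
    """
    buckets = defaultdict(list)
    for send_time_s, latency_ms in records:
        bucket_start = math.floor(send_time_s / bucket_s) * bucket_s
        buckets[bucket_start].append(latency_ms)
    return {
        start: nearest_rank(sorted(values), pct)
        for start, values in sorted(buckets.items())
    }


# Hypothetical per-request log: (send time in seconds, latency in milliseconds).
records = [(0.1, 10.0), (0.5, 12.0), (0.9, 11.0), (1.2, 50.0), (1.7, 300.0)]
timeline = latency_timeline(records, bucket_s=1.0, pct=99)
```

Here the second bucket exposes the 300 ms request at the time it happened, which is the kind of detail that makes debugging possible.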

 

My own workload generator (named client): perfect, except that it was not resource-efficient. It used one thread per request, so when the app couldn't handle the load, a lot of threads hung and exhausted CPU resources.

This especially caused resource exhaustion when we were benchmarking our own file-write application, which is slow.
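One common way to avoid the thread-per-request problem is to send requests from a single thread and cap how many are in flight. The sketch below is only an illustration under assumptions, not the client I actually wrote: `fake_slow_request` is a hypothetical stand-in for a real HTTP call, and dropping requests once the cap is reached is just one possible policy.

```python
import asyncio


async def fake_slow_request(request_id, service_time_s):
    """Hypothetical stand-in for a real request; the slow app needs service_time_s to reply."""
    await asyncio.sleep(service_time_s)
    return request_id


async def run_load(num_requests, interval_s, service_time_s, max_in_flight):
    """Send requests at a fixed interval on one thread, with a cap on in-flight requests.

    Returns (number of completed requests, number of dropped requests).
    """
    tasks = []
    in_flight = 0
    dropped = 0

    def on_done(_task):
        nonlocal in_flight
        in_flight -= 1

    for request_id in range(num_requests):
        if in_flight >= max_in_flight:
            # The app is not keeping up: record a drop instead of adding more waiters.
            dropped += 1
        else:
            task = asyncio.create_task(fake_slow_request(request_id, service_time_s))
            task.add_done_callback(on_done)
            tasks.append(task)
            in_flight += 1
        await asyncio.sleep(interval_s)

    results = await asyncio.gather(*tasks)
    return len(results), dropped


completed, dropped = asyncio.run(
    run_load(num_requests=20, interval_s=0.0, service_time_s=0.05, max_in_flight=5)
)
```

Dropped requests change the load the app actually sees, so they should be counted and reported rather than silently ignored.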

 

Locust: It didn't support dynamically changing the RPS for different headers during an experiment. I tried to work around this by running multiple Locust processes, one per header workload (west, east, ...). But its internal implementation uses a master thread with a static TCP port, so the two processes conflicted with each other.

I evaluated it before I fully transitioned to it, but I couldn't fully figure out what it supports and what it does not. It didn't support dynamic load for requests with different headers: it supports only a static ratio between headers, not a separately configured RPS for each header.
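To make the requirement concrete, here is a minimal sketch of what "a configured RPS per header that changes during the experiment" means. The schedule format, the numbers, and the function name are all hypothetical; this does not use or describe any Locust API.

```python
import bisect

# Hypothetical schedule: for each header value, (start_time_s, rps) steps sorted by time.
SCHEDULE = {
    "west": [(0, 100), (60, 500), (120, 100)],
    "east": [(0, 200), (90, 50)],
}


def rps_at(schedule, header, t_s):
    """Return the configured RPS for `header` at time t_s (seconds since start)."""
    steps = schedule[header]
    start_times = [start for start, _ in steps]
    idx = bisect.bisect_right(start_times, t_s) - 1
    if idx < 0:
        return 0  # before the first step, no load is configured
    return steps[idx][1]


# The west:east ratio changes over time, which a single static ratio cannot express.
ratio_at_30s = rps_at(SCHEDULE, "west", 30) / rps_at(SCHEDULE, "east", 30)
ratio_at_100s = rps_at(SCHEDULE, "west", 100) / rps_at(SCHEDULE, "east", 100)
```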

 

Vegeta: It does not support exponential inter-arrival times, but it checks all the other boxes, so we ended up using it.
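For reference, exponential inter-arrival times mean that the gaps between consecutive requests are drawn from an exponential distribution with mean 1/RPS (a Poisson arrival process). A minimal sketch of generating such a send schedule follows. It is independent of any workload generator, and the function name and parameters are hypothetical.

```python
import random


def poisson_send_times(rps, duration_s, seed=None):
    """Return send times (seconds) whose gaps are exponentially distributed with mean 1/rps."""
    rng = random.Random(seed)
    send_times = []
    t = rng.expovariate(rps)  # expovariate takes the rate, so the mean gap is 1/rps
    while t < duration_s:
        send_times.append(t)
        t += rng.expovariate(rps)
    return send_times


# About rps * duration_s requests on average, with irregular gaps between them.
send_times = poisson_send_times(rps=100, duration_s=10, seed=0)
gaps = [b - a for a, b in zip(send_times, send_times[1:])]
```

Compared with a constant gap of 1/RPS, this produces occasional bursts, which can change the observed latency even at the same average rate.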

 

The problem was that it took multiple months to reach Vegeta.

Different workload generators also change the latency profile a lot. Whenever we switched to a new workload generator, we needed to profile again and, if something went wrong, debug again. The entire experiment script, which has 1000 lines of Python code plus more bash scripts, also had to be updated, which was A LOT OF UNNECESSARY work. (While writing this, I still feel so bad about all the hours that I put into it.)

The first "What I CLEARLY shouldn't have done" was implementing my own workload generator. It is not at all straightforward to implement a resource-efficient workload generator. I thought I could do it with a little help from ChatGPT. I was wrong, and I didn't know what I was doing.

 

"What I should have done" was just using wrk2. 

 

+Lesson: Avoid implementing something yourself whenever possible.

 

Example: Thinking vs Implementing something

Thinking too much is also harmful. If you already know what to implement, especially something basic that comes before the final algorithm, you should just start implementing it. Once the design appears to be outlined, focus on "how" to implement it. This requires switching your mental model. Different ways of implementing the design may come to mind, and you then need to evaluate two things in order. The first is: "Does it 'comply' with your design? Is it actually the 'correct' implementation of your design?" The second is: "Will it create problems in the future, even if it does not now?" The problems you need to consider are scalability and extensibility. Scalability means: when the system scales out (more nodes, bigger and more complicated benchmarks, higher load), will it still work? Extensibility means: when you want to add more features, will it be easy? These are no different from the engineering considerations of a non-research project.

 

 

However, there are cases where you want to spend as much time as needed before executing. When the way something is implemented affects the design, it should be evaluated more carefully, with enough time. That will eventually save time. Surprisingly, this case is more common than the case where executing fast is better. For example, the workload generator case above needed more evaluation (thinking = evaluation). Plotting sometimes needs more thought too, when the plot script generates plots that will go into the paper: for example, the order of the legend, the colors, and the markers, because you will want all the plots to be consistent.
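For the plotting case, one way to keep every figure consistent is to decide the legend order, colors, and markers once and have every plot script read them from the same place. This is only a sketch: the series names are hypothetical, and the colors and markers are standard matplotlib values.

```python
# Canonical legend order, shared by every plot script (hypothetical series names).
SERIES_ORDER = ["baseline", "method_a", "method_b"]

# One style per series, so the same series looks the same in every figure.
SERIES_STYLE = {
    "baseline": {"color": "tab:gray", "marker": "o", "label": "Baseline"},
    "method_a": {"color": "tab:blue", "marker": "s", "label": "Method A"},
    "method_b": {"color": "tab:orange", "marker": "^", "label": "Method B"},
}


def ordered_series(names):
    """Return the given series names sorted into the canonical legend order."""
    unknown = set(names) - set(SERIES_ORDER)
    if unknown:
        raise KeyError(f"no style defined for: {sorted(unknown)}")
    return [name for name in SERIES_ORDER if name in names]


def style_for(name):
    """Return a copy of the plot keyword arguments for one series."""
    return dict(SERIES_STYLE[name])


# Whatever order the results arrive in, the legend order stays the same.
plot_order = ordered_series(["method_b", "baseline"])
```

In a matplotlib script, this could be used as `ax.plot(x, y, **style_for(name))` while looping over `ordered_series(...)`.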

Example where more thought was needed: implementing the jumping routing algorithm in the data plane vs. the control plane.

We started by implementing the jumping routing algorithm in the data plane. It was more straightforward and easier to implement. However, implementing jumping in the data plane makes the algorithm distributed. What we needed was a centralized jumping algorithm that minimizes latency globally, whereas the data plane implementation made decisions from a local view. This is a design choice. Since what we wanted was a centralized design, we should have implemented only the control plane version. We stopped the data plane implementation shortly after and moved to the control plane implementation. Ideally, we would not have needed the data plane implementation in the first place.