Development Retrospective
Background
The main retrospective was already getting long, so I split this part out into its own post.
Development Case Retrospective
Contents
The topics below are already touched on in the retrospective; here I want to write freely, following my train of thought without any particular format.
- The importance of generalization and standardization.
- Traces must be left somewhere.
- Technology is not everything.
- Resources cannot be completely trusted.
- What you worry about can happen someday.
- Distinguishing between DEV, STG, and PROD is important.
- Tests are very important.
- Don’t trust people.
- Even when busy, you must do what’s necessary.
- Context switching takes a lot of time.
- The policies of high-level organizations and the circumstances at the time.
- You need to maintain physical strength and stay mentally focused.
The importance of generalization and standardization.
Since joining the company’s Big Data Center, I have been working on projects related to company-wide data. One of the most significant realizations I’ve had here is the importance of generalization and standardization; their opposite is fragmentation.
As I became part of an enterprise-level organization, I was given the task of building a system to collect data across the company. Before proceeding, I reviewed previous projects and noted the challenges that had been encountered. A major issue was the lack of consistency: data collection methods and formats differed across services, and these discrepancies had never been unified. While reading this, you might ask, “Why?” There were likely reasons for it, even if I don’t know all of them. I’ll share more thoughts on this later.
Through my experience, I’ve learned that when standardization and generalization are not achieved, fragmentation occurs. As a service grows, managing and maintaining it becomes increasingly difficult. I believe that standardization and generalization are essential as services expand.
For leaders or managers, it’s not enough to simply complete a project; they should aim to generalize as much as possible to ease the workload for the people doing the work and improve quality. If you simply accept every demand from the collaborating parties as-is, it becomes SI work rather than collaboration. Of course, there may be valid reasons to consider such requests, but unless there is truly no alternative, I think it’s better to postpone or adjust them rather than apply everything.
For small services that are easy to manage, say fewer than ten systems, fragmentation may not pose much of an issue. However, once a service exceeds this scale and takes on the characteristics of a platform, generalization and standardization become crucial. For a large platform, I believe it is appropriate for a higher-level department to set the direction for generalization and standardization.
Traces must be left somewhere.
If the person who developed a specific feature were to manage it for life, this discussion might not be necessary. However, unlike programs, people can change at any time. This has made me realize firsthand the importance of documenting why certain tasks were done. There are various ways to do this: organizing it in a document, or leaving detailed comments and descriptive function names in the code. While some argue that comments are not ideal because they compensate for what the code fails to express, if something cannot be conveyed through code, leaving a comment is still necessary. This can save time by avoiding the need for a new person to ask, “Why is this done this way?” and having to track down someone to explain. It also allows situations to be handled even when the person in charge is unavailable.
Expressing intent as much as possible in the code and documenting it elsewhere when that’s not feasible is crucial for team efficiency.
This year, there were times when I was so busy that exactly this kind of situation came up. Having experienced it myself, I’ve now developed a habit of leaving records wherever possible.
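As a small illustration, here is a minimal sketch of leaving a trace directly in code; the function name, field, and reasoning are made up for this example, not taken from a real project. The name says what the code does, and the comment records the why that the code itself cannot express.

```python
def drop_events_without_service_id(events: list[dict]) -> list[dict]:
    """Remove events that cannot be joined to a service."""
    # Why: events collected before the schema was standardized have no
    # "service_id" field, so downstream joins would silently lose them.
    # We drop them here once instead of patching every consumer.
    return [event for event in events if "service_id" in event]
```

A new team member reading this later doesn’t have to track anyone down to ask why certain events are being dropped.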
Technology is not everything.
This is something I also documented earlier this year in my first-half retrospective, “(EN) The first half retrospective.” Since there haven’t been any major changes, I’ve brought it over as is.
It’s not all about technical skills. Something that could be resolved in 10 minutes technically might take 10 days because of emotional issues. In one case, work dragged on because a new process was introduced on the service side even though development was already finished. In another case, the owner told us they could not cooperate, so we asked a higher-level manager to step in, and after that the issue was resolved. It seems that sometimes problems can be solved not through development but through communication with people.
Resources cannot be completely trusted.
It could be due to my lack of experience in developing large-scale services, but I’ve noticed a tendency to trust that resources will be sufficient while developing. For example, I once assumed there was enough memory when writing code, only to encounter an OOM (Out of Memory) error. In another instance, an OOM error occurred because options had been set too high. Although I’ve used memory as the example, I’ve also faced issues with storage, as well as load-related problems on the network and database as the service grew.
For instance, I experienced OOM errors while running Pods, and bottlenecks caused by frequent database access in Airflow. There were also cases where Pods crashed because options had been set too large during infrastructure setup. Additionally, responses were either too slow or too large, causing Swagger to become unresponsive. As the service scaled, there were several instances where we hit AWS quotas and had to request increases from AWS.
While there may not be enough bandwidth to worry about memory in the early stages of development, it’s crucial to keep in mind that problems can arise as the service grows. Whenever possible, it’s better to anticipate and prepare for these issues in advance.
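As one small, hedged example of not trusting memory (the file name and chunk size below are placeholders): processing a large input in chunks keeps peak memory bounded instead of assuming the whole dataset will fit.

```python
import pandas as pd

row_count = 0
# Read the file in chunks so peak memory stays bounded even if the file
# grows far beyond what a single read_csv() call could comfortably hold.
for chunk in pd.read_csv("collected_events.csv", chunksize=100_000):
    row_count += len(chunk)

print(f"processed {row_count} rows")
```

The same idea applies to storage, network, and database load: design as if the resource has a limit, because eventually it does.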
What you worry about can happen someday.
There were moments when I thought, “Ah… this could become a problem later,” and those concerns did, in fact, happen.
For instance, there was a part of our cloud setup that automatically generates resources. Initially, I excluded it more broadly than necessary and modified it later. There were comments that it seemed oddly configured, it wasn’t documented, and the issue I had been worried about eventually occurred. In another case, we initially managed infrastructure setup values as a single file. As the service grew, deployments became slower and management became increasingly difficult. Eventually, a team member took the initiative to revise and improve the setup, although it required significant effort.
This may seem obvious and like something everyone already knows, but I thought it was worth emphasizing again. It’s better to address these concerns in advance if possible. Of course, it’s easier said than done in practice… haha. Finding the right balance is, of course, necessary.
Distinguishing between DEV, STG, and PROD is important.
As the number of components I developed increased, I came to realize firsthand the importance of separating environments.
While testing, there were cases where my work conflicted with parts developed by others, or issues arose during integration testing. These problems occurred because the DEV and STG environments partially overlapped.
Although this may seem obvious, I wanted to note it down as it’s something that can happen in new project environments.
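A minimal sketch of what keeping environments separate can look like in practice (the environment variable, hosts, and settings below are purely illustrative): each environment gets its own configuration, selected explicitly, so DEV work cannot quietly touch STG or PROD resources.

```python
import os

CONFIGS = {
    "dev":  {"db_host": "db.dev.internal",  "topic": "events-dev"},
    "stg":  {"db_host": "db.stg.internal",  "topic": "events-stg"},
    "prod": {"db_host": "db.prod.internal", "topic": "events-prod"},
}

# Each environment sets APP_ENV itself, so the same code never points at
# another environment's database or topic by accident.
env = os.environ.get("APP_ENV", "dev")
config = CONFIGS[env]
```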
Tests are very important.
This year, I was responsible for testing and CI/CD tasks, and I deeply felt the difference in quality and the level of anxiety between having tests and not having them.
Having unit tests and integration tests provided significant peace of mind during deployments.
I found it incredibly appealing how tests not only improve quality but also reduce anxiety for everyone involved.
While I understand the argument that there’s no time to write tests because of a busy schedule, I still firmly believe that writing tests is essential. They are invaluable when it comes to future development.
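Even a tiny unit test buys a surprising amount of peace of mind. Below is a minimal pytest-style sketch; the function under test is a hypothetical helper, not from an actual codebase.

```python
def normalize_service_name(name: str) -> str:
    # Collapse whitespace and lowercase so the same service is counted once.
    return " ".join(name.split()).lower()


def test_normalize_service_name():
    assert normalize_service_name("  My  Service ") == "my service"
    assert normalize_service_name("MY SERVICE") == "my service"
```

Once a test like this runs in CI/CD, every future deployment gets that check for free.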
Don’t trust people.
It might be misunderstood, but this is not about saying that people are untrustworthy. It’s about the fact that people can make mistakes, so we shouldn’t rely solely on them.
During a collaboration, there was a case where someone promised to handle tasks manually every day, but eventually issues arose, causing a chain reaction of stress. Recently, there have also been instances where data wasn’t provided, or was given in a different format, when it needed to be collected. In such cases, I believe it’s better to build a system that allows the team to access the database directly.
In the end, everything should be automated. It’s better for everyone if we systematize tasks for mutual convenience. For example, when building a data pipeline, a single team should be able to monitor everything from direct database access to data loading, which reduces the need for communication. Another example: when people manually enter passwords or other information, errors can occur, so there should be a system that lets the person entering the data validate it themselves. Lastly, even if people perform all the tests, the final verification should be done by machines so that quality is reliably guaranteed.
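For the manual-entry example, here is a hedged sketch of what “let the system validate it” might look like (the required fields are made up): the person submitting the data sees the problems immediately, instead of a pipeline failing on them days later.

```python
REQUIRED_FIELDS = {"service_id", "event_time", "payload"}


def validate_record(record: dict) -> list[str]:
    # Return every problem at once so the submitter can fix them in one pass.
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if not str(record.get("event_time", "")).strip():
        errors.append("event_time is empty")
    return errors
```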
Even when busy, you must do what’s necessary.
Some of this overlaps with what I mentioned earlier. This is also a matter of compromise, and it can vary depending on the situation, but for critical areas like testing, it’s essential to make the time and actually do them.
During a migration, I once overlooked an option, which led to a situation where I had to handle large data volumes. I kept postponing it, and eventually, the data size grew too large, causing issues. If you have a strong suspicion that something will definitely happen or is something that needs to be done, even if you’re busy, it’s better to address it right away. This was something that also involved the service side, so it should’ve been handled in advance.
Again, I fully understand that, as mentioned, it’s not easy.
Context switching takes a lot of time.
The “context switching” referred to here means a change in tasks: the work you’re assigned could change, or the tasks you need to complete may shift.
This year, frequent interruptions caused me to switch tasks often, making it hard to focus and difficult to get back into the flow when I returned to a task.
In the first half of the year, we reorganized our tasks, and starting new work in the middle of a busy period took a lot of time to get up to speed. When tasks change, it seems to take a lot of time to learn everything again. So, if such changes are planned, it’s a good idea to leave enough buffer time.
The policies of high-level organizations and the circumstances at the time.
When working, I often ask myself, “Why was it made like that?” This question doesn’t come from arrogance, but rather from the feeling that things were made in a somewhat roundabout way. This is similar to the fragmentation I mentioned earlier. When I talked to the people in charge at the time, I found that there were reasons for everything. Sometimes the team lacked the necessary organizational clout, and decisions had to be made to accommodate political or power-related issues. There were also situations where it was a matter of, “If you don’t do it this way, we won’t cooperate.” The requirements were constantly changing, there was no standardization, and there were many other reasons. In situations like this, external factors beyond development can influence the direction of the project, making things harder. While there may have been areas where engineers missed or couldn’t address something, external influences played a large role as well.
When my team moved into a larger, enterprise-wide organization and a powerful leader within it publicly announced a change, it became something everyone had to follow, and everyone responded. Even with strong technical capabilities, nothing would have moved forward without support from the organization at the enterprise level. From this experience, I learned that for real progress, policy needs to come from a strong organization, and that this is the right approach. It’s much more efficient for the CEO to give a directive to the whole company than for a director-level executive to try to negotiate.
You need to maintain physical strength and stay mentally focused.
Physical stamina is incredibly important, and no matter how much it’s emphasized, it still can’t be stressed enough. You need stamina to stay focused; when your energy drops, your concentration decreases and your efficiency in any task suffers.
In the past, even if issues arose, they weren’t a huge problem. But now there are many services and users, so mistakes are unacceptable and I have to stay focused every time I work.
When issues arise and the impact is significant, you can’t afford to make mistakes, so it’s important to stay alert. And to maintain that focus, you need physical stamina.
Conclusion
All of this reflects what I’ve felt this year. I wanted to go into more detail and share more specific examples to make it more relatable, but since many of the stories are related to the company, I couldn’t elaborate as much as I would have liked. Even so, I believe there are many relatable examples in the headings and the content. Of course, these situations can vary depending on the context, so it’s important to think flexibly and respond accordingly; these are not definitive answers.
While my body and mind went through some tough times, I try to think positively because there are still several things I’ve learned.