
Setting Up a Three-Tier Architecture on AWS

Today, I will demonstrate how to set up a three-tier architecture on Amazon Web Services (AWS) using the Virtual Private Cloud (VPC) service. Below is an architecture diagram illustrating the setup, which is divided into three tiers.

The first tier is the presentation tier, where users access the public subnets directly through an internet gateway. The second tier is the logic tier, which handles the business logic. It sits in private subnets to restrict access and is placed behind a load balancer, which can scale flexibly and horizontally to handle traffic demands that vary over time. The third tier is the data tier, which contains a MySQL database in a private subnet; it can only be accessed through the second tier. For higher availability, I deploy the architecture across two Availability Zones and replicate the database to the second zone. This ensures the application keeps serving even if one Availability Zone fails.

First, I will create a VPC named victorleungtwdemo. Following the architecture diagram, I will choose the CIDR block 172.17.0.0/16. This /16 network gives me 65,536 IP addresses, leaving plenty of room for future expansion.
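If you prefer to script these steps rather than click through the console, here is a minimal boto3 sketch of the same VPC creation. The region is an assumption; the name and CIDR block come from this walkthrough.

```python
import boto3

# Region is an assumption for this sketch.
ec2 = boto3.client("ec2", region_name="ap-southeast-1")

# Create the VPC with the /16 CIDR block from the architecture diagram.
vpc = ec2.create_vpc(CidrBlock="172.17.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Tag it with the name used in this walkthrough.
ec2.create_tags(Resources=[vpc_id],
                Tags=[{"Key": "Name", "Value": "victorleungtwdemo"}])
print(vpc_id)
```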

Next, I will create six subnets. The first is named pub-subnet-1. I associate it with the VPC I just created and choose an appropriate Availability Zone (Zone A). I also specify the subnet's IP address range; to leave room for growth, I make it a /24 subnet.

Continuing this process, I create the remaining five subnets. As shown in the diagram below, I now have six subnets spread across different Availability Zones.
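A sketch of creating all six subnets with boto3 follows. The exact CIDR slices, zone suffixes, and the database subnet names are assumptions, since the walkthrough only fixes the /24 size and the first subnet's zone.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
vpc_id = "vpc-0123456789abcdef0"  # placeholder: the VPC created above

# Hypothetical layout: two public, two private, and two database subnets
# spread across two Availability Zones, each a /24 slice of 172.17.0.0/16.
subnets = [
    ("pub-subnet-1",  "172.17.1.0/24", "ap-southeast-1a"),
    ("pub-subnet-2",  "172.17.2.0/24", "ap-southeast-1b"),
    ("priv-subnet-1", "172.17.3.0/24", "ap-southeast-1a"),
    ("priv-subnet-2", "172.17.4.0/24", "ap-southeast-1b"),
    ("db-subnet-1",   "172.17.5.0/24", "ap-southeast-1a"),
    ("db-subnet-2",   "172.17.6.0/24", "ap-southeast-1b"),
]

for name, cidr, az in subnets:
    resp = ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)
    subnet_id = resp["Subnet"]["SubnetId"]
    ec2.create_tags(Resources=[subnet_id], Tags=[{"Key": "Name", "Value": name}])
    print(name, subnet_id)
```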

Next, I will create a new internet gateway named victorleungtw-igw.

Once it is created, I attach it to my victorleungtwdemo VPC.
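Scripted, this step looks roughly like the following, with the VPC ID carried over as a placeholder from the earlier step:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
vpc_id = "vpc-0123456789abcdef0"  # placeholder: the VPC created earlier

# Create the internet gateway, tag it, and attach it to the VPC.
igw = ec2.create_internet_gateway()
igw_id = igw["InternetGateway"]["InternetGatewayId"]
ec2.create_tags(Resources=[igw_id],
                Tags=[{"Key": "Name", "Value": "victorleungtw-igw"}])
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
```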

Now, let's look at the route tables. When a VPC is created, a default route table is generated automatically, and all of the subnets I created are associated with it by default.

Then, I will create a new route table named pub-route, which will manage routing to the public internet. I will also rename the original route table to priv-route. For the private subnets that will reach the internet through a NAT gateway, I will create another route table named nat-route.

At this point, I have three route tables, each with a default local route.

In the pub-route table, I add a route for 0.0.0.0/0 pointing to victorleungtw-igw. This allows all machines in the associated subnets to reach the public internet.

Next, I associate my public subnets, pub-subnet-1 and pub-subnet-2, with the pub-route table.

Then, I set up nat-route by associating it with priv-subnet-1 and priv-subnet-2.

Finally, there is no need to do anything with the remaining route table; all other subnets stay associated with priv-route by default.
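Putting the route-table steps together in boto3 (all IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
vpc_id = "vpc-0123456789abcdef0"                # placeholders from earlier steps
igw_id = "igw-0123456789abcdef0"
pub_subnet_ids = ["subnet-aaa", "subnet-bbb"]   # pub-subnet-1, pub-subnet-2
priv_subnet_ids = ["subnet-ccc", "subnet-ddd"]  # priv-subnet-1, priv-subnet-2

# pub-route: 0.0.0.0/0 goes to the internet gateway.
pub_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_tags(Resources=[pub_rt], Tags=[{"Key": "Name", "Value": "pub-route"}])
ec2.create_route(RouteTableId=pub_rt, DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw_id)
for subnet_id in pub_subnet_ids:
    ec2.associate_route_table(RouteTableId=pub_rt, SubnetId=subnet_id)

# nat-route: associated with the private application subnets now; its
# 0.0.0.0/0 route is added once the NAT gateway exists (next step).
nat_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_tags(Resources=[nat_rt], Tags=[{"Key": "Name", "Value": "nat-route"}])
for subnet_id in priv_subnet_ids:
    ec2.associate_route_table(RouteTableId=nat_rt, SubnetId=subnet_id)

# The remaining subnets stay on the VPC's default table, renamed priv-route.
```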

Now, let's continue by creating a Network Address Translation (NAT) gateway. I will place it in pub-subnet-2, since a NAT gateway must sit in a public subnet to reach the internet, and create an Elastic IP (EIP) for it.

I then add a route for 0.0.0.0/0 in the nat-route table pointing to the NAT gateway.
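A boto3 sketch of this step follows; the subnet and route-table IDs are placeholders, and note that the gateway itself sits in a public subnet:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
public_subnet_id = "subnet-bbb"      # placeholder: pub-subnet-2
nat_rt_id = "rtb-0123456789abcdef0"  # placeholder: the nat-route table

# Allocate an Elastic IP and create the NAT gateway in a public subnet.
eip = ec2.allocate_address(Domain="vpc")
nat = ec2.create_nat_gateway(SubnetId=public_subnet_id,
                             AllocationId=eip["AllocationId"])
nat_id = nat["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Point 0.0.0.0/0 in nat-route at the NAT gateway so the private subnets
# get outbound internet access without being reachable from outside.
ec2.create_route(RouteTableId=nat_rt_id,
                 DestinationCidrBlock="0.0.0.0/0",
                 NatGatewayId=nat_id)
```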

At this point, about 70% of the network architecture is complete. Next, I will configure the relevant security group settings.

For security, I will create separate security groups for my bastion host, load balancer, web servers, and database.
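A hedged sketch of those four groups with boto3; the admin IP and the exact port choices are assumptions:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
vpc_id = "vpc-0123456789abcdef0"  # placeholder

def sg(name, desc):
    # Helper: create a security group in the VPC and return its ID.
    return ec2.create_security_group(GroupName=name, Description=desc,
                                     VpcId=vpc_id)["GroupId"]

bastion_sg = sg("bastion-sg", "Bastion host")
lb_sg = sg("lb-sg", "Load balancer")
web_sg = sg("web-sg", "Web servers")
db_sg = sg("db-sg", "MySQL database")

# Bastion: SSH only from a trusted admin IP (the IP is an assumption).
ec2.authorize_security_group_ingress(GroupId=bastion_sg, IpPermissions=[
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "203.0.113.10/32"}]}])

# Load balancer: HTTP open to the internet.
ec2.authorize_security_group_ingress(GroupId=lb_sg, IpPermissions=[
    {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

# Web servers: HTTP only from the load balancer, SSH only from the bastion.
ec2.authorize_security_group_ingress(GroupId=web_sg, IpPermissions=[
    {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
     "UserIdGroupPairs": [{"GroupId": lb_sg}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "UserIdGroupPairs": [{"GroupId": bastion_sg}]}])

# Database: MySQL only from the web tier.
ec2.authorize_security_group_ingress(GroupId=db_sg, IpPermissions=[
    {"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
     "UserIdGroupPairs": [{"GroupId": web_sg}]}])
```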

Now it is time to launch the EC2 instances.

Before creating the RDS server, I will also create an RDS subnet group.

With the RDS server set up with the appropriate VPC and security groups, we can return to our EC2 instances.

Next, I will set up a target group for the web servers and add them to it.

Finally, I will create an Application Load Balancer.
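The target group and ALB steps can be scripted like this (all IDs are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="ap-southeast-1")
vpc_id = "vpc-0123456789abcdef0"              # placeholders from earlier steps
pub_subnet_ids = ["subnet-aaa", "subnet-bbb"]
lb_sg_id = "sg-0123456789abcdef0"
web_instance_ids = ["i-aaa", "i-bbb"]

# Target group for the web servers, with a simple HTTP health check.
tg = elbv2.create_target_group(Name="web-tg", Protocol="HTTP", Port=80,
                               VpcId=vpc_id, TargetType="instance",
                               HealthCheckPath="/")
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]
elbv2.register_targets(TargetGroupArn=tg_arn,
                       Targets=[{"Id": i} for i in web_instance_ids])

# Internet-facing ALB across both public subnets, forwarding HTTP to the group.
alb = elbv2.create_load_balancer(Name="web-alb", Subnets=pub_subnet_ids,
                                 SecurityGroups=[lb_sg_id],
                                 Scheme="internet-facing", Type="application")
alb_arn = alb["LoadBalancers"][0]["LoadBalancerArn"]
elbv2.create_listener(LoadBalancerArn=alb_arn, Protocol="HTTP", Port=80,
                      DefaultActions=[{"Type": "forward",
                                       "TargetGroupArn": tg_arn}])
```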

With everything set up, we can now test the system. From the bastion host, I can SSH into both web servers, start an Nginx server, and verify that the application tier can access the database.

That concludes this guide to setting up a three-tier architecture on AWS. If you have any questions, please feel free to contact me.

Stories of my career

I started my career as an Assistant Marketing Manager in Brisbane, but a year later I took a big risk and moved back to Hong Kong to start my own company with all of my savings from that year. I saw an opportunity to build a website that allowed users to find restaurants nearby.

I was able to program, since I had taken classes in Pascal and Java back at university. However, I had limited knowledge of HTML, CSS, and JavaScript, so I had to pick them up in order to build the website for my startup. I learned a lot of technical skills, using Meteor.js and connecting to a MongoDB database. It was difficult, since it was only me coding all day and night. I wrote a business proposal and submitted it to the government for funding. However, after a month of work, only five restaurant owners had signed up for the website, with total revenue of a few hundred dollars. It failed because there was no way to attract more customers, and we were running out of funds quickly.

If I were to do it differently, I would learn about the lean startup process. I did not know what an MVP (Minimum Viable Product) was at the beginning, and I built a lot of cool features, such as real-time maps and geolocation for finding restaurants nearby. It turned out nobody used them at all, as they were not important features. Most of the website traffic came from Google search, so I should have focused on SEO instead of geolocation search at that time. I would instead use the agile Scrum methodology and iterate on the product, getting feedback from customers and prioritising features based on that feedback.

I learned a lot of technical skills and business knowledge. I built the product full stack by myself using JavaScript, and I was able to find a job in software engineering after the failure of my own business.

Later on, since my startup had failed, I found a full-time job as a software engineer at an Australian consulting firm, Industrie IT. My first client was Riot Games, maker of League of Legends, which is owned by Tencent. The project was a redesign of the in-game store where players buy skins for champions. I was the sole front-end developer responsible for the implementation, and I committed in the sprint to delivering a feature: a CSS animation on the purchase button.

I underestimated the complexity of this feature and promised to finish it within one week. However, there were risks, because the animation design and assets were provided by a designer. We were working in different time zones: I was based in Hong Kong and the designer in Latin America. There was little overlap between our working hours, and the designer's progress was delayed because the product owner could not decide which animation worked better.

One day before the sprint-end demo session, I finally received the animation file. It was provided as a Photoshop file, which I did not have a licence to open. The animation format was also not compatible with the legacy browser, the Safari 4 in-app browser: the latest CSS syntax was not supported, and the animation did not render correctly. There was no way I could demo the finished animation, given that I was stuck on these technical issues. What I did was raise the blocker in the stand-up meeting as a status update, so that it became visible to the team that I was blocked. During the sprint demo session, I was still able to demo one button with the animation implemented, even though I had not completed all of them. It was valuable to get feedback from the stakeholders on how one of the buttons would look and could be improved. In the sprint retrospective, I criticised myself for underestimating the effort and promised to improve in the upcoming sprint.

As a result, the stakeholders were able to provide valuable feedback on the progress, and at the same time they understood the effort that had gone into trying to deliver on time. Since the release was not urgent, the stakeholders were fine with the delay and appreciated my transparency about the status, rather than any attempt to hide the fault.

This project used an old Safari 4 in-app browser embedded in the Adobe platform. My task was to implement the new design and launch it to users in China. I made the mistake of not testing on all platforms, only on Mac and Windows 7. During the beta launch, customers using Windows XP started to complain that they got a blue screen of death from a computer crash.

What I did was investigate the root cause. It was not easy, because the Windows XP blue screen did not give much useful information about why the system failed, and it seemed to be a deep issue at the operating system level, which I was not an expert in. I asked questions on Stack Overflow, and commenters mocked me for supporting Windows XP, which they felt should be deprecated. However, the majority of internet cafes were still using Windows XP, so I had to support it.

Finally, I collected as much data as possible, including setting up a new virtual machine with Windows XP to replicate the issue. More importantly, I used a binary-search method, as in computer science: I commented out half of the code and checked whether it still crashed to narrow down which half was causing the problem, then halved again, until I found the root cause: a library called Font Awesome, used for rendering icons. The issue was then rectified by using PNG images instead of icon fonts, and I learnt the lesson to test my applications on different platforms.

For the second client, I was helping to implement mobile applications on both the iOS and Android platforms. It was an IoT project with an app that connects to luggage locks. In the beginning, we implemented both the iOS and Android apps at the same time. However, at the end of the sprint, the customer noticed differences in user experience between the two platforms, such as the back button. Also, the startup wanted investor funding, so they wanted the app ready for demos as soon as possible.

Therefore, I had to make a short-term sacrifice and focus on one platform first: iOS instead of Android. This was because the main customers would be Apple users, as the product targeted the high-end, luxury demographic. By sacrificing the Android platform in the short term, I was able to deliver the iOS app much faster, with more features, a better proof of concept, and a more polished user experience. Once the investment funding was raised, I was able to build the Android app at double the speed, since it avoided re-work and had a more stable set of requirements. The funds raised served the long-term goals and allowed the startup to sustain itself.

This client was a small startup building a Bluetooth lock. My task was to deliver the mobile applications that connect to it. One time the client expressed that he was not fully satisfied with the sprint-end release I delivered. The expectation was pixel-perfect: the colours had to match the design exactly, and the gradients and button animations needed to be flawless. I had not met the expectations, because I did not notice a few pixels of misalignment in the details.

I paid extra attention to the pixels and colours, using different tools to measure hex values and making sure the alignment was correct on different phone screen sizes. I reserved time and asked the designer to flag any issues she spotted. I then sat down with our designer and the client and explained the reasons for the few deviations from what the client had directly requested.

Although those minor changes took a lot of extra time and effort, they improved the product and met the client's expectations. The client thanked me for going the extra mile to make sure he got a perfect outcome. The startup had only two people. They were working on an IoT project, building luggage with a lock that connects to a mobile app via Bluetooth, and I was a software engineer helping them build the front-end application.

After the first version of the app was built and ready, the hardware side of the project was delayed. The product owner therefore started thinking about redesigning the app without even launching it to the market. He was not a designer but kept requesting design changes, and a couple of days later he would change his mind and request a different design, which led to inconsistency. I was frustrated by the frequent requirement changes and re-work.

What I did was coach the client on agile methodology. The gist was to launch something to the market and get feedback from customers, so that we could deliver and prioritise the main features. It is a common pitfall for new entrepreneurs to want to launch the perfect product with the best user experience; at the same time, this became the biggest psychological barrier stopping the product owner from launching anything. Instead, I introduced an A/B testing tool, so that I could build two versions of the app and collect user feedback to see which one was better.

As a result, I was able to get data on the design of the mobile app. I was no longer frustrated, because each change now had data to prove it was an improvement, rather than being based on the customer's indecisiveness. The customer was happy as well, because he knew which version was better and no longer worried about competing designs and re-work, focusing instead on the features that mattered most to his customers. Being apologetic and actively sympathetic is important when handling such situations, as is providing a solution as soon as possible.

Later on, I decided to move to a bigger consulting firm and changed jobs to work at Accenture, where the client was Cathay Pacific Airways. The project was a migration of a legacy hybrid app to native mobile, with better design and performance. It was a very tight timeline, because the licence for the old app on the Kony platform was expiring in a few months, and I had to finish re-writing all the logic on the new platform.

Near the old platform's licence expiry, we still had a lot of features unfinished. It was a hard deadline, even though we were running an agile Scrum methodology. In order to meet the tight timeline with the fixed scope, without sacrificing quality, I had to compromise and work overtime, on weekends and public holidays. I had planned a trip to Cambodia with my girlfriend, but I had to cancel the flights and hotel. As a result, I made sure one of my biggest customers met its year-end goal. The licence costs saved and the five-star reviews the app earned on the mobile app store made it all worth it.

Higher management put a lot of pressure on internal reporting. Many hours of productivity were lost to administration instead of being spent delivering value and features to the customer. I escalated the issue and asked higher management to follow the agile Scrum approach: they were welcome to join the stand-up and the sprint demo and retrospective to inspect progress. It would be an anti-pattern to impose waterfall-style reporting on top.

Eventually, the customer was happy with the progress, and higher management appreciated the customer satisfaction, reducing the internal progress-report meetings that had wasted time and productivity.

There was one important feature needed for customers to purchase flight tickets online from the mobile app. It needed to integrate with one of the booking engines, Amadeus. My task was to implement the online flight-ticket purchase without any bugs. However, I made a significant professional failure by using an open-source library to calculate decimal places. Most countries follow the ISO 4217 standard for currency precision, which ranges from zero to three decimal places. For example, USD amounts have two digits to the right of the decimal point, and JPY has none. I did not realise the problem when using the library, and it went through the QA process without any issues.

When the mobile app was launched to the public, Cathay Pacific Airways started to receive complaints that online purchases via the mobile app were charged ten times the original amount! Luckily it did not happen in all countries, only a few. It turned out that not all countries strictly follow the ISO standard. The Russian rouble, for example, was supposed to have 2 decimal places, while the system returned and calculated with 3.
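To illustrate the class of bug with a hypothetical sketch (this is not the actual library involved), amounts are safest kept in minor units, with the number of decimal places looked up per currency:

```python
from decimal import Decimal

# Hypothetical ISO 4217 exponent table: digits to the right of the decimal point.
CURRENCY_EXPONENT = {"USD": 2, "JPY": 0, "BHD": 3, "RUB": 2}

def to_major_units(amount_minor: int, currency: str) -> Decimal:
    """Convert an amount in minor units (e.g. cents) to a displayable amount."""
    exponent = CURRENCY_EXPONENT[currency]
    return Decimal(amount_minor).scaleb(-exponent)

print(to_major_units(12345, "USD"))  # 123.45
print(to_major_units(12345, "JPY"))  # 12345
# Using the wrong exponent (3 instead of 2) shifts every amount by a factor
# of ten, which is exactly the kind of overcharge described above.
```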

I immediately informed my manager about the issue and estimated the impact: customers paying 10 times the money for a flight ticket. I needed to communicate the technical issue to stakeholders, including the customer service department, so that they could handle customer queries. I also needed a plan to roll back the bug and patch the error, notify customers, and arrange a forced update of the mobile app, with apologies for the inconvenience caused.

As a result, the company had to refund the affected customers and apologise. What I learned from the situation is to pay extra attention to decimal places; the same issue has come up a couple of times since in my career in the fintech industry. More importantly, I should write more test cases and think through all the edge cases, such as different currency decimal places, to make sure the system works correctly. It is real customer money, and such a bug can lead to significant financial loss if the edge cases are not considered carefully. I was able to contain the problem and resolve it within one week. Some customers were affected in production and requested refunds and compensation. I learned to be extra careful with decimal places, and I took responsibility for resolving the problem.

In the next phase of my career, I switched jobs to work at EY; my client at the time was HSBC. It was difficult because the project involved many different vendors and stakeholders. Every day, instead of being timeboxed to 15 minutes, the stand-up meeting took more than an hour and drifted into a discussion session without any agenda. Then, one day, when there was no particular progress, the customer started yelling and blaming team members for not delivering. The customer turned the environment toxic instead of productive.

What I did was set up a 1:1 session and coach the client on why timeboxing a stand-up to 15 minutes is necessary. It is not a discussion session for random topics, and there is a business impact, given that the consultants were billing the bank a couple of thousand dollars per hour. I managed the customer by asking him questions and helping him find a better solution.

His first reaction was defensive. This was because he was quite senior, with many years of experience from before agile practice was common in the industry. Therefore, I had to coach him and help him figure out where this behaviour came from. It turned out the client was stressed about the delivery timeline and worried that the team could not deliver on time. The more he worried, the more time he spent chasing progress in the stand-up, and the less productive it became. Once I had helped him find the root cause, I suggested a better approach: schedule discussion sessions offline when necessary, and only include the relevant members. I also reminded everyone to physically stand up during the stand-up meeting, so that tired legs would naturally shorten unnecessary status updates.

The outcome was positive. The team appreciated the time saved to focus on writing software. The client was actually less stressed, because the team was able to deliver more. I was able to transform the toxic work environment into a healthier one, so that the project could eventually deliver on time. The project manager ended up becoming a great friend of mine, and appreciated my feedback in helping the team.

After the app was developed and launched, users were unhappy and left negative reviews on the app store. I had to investigate why. The mobile app had defects and deviated from the original design and product specifications. Why? Because development was outsourced to another country and things were lost in communication. Why? Because the developers were not native English speakers and did not fully understand the requirements. Why was this not caught earlier? Because there was no significant QA process. So I ensured quality by adding test cases and checking the app against the original user-story acceptance criteria. As a result, there were fewer bugs, and the app earned five-star reviews as everything worked smoothly with an improved user experience.

While working at EY, I also joined a hackathon and pitched an idea for a project to check suspicious Bitcoin transactions. Users are anonymous on the blockchain and can use it for illegal activities, such as money laundering and drug dealing. I built a proof of concept (POC) that attempted to flag wallet addresses with a high risk of suspicious activity. As a result, it was further developed into real-world applications. It was innovative because it helps with the regulation of the technology and facilitates the development of fintech.

Moving on, I switched jobs through a friend's referral and worked at Dynatrace. My difficult client interaction there was with the Hong Kong Jockey Club. I was a consultant helping them install a new software monitoring system in their data centres. There was only one week available to complete the installation, which was difficult because the client had a tight time window and could only make changes during the summer, when there is no horse racing.

It was my first time working in a physical data centre, where a security check was required and it was freezing cold inside. There were multiple racks of machines, and I had to remember which one I was installing on and plug in to it with the KVM (keyboard, video, mouse) system. After a week of successful installation across 16 machines in 2 data centres, and just before race days resumed, the security team performed hardening on the machines. Suddenly, at midnight, I received an urgent call requesting production support.

I took a taxi and rushed to the data centre, where I found my client in tears. It turned out the security team had removed the root users' access to the database, and due to a misconfiguration, the monitoring software no longer had the right user access to the database. What I did in this situation was calm my client down and start the re-installation. We were running out of time and had to re-install all 16 machines before race day. Therefore, I worked overnight and through the weekend, squeezing a week's workload into 3 days.

Thanks to the hard work, the monitoring software installation was completed before the horse racing days resumed. My client, who had been in tears, appreciated my going the extra mile, working weekends and midnights, to re-do all the installations.

My other big customer was Huawei. The project was to integrate the software monitoring product into their cloud environment. Many feature requests came in at the same time, before the SoW was even signed. I tried to help the customer as much as possible and took the lead in aggregating all the requirements. They wanted to customise the front-end UI; however, our product was English-only and had no internationalisation.

I helped raise a feature request to the engineering team in Europe. It was a difficult process, as they did not understand the necessity of customising the front-end UI. There was a lot of pushback; they said there was no capacity and it was not a priority. I escalated the issue, as Huawei is one of our biggest customers in the China region, got the attention of the CTO, and won his support for the feature request.

As a result, the feature was delivered, with a customisable UI. When Huawei cloud users launch a virtual machine, there is an option to include software monitoring, with a customised UI in Chinese. I listened with understanding and took the customer through the process step by step with patience and care, focusing on making sure they clearly understood and could use the product in Chinese.

In this Huawei project, I needed to integrate the software monitoring application into their cloud platform. I was a consultant sent to the Shenzhen on-site office, while my manager was in Australia and did not understand the customer. The client's request was to track the IP address of every user along with personal identity information, which I had to push back on, since it violates the GDPR (General Data Protection Regulation) constraints of the product.

I personally believe that there is always a solution to a problem. Regardless of how difficult it is, I will truly try my best to fulfil a customer's request, or at least meet them halfway. I would say no to a customer only if their request is simply not realistic, or expects me to directly break company policies. I asked the product engineering team to consider separate implementations: one for European customers who need to follow GDPR, and one for Chinese customers who operate under different regulations.

Consequently, the front end has to pass in a different header to identify a user as a China user. Analysing the IP by geolocation would be difficult, as the analysis itself might already violate GDPR. Meanwhile, it is also beneficial for the customer to think longer term and consider whether they would want different implementations for global customers in different geographic regions.

Three years ago, I left my job and changed careers to work at HSBC as a technical lead; the project I worked on was Malaysia FPX (Financial Process Exchange). It is a merchant online-payment solution, where customers can go to Shopee, pay, and select debiting from their HSBC account via FPX. The situation was an existing platform built on a legacy system, and I needed to drive improvements to increase the success rate.

The existing platform had a transaction success rate of 30%. That was very low, given that the regulatory requirement is a 70% success rate. My task was to increase the success rate by rebuilding the platform on a newer technology stack, such as AngularJS for the front end and Java Spring Boot for the back-end applications.

After a few months of hard work, the new platform finally launched. I had hoped the transaction success rate would significantly improve and reach the 70% target; however, it turned out the new success rate was only 60%. I did not know why, so I conducted an investigation using customer feedback. First, I used Splunk to collect all the transaction logs and count the errors, to identify the common error reasons, such as timeouts. I also reached out to the customer service team for user feedback. It turned out we had an accessibility issue on the website: customers who are visually impaired could not read the screen and did not know the transaction had to be completed within 10 minutes. They took time and struggled to navigate the page, and the session would time out and the transaction would fail.

Based on the feedback collected, I fixed the errors caused by system issues and the front-end accessibility issues. This increased the transaction success rate to 70%. This matters because it is a regulatory requirement from the Malaysian government, and it spared the bank the business impact of paying a large penalty. The project also reminded me of customers with accessibility needs, such as those who are visually impaired, and that fixing the details serves every single customer. I have to approach every project with a clear and open mind and take in the customer feedback.

During my time there, I was also proud to join an open banking hackathon. I encouraged HSBC to take the risk and offer bank login as a service for other platforms. It could be a security risk for the bank, but it meets business goals, since the bank has already done all the KYC on the account. I pitched the idea at senior CTO level and showed a demo. The CTO was impressed, and my team won one of the hackathon awards.

When I was living in Hong Kong, I took a calculated risk and moved to Singapore. Singapore has many more financial-technology opportunities in Asia Pacific, and the move would fulfil my professional goal of gaining overseas work experience. The trade-off required me to leave my comfort zone, as I had a stable job at HSBC, working as a technical lead in the corporate world, and to leave my family and friends in Hong Kong for an unknown startup environment. It was a calculated risk because, in the worst-case scenario, I could still move back to Hong Kong if I did not enjoy the work environment in Singapore. The upside, however, was unlimited: I could experience different opportunities and work environments, even amid the challenging COVID-19 situation, which required 14 days of hotel quarantine. The outcome is that I really enjoy the work environment in Singapore; it is green, well planned, and organised, and there are many work opportunities in the fintech industry.

Currently, I work at Thought Machine as a client engineering manager. It is a post-sales role, which means I am billable by the hour for services provided to the customer. Once a project is finished and no longer billable, it is handed over to the customer, and they are on their own.

However, in most scenarios, the client still has questions or production-support queries. They can contact me via the Slack channel, since I have built a personal relationship with them. At the same time, my business needs me to focus on billable hours and fill in timesheets for utilisation; the time it takes to answer customer queries is out of scope and not billable.

To balance this, I still try my best to find answers for the customer, even if it means working extra hours to compensate. It may not be billable right away, but answering the customer's questions should be a high priority.

Overall, the client is happy to get technical queries resolved within a short period of time, instead of going through the whole process of raising support tickets and waiting days for feedback. After I had answered more than 10 technical queries, the customer was happy to compensate me with billable hours, satisfying the needs of both businesses. My approach is always to think about long-term benefits for the business and customer retention, rather than making a short-term profit and losing the customer. Sometimes customers ask about a product issue, for example a balance that was not deducted correctly. They felt they needed an answer right away, but I believed they needed the correct answer rather than a quick one.

I got on a call and set the expectation that I was collecting more information rather than resolving the issue right away, as I might need help from the UK team in a different time zone. Communication is key here: I set the expectation that I would get back to them as soon as possible, and I was prepared to work overtime into UK hours to help troubleshoot if required. In this case, while gathering more log information on the call, the customer realised something was wrong with an upstream system, which had sent the wrong balance amount in the request. The root cause was identified and resolved in the course of troubleshooting.

One of my current clients is a Singapore digital bank. After I delivered the Python code implementing their current-account business logic, it was tested and was about to launch to production in beta. Before the launch, the Monetary Authority of Singapore required load testing to be performed by a third-party auditor, which was EY. They were running performance tests on AWS against our platform, and when the load hit 15 transactions per second (TPS) to a single account, they started to see response times degrade and latency grow. The customer raised a production incident ticket, and I had to reply within the Service Level Agreement (SLA), as I was the first line of support for the customer.

What I did was quickly request the relevant information, including the Grafana dashboard for metrics, Kibana for logging, and the latest source code from production. I analysed the data and identified the main issue: a user daily transaction limit feature, where customers could only spend a fixed maximum amount per day, such as $1,000 SGD. The key problem was that the implementation looped through a full day of transactions and aggregated the total amount in order to check the current spending. That becomes an issue at 15 transactions per second, which is 900 per minute and 54,000 per hour! I responded immediately with a root-cause analysis and mitigation: I changed the implementation to use a variable that stores a running total.
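Here is a simplified, illustrative sketch of the change (not the actual production code): the daily-limit check becomes O(1) against a stored running total, instead of an O(n) scan of the day's transactions:

```python
from dataclasses import dataclass

@dataclass
class DailyLimit:
    limit: int                 # maximum daily spend, in minor units
    spent_today: int = 0       # running total, updated on every posting

    def try_spend(self, amount: int) -> bool:
        # O(1) check against the running total, instead of re-aggregating
        # the whole day's transaction history on every request.
        if self.spent_today + amount > self.limit:
            return False
        self.spent_today += amount
        return True

    def reset(self) -> None:
        # Called once at the start of each business day.
        self.spent_today = 0

account_limit = DailyLimit(limit=100_000)  # e.g. $1,000.00 SGD in cents
print(account_limit.try_spend(60_000))  # True
print(account_limit.try_spend(50_000))  # False: would exceed the daily limit
```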

As a result, the code was much more performant than the original implementation. I resolved the production incident within a short period of time, which helped the client pass the third-party audit and complete the time-sensitive regulatory requirement within a short window.

Besides this, the sales team always asks me for estimates on prospects: how long it would take, how many resources it would need, and how much it would cost the prospect if they chose our project team to deliver. It usually requires quick judgement, since a prospect gives only a few days to respond, and the requirements provided are incomplete, often just a few lines, such as "multi-currency account support with compliance", without any details.

I need to make a quick judgement on the project estimate. If the estimated cost is too high, the prospect will not choose my company to deliver the project; if it is too low, I will be in trouble, without enough resources or time to deliver within the tight timeline. Usually the prospect does not know exactly what they want either, and the procurement team is not the one who comes up with the product requirements.

What I do is approach the prospect, set up a call, and clarify as much as possible. Then I make assumptions and validate them against customer feedback, such as assuming the client environment has the infrastructure ready, the pipelines set up, and so on. I also reach out to other project teams with a similar scope to collect data points on their delivery. Finally, I prepare best-case and worst-case estimates with some buffer, to compensate for the lack of deep analysis of the requirements.

Eventually, the sales team was able to get back to the Vietnamese prospect, and they were happy with my proposal. They selected my company's product over 4 other vendors for their core banking system. More importantly, the project was delivered on time, because the timeline was realistic and the assumptions were reasonable.

After I started my current job, I needed to learn Kubernetes and cloud technology, since they are used in the product. I wanted to get certified to validate my knowledge, so I took multiple exams: CKA, CKAD, and CKS for Kubernetes, the Google Cloud Professional Architect certification, and Azure and AWS cloud certificates. It took a lot of time taking online courses, watching videos, reading blog posts, and reading books about best practices. More importantly, I did hands-on practice and tried to solve real-world problems, applying the knowledge in practical solutions. I recently earned the certificates, and I did not just learn the concepts and forget them after the exams; I learn by doing and still use them to solve real-world problems.

My job now is client-facing and involves product demos. After doing the product demo at a couple of sprint ends, I set a goal to improve my presentation skills and become a better storyteller. I recognised that communication was at the heart of the job, so I really wanted to become a master communicator and impress my clients. I spent a lot of time practising and joined a Toastmasters club, paying attention to feedback on what went well and what did not, so that I could improve. I even enrolled in a storytelling class to help me advance. My efforts eventually led me to present a sales pitch in front of a Taiwanese prospect, where I showed our product demo and answered technical queries. As a result, they selected my company's product for their new project.

It has been a long journey since I started my career 10 years ago. I am writing all these stories as practice to improve my communication skills, which is essential for reaching the next stage of my career. I hope you enjoy the stories.


Introducing Amazon Web Services (AWS)

Hello everyone, my name is Victor Leung and I am an AWS Community Builder. In this article, I would like to introduce Amazon Web Services (AWS). You may be wondering: what is AWS? It is the world's most comprehensive and broadly adopted cloud platform. Customers trust AWS to power their infrastructure and applications, and organisations of every type and size use AWS to lower costs, become more agile, and innovate faster.

AWS provides on-demand delivery of technology services via the internet with pay-as-you-go pricing. You can use these services to run any type of application without upfront costs or ongoing commitments. You only pay for what you use.

Moreover, AWS gives you more services and more features within those services than any other cloud provider. This makes it faster, easier and more cost-effective to move your existing application to the cloud and to build anything you can imagine.

You can rely on AWS's globally deployed infrastructure to scale your application to meet growing demand. With so many regions in the world, how do you choose? You can start with the region closest to you and your customers. A region is a physical location in the world that consists of multiple Availability Zones. Each Availability Zone consists of one or more discrete data centres, each with redundant power, networking, and connectivity, housed in separate facilities. If your company expands to other regions in the future, you can take advantage of AWS facilities there as well. The AWS Cloud spans 84 Availability Zones within 26 geographic regions around the world, with announced plans for 24 more Availability Zones and 8 more AWS Regions.

As for computing power on the cloud platform, there are several types to choose from. You can use the EC2 virtual server service to deploy your own servers. With so many EC2 instance types, how do you choose? It comes down to your needs along four dimensions: CPU, memory, storage, and network performance. Instance names follow a convention encoding family, generation, capabilities, and size, such as m5d.xlarge.

Generally speaking, for the instance selection process, you can start with a best-guess instance and then determine the constrained resource. For example, C5 instances are optimised for compute-intensive workloads and are suited to high-performance web servers, offering cost-effective high performance at a low price-to-compute ratio. M5 instances, by contrast, are general-purpose instances with a balance of compute, memory, and network resources, making them a good choice for many applications.

Once you have started an EC2 instance, you can still change its instance type. You can resize for over-utilised (the instance type is too small) or under-utilised (the instance type is too large) cases. This only works for EBS-backed instances. The steps are: 1. Stop the instance. 2. Instance Settings -> Change Instance Type. 3. Start the instance. You cannot change the instance type of a Spot Instance, and you cannot change the instance type if hibernation is enabled.
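As a rough sketch, the same resize flow can be scripted with boto3; the instance ID and the new type below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")
instance_id = "i-0123456789abcdef0"  # placeholder

# 1. Stop the instance and wait until it is fully stopped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# 2. Change the instance type attribute (EBS-backed, non-Spot only).
ec2.modify_instance_attribute(InstanceId=instance_id,
                              InstanceType={"Value": "m5.xlarge"})

# 3. Start the instance again.
ec2.start_instances(InstanceIds=[instance_id])
```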

There are a couple of available CloudWatch metrics for your EC2 instances:

  • CPUUtilization: the percentage of allocated EC2 compute units currently in use on the instance
  • DiskReadOps: completed read operations from all instance store volumes
  • DiskWriteOps: completed write operations to all instance store volumes
  • DiskReadBytes: bytes read from all instance store volumes
  • DiskWriteBytes: bytes written to all instance store volumes
  • MetadataNoToken: number of times the instance metadata service was successfully accessed using a method that does not use a token
  • NetworkIn: number of bytes received by the instance
  • NetworkOut: number of bytes sent out by the instance
  • NetworkPacketsIn: number of packets received by the instance
  • NetworkPacketsOut: number of packets sent out by the instance

Besides, you can install the CloudWatch agent to collect memory metrics and log files.
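For example, here is a minimal boto3 sketch that pulls the last hour of CPUUtilization for one instance (the instance ID is a placeholder):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,                 # one datapoint per 5 minutes
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")
```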

When purchasing EC2, there are many options. You can start with an on-demand instance first, billed by the second, with no long-term contract. After you try it out in the future, you can choose a more cost-effective reserved instance and pay for a long-term lease of one to three years, which will save you money in the long run.

After choosing the purchase method, you can put the EC2 virtual machines into an Auto Scaling group. When demand increases, the number of EC2 instances grows with it, increasing computing power; when the peak is over, such as in the early morning when there is no traffic, the number of instances can be automatically reduced. Auto scaling can be driven independently by different metrics, and the feature itself is free to use.
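As an illustration, a target-tracking scaling policy for an existing Auto Scaling group might look like this with boto3; the group name and target value are assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",   # placeholder group name
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # Add instances above 50% average CPU, remove them below it.
        "TargetValue": 50.0,
    },
)
```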

For EC2 load balancing, the round-robin routing algorithm is used by default to route requests at the target-group level. It is a good choice when the requests and targets are similar, or if you need to distribute requests equally among targets. You can instead specify the least-outstanding-requests routing algorithm, which takes capacity and utilisation into account, to prevent over- or under-utilisation of targets when requests have varied processing times or targets are frequently added and removed. If you enable sticky sessions, the routing algorithm of the target group is overridden after the initial target selection.
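Switching to least outstanding requests is a target-group attribute; a small boto3 sketch (the ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="ap-southeast-1")

# Placeholder ARN: substitute your own target group.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:ap-southeast-1:"
                   "123456789012:targetgroup/web-tg/abc123",
    Attributes=[{"Key": "load_balancing.algorithm.type",
                 "Value": "least_outstanding_requests"}],
)
```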

An Elastic Load Balancer (ELB) automatically distributes traffic across one or more Availability Zones, checks the health of the backend servers, and lets you scale resources horizontally according to traffic requirements. There are several load balancer options. The Application Load Balancer (ALB) operates at OSI layer 7, i.e. HTTP. The Network Load Balancer operates at layer 4, using the TCP and UDP protocols, and there is also a Gateway Load Balancer.

Suppose your business is unlucky enough to encounter a large-scale incident, such as a natural disaster, an earthquake, damage to a data centre, a technical impediment, or a human error like an employee running rm -rf and deleting all the data. What should you do? There are different methods, each with a different recovery time and recovery point.

The different methods come at different costs: the higher the cost, the faster the recovery. If your business can tolerate a few hours of service interruption, a normal backup-and-restore approach is fine. If it cannot, and service must be restored within minutes, it becomes a matter of replicating an identical environment in a different region and keeping it on standby.

Let me give you an example: deploy a website to an environment in Singapore, and deploy a backup environment in Hong Kong at the same time. Through the Route 53 domain name system, the domain name points to the Singapore region. When a problem occurs in Singapore and the entire region becomes unusable, the domain name can be switched to the Hong Kong region immediately and normal service resumed. The switch can be made manually or automatically, and traffic can even be split proportionally or on a user-by-user basis.
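A sketch of such an active-passive failover setup with boto3 and Route 53 follows; the hosted zone ID, record name, and endpoint DNS names are all placeholders, and a real setup would also attach a health check to the PRIMARY record:

```python
import boto3

route53 = boto3.client("route53")

# Placeholders: hosted zone, record name, and the two regional endpoints.
HOSTED_ZONE_ID = "Z0000000000000"
RECORDS = [("PRIMARY", "sg-alb.ap-southeast-1.example.com"),
           ("SECONDARY", "hk-alb.ap-east-1.example.com")]

for role, endpoint in RECORDS:
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "CNAME",
                "SetIdentifier": f"www-{role.lower()}",
                "Failover": role,  # PRIMARY serves until its health check fails
                "TTL": 60,
                "ResourceRecords": [{"Value": endpoint}],
            },
        }]},
    )
```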

However, operating in two regions is relatively expensive. For generally small-scale problems, such as component failures, network issues, or sudden increases in traffic, deploying to two or more Availability Zones is enough. When one zone is unavailable, traffic immediately moves to another zone, and data can be replicated independently.

As for the database, you can choose RDS, which is compatible with MySQL and can be resized. RDS is a hosted service that handles patching, backup, and restore for you. In the future, you can also consider Aurora, whose throughput can reach about three times that of standard MySQL, though it is also more expensive; it depends on whether you need the performance of a commercial-grade database.

RDS allows Multi-AZ deployments, which provide enterprise-grade high availability and fault tolerance across multiple data centres, with automatic failover and synchronous replication, enabled with one click. When failing over, Amazon RDS simply flips the canonical name record (CNAME) of your DB instance to point at the standby, which is in turn promoted to become the new primary.

RDS read replicas provide read scaling and disaster recovery. They relieve pressure on your master node with additional read capacity, and they bring data closer to your applications in different regions. You can promote a read replica to a master for faster recovery in the event of a disaster.

If you need strict read-after-write consistency (what you read is what you just wrote), you should read from the main DB instance. Otherwise, you can spread out the load and read from one of the read replicas. Read replicas track all of the changes made to the source DB instance, but the replication is asynchronous, so replicas can sometimes be out of date with respect to the source. This phenomenon is called replication lag. The ReplicaLag metric in Amazon CloudWatch lets you see how far a replica has fallen behind its source DB instance.
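As a small example, you could watch ReplicaLag for a replica with boto3 (the replica identifier is a placeholder):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydb-replica-1"}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=30),
    EndTime=datetime.now(timezone.utc),
    Period=60,                  # one datapoint per minute
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], "seconds behind source")
```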

Amazon RDS encrypted DB instances use the industry-standard AES-256 encryption algorithm to encrypt your data on the server that hosts your RDS DB instances. To protect data in transit, all AWS service endpoints support TLS, creating a secure HTTPS connection for API requests. Manage secrets, API keys, and credentials with AWS Key Management Service (AWS KMS). As the team expands, AWS Identity and Access Management (IAM) lets you specify who or what can access services and resources in AWS, centrally manage fine-grained permissions, and analyse access to refine permissions across AWS. Multi-factor authentication (MFA) in AWS is a simple best practice that adds an extra layer of protection on top of your user name and password. Firewalls (web application and network), DDoS protection, threat detection, secrets management alerts, and per-service security controls are all available under AWS Security, Identity & Compliance.

Amazon CloudFront is a content delivery network (CDN) service built for high performance, security, and developer convenience. It speeds up the distribution of your web content to your users through a worldwide network of data centres called edge locations. A user request is routed to the edge location that provides the lowest latency (time delay), so that content is delivered with the best possible performance. For example, the first client request from the United States has to travel halfway around the world to Singapore to fetch the content, but the second request can be served from the cached copy at an edge location near the United States, greatly reducing the distance and response time.

For dynamic content acceleration, you can use the standard cache-control headers you set on your files to identify static and dynamic content. Dynamic content is not cacheable; it is proxied by the CDN to the origin and back. Faster response time = reduced DNS time (Route 53) + reduced connection time (keep-alive connections and SSL termination) + reduced first-byte time (keep-alive connections) + reduced content download time (TCP/IP optimisation). You can optimise further with latency-based routing (LBR): run multiple stacks of the application in different Amazon EC2 regions around the world, create LBR records for each location, and tag each location with geo information. Route 53 will route end users to the endpoint that provides the lowest latency.

AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application updates. It can prevent issues by running tests and performing quality checks. Amazon CloudWatch is a monitoring and observability service. It provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimise resource utilisation. Upon detecting an abnormal pattern or a failing health check, you can trigger an alarm or actions, which can in turn trigger AWS Lambda, a serverless, event-driven compute service, to mitigate the issue, for example by restarting the server or reverting to the previous stable version. You can then recover from failed service instances.

For object storage, there are six Amazon Simple Storage Service (Amazon S3) storage classes to choose from: S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Intelligent-Tiering, S3 Glacier, and S3 Glacier Deep Archive. The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, the most retrieval flexibility, and the lowest-cost archive storage in the cloud.

For S3 data consistency, new objects (PUTs) have read-after-write consistency: when you upload a new S3 object, you can read it immediately after writing. Overwrites (PUTs) and deletes have eventual consistency: when you overwrite or delete an object, it takes time for S3 to replicate the new version across AZs, so an immediate read may return an old copy. You generally need to wait a few seconds before reading.

Another storage option is EBS. What is Amazon Elastic Block Store (EBS)? It provides block storage volumes as a service, attached to Amazon EC2 instances. It offers flexible storage and performance for dynamic workloads such as stateful containers. Volumes can be created, attached, and managed through the API, SDKs, or the AWS console, and EBS supports point-in-time snapshots and tools to automate backup and retention via policies.

gp3, the General Purpose SSD, is great for boot volumes, low-latency applications, and bursty databases.

  • IOPS: 3,000 - 16,000 IOPS
  • Throughput: 128 - 1,000 MiB/s
  • Latency: Single-digit ms
  • Capacity: 1 GiB to 16 TiB
  • I/O Size: Up to 256 KiB (logical merge)

io2 Block Express is ideal for critical applications and databases with sustained IOPS. Its next-generation architecture provides 4x the throughput and 4x the IOPS.

  • Up to 4,000 MiB/s
  • Up to 256,000 IOPS
  • 1,000:1 IOPS to GB
  • 4x volume size up to 64 TB per volume
  • < 1-millisecond latency

st1, the Throughput Optimized HDD, is ideal for large-block, high-throughput sequential workloads.

  • Baseline: 40 MiB/s per TiB, up to 500 MiB/s
  • Burst: 250 MiB/s per TiB, up to 500 MiB/s
  • Capacity: 125 GiB to 16 TiB
  • I/O Size: Up to 1 MiB (logical merge)

sc1, the Cold HDD, is ideal for sequential-throughput workloads such as logging and backup.

  • Baseline: 12 MiB/s per TiB, up to 192 MiB/s
  • Burst: 80 MiB/s per TiB, up to 250 MiB/s
  • Capacity: 125 GiB to 16 TiB
  • I/O Size: Up to 1 MiB (logical merge)

For EBS availability, EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. It protects against failures with 99.999% availability, including replication within Availability Zones (AZs), and 99.999% durability with io2 Block Express volumes. EBS snapshots are stored in S3, which stores data across three Availability Zones within a single region.

Besides EBS, there is Amazon Elastic File System (Amazon EFS). It is serverless shared storage: no provisioning of capacity, connections, or IOPS. It is elastic, so you pay only for the capacity used, and built-in performance scales with capacity. It has high durability and availability, designed for 11 nines of durability and a 99.99% availability SLA.

AWS CloudFormation is a service that helps you model and set up your AWS resources, so that you can spend less time managing those resources and more time focusing on the applications that run in AWS. It gives you infrastructure as code (IaC), consistency across accounts and regions, and dev/test environments on demand. An Amazon Machine Image (AMI) is a supported and maintained image provided by AWS that supplies the information required to launch an instance.

Finally, to sum up, there are many AWS services for achieving a well-architected workload with operational excellence, security, performance efficiency, reliability, and cost optimisation. There is so much to learn, so let's keep learning. Thank you very much for taking the time to read this article. Let me know if you have any questions; I am happy to connect.

Introducing Amazon Web Services (AWS)

Welcome to Continuous Improvement, the podcast where we dive deep into the world of Amazon Web Services (AWS). I'm your host, Victor Leung, an AWS community builder, here to help you navigate the vast world of cloud computing. In today's episode, we'll be discussing the fundamental aspects of AWS and how it can transform your business. So, let's get started!

So, what is AWS? Well, it's the world's most comprehensive and well-adopted cloud platform. Customers across the globe trust AWS to power their infrastructure and applications, allowing them to lower costs, become more agile, and innovate faster.

AWS provides on-demand delivery of technology services via the internet with pay-as-you-go pricing. This means that you only pay for the services you use, without any upfront costs or ongoing commitments. It's a flexible and cost-effective solution for running any type of application.

One of the key advantages of AWS is the wide range of services and features it offers. With AWS, you have access to more services and features than any other cloud provider, making it faster, easier, and more cost-effective to move your existing applications to the cloud and build new innovative solutions.

When it comes to scaling your applications, AWS has you covered with its globally deployed infrastructure. With 84 Availability Zones within 26 geographic regions worldwide and more on the way, you can easily scale your application to meet growing demand. You can choose the region closest to you and your customers, ensuring low latency and high performance.

Now, let's talk about computing power on the AWS platform. One of the key services for deploying servers is Amazon EC2, also known as Elastic Compute Cloud. With EC2, you have a wide range of instance types to choose from, each optimized for different use cases. The selection process involves considering the CPU, memory, storage, and network performance requirements of your application.

But what happens if you choose the wrong instance type? No worries! With AWS, you have the flexibility to change the instance type even after you've started an EC2 instance. Simply stop the instance, change the type, and start it again. It's that easy.

To ensure optimal performance and stability, AWS provides various CloudWatch metrics for your EC2 instances. These metrics include CPU utilization, disk operations, network traffic, and more. You can also install the CloudWatch agent to collect memory metrics and track log files.

Now, let's move on to another critical aspect of AWS: high availability and fault tolerance. AWS offers various features to ensure your applications are always up and running, even in the face of disasters. One of these features is the ability to deploy your application across multiple Availability Zones. By distributing your infrastructure, you can ensure that your application remains available even if one zone goes down.

In case of larger-scale disasters or outages, you can leverage multi-region deployments. For example, deploying your website in Singapore and setting up a backup environment in Hong Kong. By utilizing Amazon Route53, you can easily switch your domain name and redirect traffic to the backup region.

Introducing Amazon Web Services (AWS)

Hello everyone, my name is Victor Leung and I am an AWS community builder. In this article, I would like to introduce Amazon Web Service (AWS). You may be wondering, what is AWS? It is the world's most comprehensive and well-adopted cloud platform. Customers trust AWS to power their infrastructure and applications. Organisations of every type and size are using AWS to lower costs, become more agile and innovate faster.

AWS provides on-demand delivery of technology services via the internet with pay-as-you-go pricing. You can use these services to run any type of application without upfront costs or ongoing commitments. You only pay for what you use.

Moreover, AWS gives you more services and more features within those services than any other cloud provider. This makes it faster, easier and more cost-effective to move your existing application to the cloud and to build anything you can imagine.

You can rely on AWS's globally deployed infrastructure to scale your application to meet growing demand. With so many regions in the world, how do you choose? You can start with the region closest to you and your customers. A region is a physical location in the world that consists of multiple Availability Zones. Each Availability Zone consists of one or more discrete data centres, each with redundant power, networking, and connectivity, housed in separate facilities. If your company expands to other regions in the future, you can take advantage of AWS facilities there as well. The AWS Cloud spans 84 Availability Zones within 26 geographic regions around the world, with announced plans for 24 more Availability Zones and 8 more AWS Regions.
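
If you want to see which regions are available to your account programmatically, a small sketch with boto3 (the AWS SDK for Python) like the one below should work; the choice of us-east-1 as the calling region is arbitrary:

```python
import boto3

# List the regions enabled for this account
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_regions(AllRegions=False)
for region in response["Regions"]:
    print(region["RegionName"], region["Endpoint"])
```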

As for computing power on the cloud platform, there are several options to choose from. You can use the EC2 virtual server service to deploy your servers on the platform. With so many EC2 instance types, how do you choose? It comes down to your needs across four aspects: CPU, memory, storage, and network performance. Instance names follow a convention that encodes the family, generation, additional capabilities, and size, such as m5d.xlarge.

Generally speaking, for the instance selection process, you can start with a best-guess instance and then determine the constrained resources. For example, C5 instances are optimised for compute-intensive workloads and are well suited for high-performance web servers, offering cost-effective high performance at a low price-per-compute ratio. M5 instances, by contrast, are general-purpose instances with a balance of compute, memory, and network resources, making them a good choice for many applications.

Once you have started an EC2 instance, you can still change its instance type. You can resize an over-utilised instance (the instance type is too small) or an under-utilised one (the instance type is too large). This only works for EBS-backed instances. The steps are: 1. stop the instance; 2. Instance Settings -> Change Instance Type; 3. start the instance. Note that you cannot change the instance type of a Spot Instance, and you cannot change it if hibernation is enabled.
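
If you prefer to script this, here is a minimal sketch using boto3; the instance ID and target type are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

# 1. Stop the instance (required for EBS-backed instances)
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# 2. Change the instance type
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "m5.xlarge"},
)

# 3. Start the instance again
ec2.start_instances(InstanceIds=[instance_id])
```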

A number of CloudWatch metrics are available for your EC2 instances:

  • CPUUtilization: the percentage of allocated EC2 compute units
  • DiskReadOps: completed read operations from all instance store volumes
  • DiskWriteOps: completed write operations to all instance store volumes
  • DiskReadBytes: bytes read from all instance store volumes
  • DiskWriteBytes: bytes written to all instance store volumes
  • MetadataNoToken: number of times the instance metadata service was successfully accessed using a method that does not use a token
  • NetworkIn: number of bytes received by the instance
  • NetworkOut: number of bytes sent out by the instance
  • NetworkPacketsIn: number of packets received by the instance
  • NetworkPacketsOut: number of packets sent out by the instance

Besides, you can install the CloudWatch agent to collect memory metrics and log files.
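
For example, a boto3 sketch for pulling the last hour of CPUUtilization for a hypothetical instance might look like this:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,            # 5-minute data points
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")
```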

When purchasing EC2, there are many options. You can start with an On-Demand Instance, billed by the second with no long-term contract. Once you have tried it out, you can choose a more cost-effective Reserved Instance and commit to a one- or three-year term, which will save you money in the long run.

After choosing the purchase method, you can put the EC2 virtual machines into an Auto Scaling group. When demand increases, the number of EC2 instances can grow with it, increasing the computing power. When the peak period is over, such as in the early morning when there is no traffic, the number of instances can be automatically reduced. Auto Scaling can scale independently on different metrics, and the feature itself is free to use.
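
As a sketch of how this could be wired up with boto3, assuming a pre-existing launch template named web-template and two hypothetical subnets:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Create an Auto Scaling group from a hypothetical launch template
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # hypothetical subnets
)

# Scale on average CPU with a target tracking policy
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)
```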

For EC2 load balancing, the round-robin routing algorithm is used by default to route requests at the target group level. It is a good choice when the requests and targets are similar, or when you need to distribute requests equally among targets. You can instead specify the least-outstanding-requests routing algorithm, which takes capacity and utilisation into account, to prevent over- or under-utilisation of targets when requests have varied processing times or targets are frequently added and removed. If you enable sticky sessions, the routing algorithm of the target group is overridden after the initial target selection.
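
The routing algorithm is a target group attribute, so switching it could look like the following boto3 sketch; the target group ARN is hypothetical:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Switch a (hypothetical) target group to least-outstanding-requests routing
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:ap-southeast-1:123456789012:targetgroup/web/abc123",
    Attributes=[
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"},
        # Sticky sessions would be toggled via the "stickiness.enabled" attribute
    ],
)
```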

An Elastic Load Balancer (ELB) can automatically distribute traffic across one or more Availability Zones, check the health of the backend servers, and scale resources horizontally according to traffic requirements. There are several load balancer options. The Application Load Balancer (ALB) operates at OSI layer 7, which is HTTP. The Network Load Balancer distributes traffic at layer 4 of the OSI model, using the TCP and UDP protocols, and there is also a Gateway Load Balancer.

Suppose your business is unlucky enough to encounter a large-scale incident, such as a natural disaster, an earthquake, damage to a data centre, a technical failure, or a human error, such as an employee running rm -rf and deleting all the data. What should you do? There are different recovery methods, each with different recovery time and recovery point objectives.

Different methods come at different costs: the higher the cost, the faster the recovery. If your business can tolerate a few hours of service interruption, a normal backup-and-restore approach is fine. If it cannot, and service must be restored within minutes, then you need to replicate an identical environment in a different region and keep it in a standby state.

Let me give you an example: deploy a website to an environment in Singapore, and deploy a backup environment in Hong Kong at the same time. Through the Route 53 domain name system, the domain name points to the Singapore region. When a problem occurs in Singapore and the entire region becomes unusable, the domain name can be switched to the Hong Kong region immediately, and normal service can resume. The switch can be performed manually or automatically, and traffic can even be split proportionally or on a user-by-user basis.
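
A hedged boto3 sketch of the PRIMARY half of such a failover setup is below; every identifier shown (hosted zone, health check, load balancer zone and DNS name) is hypothetical, and a matching SECONDARY record pointing at the standby region would be created the same way:

```python
import boto3

route53 = boto3.client("route53")

# Point the PRIMARY failover record at the Singapore load balancer
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "primary-singapore",
                "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",  # hypothetical
                "AliasTarget": {
                    "HostedZoneId": "Z1111111EXAMPLE",  # the load balancer's own zone ID
                    "DNSName": "web-alb.ap-southeast-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    },
)
```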

However, operating in two regions is relatively expensive. For smaller-scale problems, such as component failures, network issues, or sudden increases in traffic, deploying to two or more Availability Zones is sufficient. When one zone becomes unavailable, traffic immediately moves to another available zone, and data can be replicated independently.

Regarding the database, you can choose RDS, which is compatible with MySQL and can be resized. RDS is a managed service that handles patching, backup, and restore for you. In the future, you can also consider Aurora, whose throughput can reach about three times as much, though it is also more expensive; it depends on whether you need the performance of a commercial database.

RDS allows Multi-AZ deployments, which provide enterprise-grade high availability and fault tolerance across multiple data centres, with automatic failover and synchronous replication, enabled with one click. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB instance to point at the standby, which is in turn promoted to become the new primary.

RDS read replicas provide read scaling and disaster recovery. They relieve pressure on your primary node with additional read capacity and bring data close to your application in different regions. You can promote a read replica to primary for faster recovery in the event of a disaster.

If you need strict read-after-write consistency (what you read is what you just wrote), you should read from the primary DB instance. Otherwise, you can spread out the load and read from one of the read replicas. Read replicas track all of the changes made to the source DB instance, but this replication is asynchronous, so a replica can sometimes be out of date with respect to the source. This phenomenon is called replication lag. The ReplicaLag metric in Amazon CloudWatch lets you see how far a replica has fallen behind its source DB instance.
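
To watch replication lag, you can read the ReplicaLag metric from CloudWatch; a minimal sketch, with a hypothetical replica identifier:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydb-replica-1"}],  # hypothetical
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], "seconds behind source")
```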

Amazon RDS encrypted DB instances use the industry-standard AES-256 encryption algorithm to encrypt your data on the server that hosts your Amazon RDS DB instances. To protect data in transit, all AWS service endpoints support TLS, creating a secure HTTPS connection for API requests. You can manage secrets, API keys, and credentials with AWS Key Management Service (AWS KMS). As the team expands, AWS Identity and Access Management (IAM) lets you specify who or what can access services and resources in AWS, centrally manage fine-grained permissions, and analyse access to refine permissions across AWS. Multi-factor authentication (MFA) in AWS is a simple best practice that adds an extra layer of protection on top of your user name and password. On top of that, AWS Security, Identity & Compliance services provide firewalls (web application and network), DDoS protection, threat detection, secret management alerts, and security controls for individual AWS services.

Amazon CloudFront is a content delivery network (CDN) service built for high performance, security, and developer convenience. It speeds up the distribution of your web content to your users through a worldwide network of data centres called edge locations. A user request is routed to the edge location that provides the lowest latency (time delay), so content is delivered with the best possible performance. For example, the first client request from the United States may have to travel halfway around the world to Singapore to fetch the content, but the second request can be served from the cached file in a data centre near the United States, which greatly reduces the distance and response time.

For dynamic content acceleration, you can use the standard cache-control headers you set on your files to identify static and dynamic content. Dynamic content is not cacheable; it is proxied by the CDN to the origin and back. Faster response time = reduced DNS time (Route 53) + reduced connection time (keep-alive connections and SSL termination) + reduced first-byte time (keep-alive connections) + reduced content download time (TCP/IP optimisation). You can optimise further with latency-based routing (LBR): run multiple stacks of the application in different Amazon EC2 regions around the world, create LBR records for each location, and tag each location with geo information. Route 53 will then route end users to the endpoint that provides the lowest latency.
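
Since CloudFront honours the Cache-Control headers on your origin objects, one way to mark content as static or dynamic is at upload time. A boto3 sketch with a hypothetical origin bucket:

```python
import boto3

s3 = boto3.client("s3")

# Static asset: long max-age so edge locations can cache it
s3.put_object(
    Bucket="my-origin-bucket",  # hypothetical bucket
    Key="static/logo.png",
    Body=open("logo.png", "rb"),  # assumes logo.png exists locally
    ContentType="image/png",
    CacheControl="public, max-age=86400",
)

# Dynamic response: no-cache, so the CDN proxies to the origin every time
s3.put_object(
    Bucket="my-origin-bucket",
    Key="api/latest.json",
    Body=b'{"status": "ok"}',
    ContentType="application/json",
    CacheControl="no-cache",
)
```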

AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application updates. It can be used to prevent issues by running tests and performing quality checks. Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimise resource utilisation. Upon detecting an abnormal pattern or a failing health check, you can trigger an alarm or action, which can in turn invoke AWS Lambda, a serverless, event-driven compute service, to mitigate the issue, for example by restarting the server or reverting to the previous stable version. You can then recover from failed service instances.
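
As an illustration of the alarm side, here is a hedged boto3 sketch that raises an alarm on sustained high CPU and notifies a hypothetical SNS topic, which could in turn trigger a Lambda function:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average CPU stays above 80% for two 5-minute periods;
# the SNS topic ARN is hypothetical and could fan out to AWS Lambda.
cloudwatch.put_metric_alarm(
    AlarmName="web-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-southeast-1:123456789012:ops-alerts"],
)
```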

For object storage, there are six Amazon Simple Storage Service (Amazon S3) storage classes to choose from: S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Intelligent-Tiering, S3 Glacier, and S3 Glacier Deep Archive. The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing the highest performance, the most retrieval flexibility, and the lowest-cost archive storage in the cloud.

For S3 data consistency, new objects (PUTs) have read-after-write consistency: when you upload a new S3 object, you are able to read it immediately after writing. Overwrite PUTs and DELETEs were historically only eventually consistent: when you overwrote or deleted an object, it took time for S3 to replicate the change across Availability Zones, so an immediate read might return an old copy, and you generally had to wait a few seconds before reading. (Note that since December 2020, Amazon S3 delivers strong read-after-write consistency for all PUTs and DELETEs, so this wait is no longer necessary.)
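
As a quick illustration of read-after-write behaviour, a minimal boto3 sketch with a hypothetical bucket:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-demo-bucket", "hello.txt"  # hypothetical bucket

# New object PUT: readable immediately after the write returns
s3.put_object(Bucket=bucket, Key=key, Body=b"version 1")
print(s3.get_object(Bucket=bucket, Key=key)["Body"].read())  # b"version 1"

# Overwrite PUT: with today's strong consistency this also reads back immediately
s3.put_object(Bucket=bucket, Key=key, Body=b"version 2")
print(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
```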

Another storage option is EBS. What is Amazon Elastic Block Store (EBS)? It provides block storage volumes as a service, attached to Amazon EC2 instances, with flexible storage and performance for dynamic workloads such as stateful containers. Volumes can be created, attached, and managed through the API, SDK, or AWS console, and there are point-in-time snapshots and tools to automate backup and retention via policies.
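
A sketch of that volume lifecycle with boto3; the Availability Zone and instance ID are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# Create a 100 GiB gp3 volume in the same AZ as the target instance
volume = ec2.create_volume(
    AvailabilityZone="ap-southeast-1a",
    Size=100,
    VolumeType="gp3",
)
volume_id = volume["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# Attach it to a (hypothetical) instance as /dev/sdf
ec2.attach_volume(
    VolumeId=volume_id,
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)

# Take a point-in-time snapshot for backup
ec2.create_snapshot(VolumeId=volume_id, Description="nightly backup")
```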

gp3 General Purpose SSD volumes are great for boot volumes, low-latency applications, and bursty databases.

  • IOPS: 3,000 - 16,000 IOPS
  • Throughput: 128 - 1,000 MiB/s
  • Latency: Single-digit ms
  • Capacity: 1 GiB to 16 TiB
  • I/O Size: Up to 256 KiB (logical merge)

io2 Block Express volumes are ideal for critical applications and databases with sustained IOPS. Its next-generation architecture provides 4x the throughput and 4x the IOPS.

  • Up to 4,000 MiB/s
  • Up to 256,000 IOPS
  • 1,000:1 IOPS to GB
  • 4x volume size up to 64 TB per volume
  • < 1-millisecond latency

st1 Throughput Optimized HDD volumes are ideal for large-block, high-throughput sequential workloads.

  • Baseline: 40 MiB/s per TiB, up to 500 MiB/s
  • Burst: 250 MiB/s per TiB, up to 500 MiB/s
  • Capacity: 125 GiB to 16 TiB
  • I/O Size: Up to 1 MiB (logical merge)

sc1 Cold HDD volumes are ideal for sequential throughput workloads, such as logging and backup.

  • Baseline: 12 MiB/s per TiB, up to 192 MiB/s
  • Burst: 80 MiB/s per TiB, up to 250 MiB/s
  • Capacity: 125 GiB to 16 TiB
  • I/O Size: Up to 1 MiB (logical merge)

For EBS availability, EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. EBS protects against failures with 99.999% availability, including replication within Availability Zones (AZs), and 99.999% durability with io2 Block Express volumes. EBS snapshots are stored in S3, which stores data across three Availability Zones within a single region.

Besides, there is Amazon Elastic File System (Amazon EFS). It is serverless shared storage: there is no provisioning of capacity, connections, or IOPS. It is elastic, so you pay only for the capacity you use, and its built-in performance scales with capacity. It offers high durability and availability, designed for 11 nines of durability and a 99.99% availability SLA.

AWS CloudFormation is a service that helps you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on the applications that run in AWS. It is infrastructure as code (IaC): consistent across accounts and regions, with dev/test environments available on demand. An Amazon Machine Image (AMI) is a supported and maintained image provided by AWS that provides the information required to launch an instance.
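
As a minimal illustration of infrastructure as code, the following boto3 sketch creates a stack containing a single S3 bucket; the stack and bucket names are hypothetical:

```python
import json
import boto3

# A minimal template that declares one S3 bucket
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "AssetsBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "victorleungtw-demo-assets"},
        }
    },
}

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(
    StackName="demo-stack",
    TemplateBody=json.dumps(template),
)
cloudformation.get_waiter("stack_create_complete").wait(StackName="demo-stack")
```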

Finally, to sum up, there are many AWS services for achieving a well-architected workload with operational excellence, security, performance efficiency, reliability, and cost optimisation. There is so much to learn, so let's keep learning. Thank you very much for taking the time to read this article, and let me know if you have any questions. Happy to connect!

Understanding Kubernetes

I've been learning about Kubernetes, also known as K8s, an open-source system for automating the deployment, scaling, and management of containerized applications. Below is a summary of important concepts to understand:

Control Plane — This component makes global decisions about the cluster and consists of the following elements:

  1. ETCD: A strongly consistent, distributed key-value store that provides a reliable way to store data accessible by a distributed system or a cluster of machines.
  2. API Server: Facilitates user interaction via REST, UI, or CLI (kubectl).
  3. Scheduler: Handles resource management, assigning pods to worker nodes while complying with resource restrictions and constraints.

Data Plane — Manages resources, networking, and storage so that container workloads can run.

Namespace — Provides a logical separation of Kubernetes objects for scoping access and dividing the cluster. Every resource scope is either namespaced or cluster-wide.

Node — This can be either a virtual or a physical machine. Multiple machines or VMs constitute the backbone compute resources of a cluster. Nodes are managed by the Control Plane and host Pod objects. Their networks are configured by Service objects. Default components include:

  1. Kubelet: A Control Plane agent.
  2. Container Runtime: Responsible for running the Pod's containers.
  3. Kube Proxy: Acts as a networking proxy within the cluster.

Pod — The most basic deployable object in Kubernetes, resembling a service or microservice. Pods run one or more containers with shared storage and network resources. Types of containers in a Pod include:

  1. init-container: Runs before the main container, usually to perform setup tasks.
  2. main container: Hosts the application process running in the container.
  3. sidecar: Runs alongside the main container and is loosely coupled to it.

Pods are rarely created directly; they are usually created via controller resources like Deployments, DaemonSets, Jobs, or StatefulSets.

ReplicaSet — Maintains a stable set of replica Pods running at any given time. It is generally not deployed on its own but managed by a Deployment object.

ConfigMap — Used for storing non-confidential key-value configurations. These can be used by Pods as file mounts or environment variables accessible by containers within a Pod.
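
As an aside for those scripting against the API, here is a minimal sketch using the official Kubernetes Python client; the ConfigMap name and keys are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod
core_v1 = client.CoreV1Api()

config_map = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="app-config", namespace="default"),
    data={"LOG_LEVEL": "info", "FEATURE_FLAG": "true"},  # hypothetical keys
)
core_v1.create_namespaced_config_map(namespace="default", body=config_map)
```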

Role-based Access Control (RBAC) Resources

  1. ServiceAccount: Provides an identity for all the processes running in a Pod.
  2. ClusterRole/Role: Contains rules that represent a set of permissions. These have to be associated with a ServiceAccount via a ClusterRoleBinding/RoleBinding to take effect.
  3. ClusterRoleBinding/RoleBinding: Grants the permissions defined in a ClusterRole/Role to a given ServiceAccount.

Deployment — Acts as a controller for Pods and any objects associated with them, such as ReplicaSets and ConfigMaps. It continuously reconciles the state as declared in the manifest and manages rollouts to ReplicaSets. It can be configured to execute Canary deployments, which come with garbage collection features.
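
A minimal Deployment created through the Kubernetes Python client might look like the sketch below; the name, labels, and nginx image are illustrative only:

```python
from kubernetes import client, config

config.load_kube_config()
apps_v1 = client.AppsV1Api()

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web", namespace="default"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="web",
                        image="nginx:1.25",
                        ports=[client.V1ContainerPort(container_port=80)],
                    )
                ]
            ),
        ),
    ),
)
apps_v1.create_namespaced_deployment(namespace="default", body=deployment)
```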

HorizontalPodAutoscaler — Automatically scales workload resources like Deployments or StatefulSets based on metrics like memory and CPU usage. It can also use custom or external metrics for scaling, such as those from Prometheus.

StorageClass — Describes an abstract class of storage with properties like storage type, provider, and reclamation policies. It is used by PersistentVolume.

PersistentVolumeClaim — A user request for storage for a specific resource and privilege level.

PersistentVolume — Represents a piece of storage that can be attached to Pods and has a lifecycle independent of Pods.

Service — Serves as an abstraction for the network exposure of an application running on a set of Pods. It provides load balancing and makes Pods accessible from other Pods within the cluster.

StatefulSet — A controller for managing stateful applications. It maintains a sticky identity for each of its Pods, unlike Deployment resources, and associates each Pod with a unique instance of persistent storage. Deleting or scaling down a StatefulSet does not delete associated volumes.

Job — Suitable for applications that run tasks ending in successful completion. It deploys one or more Pods and retries until a specified number of Pods have terminated, signaling the task's conclusion.

CronJob — Similar to a Kubernetes Job but operates on a set schedule.

Ingress — Guides the cluster’s external traffic to Pods via Services. It requires an Ingress Controller (such as ingress-nginx) to fulfill the Ingress rules and can include features like external load balancing, SSL termination, and name-based virtual hosting within the cluster.

CustomResourceDefinition — Extends Kubernetes resource types by defining custom resource properties and schema.

CustomResource — An instance of a defined custom resource. CustomResources can be subscribed to by a custom controller or an operator and must have an associated CustomResourceDefinition.

These are my notes on learning about Kubernetes. If you are preparing for the CKAD, CKA, or CKS exams, feel free to reach out with any questions. Happy learning!

Understanding Kubernetes

Welcome back to another episode of Continuous Improvement. I'm your host, Victor, and today we're diving into the world of Kubernetes. In this episode, we'll explore the key concepts you need to understand to navigate this powerful open-source system for automating the deployment, scaling, and management of containerized applications. So grab a cup of coffee, sit back, and let's get started.

Let's start with the Control Plane. This component plays a crucial role in making global decisions about the cluster. It consists of three fundamental elements: ETCD, API Server, and Scheduler.

ETCD acts as a distributed key-value store, providing a reliable way to store data accessible by a distributed system or a cluster of machines. It ensures strong consistency while handling data storage.

The API Server serves as the interface through which users interact with the cluster. Whether it's through REST, a user interface, or the command-line tool kubectl, the API Server facilitates seamless communication and management.

The Scheduler handles resource management, ensuring that Kubernetes efficiently assigns pods to worker nodes while adhering to resource restrictions and constraints.

Moving on to the Data Plane, this is where the resources, networking, and storage management come into play to enable container workloads to run smoothly.

In Kubernetes, a Namespace provides a logical separation of objects within the cluster. It scopes access and divides the cluster, allowing for better organization and management.

Nodes, whether they are virtual or physical machines, form the backbone compute resources of a cluster. They are managed by the Control Plane and host Pod objects. Kubelet is a critical component that acts as a Control Plane agent on each node, while the Container Runtime runs and manages the Pod containers. Kube Proxy acts as a networking proxy within the cluster, ensuring smooth communication between Pods.

Now, let's talk about Pods. These are the basic deployable objects in Kubernetes and can be seen as services or microservices. Pods can run one or more containers, sharing storage and network resources. Within a Pod, you may have an init-container to perform setup tasks before the main container starts, the main container hosting the application process, and even a sidecar container loosely coupled to the main container.

It's important to note that Pods are typically created via controller resources like Deployments, DaemonSets, Jobs, or StatefulSets. A ReplicaSet, for example, ensures a stable set of replica Pods running at any given time.

ConfigMaps come in handy when storing non-confidential key-value configurations. These can be used by Pods as file mounts or environment variables accessible by the containers within a Pod.

Now let's explore the Role-based Access Control (RBAC) Resources. A ServiceAccount provides an identity for all processes running within a Pod. ClusterRole and Role contain sets of permissions, while ClusterRoleBinding and RoleBinding grant those permissions to a specific ServiceAccount.

Deployments act as controllers for Pods and associated objects like ReplicaSets and ConfigMaps. They continuously monitor and reconcile the state as declared in the manifest, making rollouts to ReplicaSets seamless. Canary deployments, with garbage collection features, can also be executed.

The HorizontalPodAutoscaler automates the scaling of workload resources based on metrics like memory and CPU usage. It can also utilize custom or external metrics for scaling, such as those provided by Prometheus.

StorageClass describes an abstract class of storage with properties like storage type, provider, and reclamation policies. It is utilized by PersistentVolume, which represents storage that can be attached to Pods and has a lifecycle independent of the Pods themselves.

Moving on to Services, they serve as abstractions for network exposure. They provide load balancing and make Pods accessible from other Pods within the cluster.

For stateful applications, there's the StatefulSet controller. Unlike Deployment resources, it maintains a sticky identity for each Pod and associates each Pod with a unique instance of persistent storage. Deleting or scaling down a StatefulSet does not delete associated volumes.

A Job is ideal for applications that run tasks ending in successful completion. It deploys one or more Pods and retries until a specified number of Pods have terminated, signaling the task's conclusion. And for scheduled tasks, we have CronJobs, which operate on a set schedule.

To guide external traffic to Pods via Services, we have Ingress. Ingress requires an Ingress Controller, such as ingress-nginx, to fulfill Ingress rules. It can handle external load balancing, SSL termination, and name-based virtual hosting within the cluster.

If you're looking to extend Kubernetes resource types, CustomResourceDefinition allows you to define custom resource properties and schemas. These can be subscribed to by custom controllers or operators, and each CustomResource must have an associated CustomResourceDefinition.

And there you have it — a whirlwind tour of important Kubernetes concepts. Whether you're preparing for the CKAD, CKA, or CKS exams or simply looking to enhance your Kubernetes knowledge, I hope this episode has given you valuable insights.

Don't forget, if you have any questions or thoughts to share, feel free to reach out. Until next time, keep learning and embracing continuous improvement.

Thank you for tuning in to this episode of Continuous Improvement. If you enjoyed this content, don't forget to subscribe and leave us a review. And remember, your feedback helps us grow and improve. Stay curious and never stop learning. See you in the next episode!

Understanding Kubernetes

I've been learning about Kubernetes, also known as K8s, an open-source system for automating the deployment, scaling, and management of containerized applications. Below is a summary of the important concepts:

Control Plane — This component makes global decisions about the cluster and includes the following elements:

  1. ETCD: A strongly consistent, distributed key-value store that provides a reliable way to store data accessible by a distributed system or a cluster of machines.
  2. API Server: Facilitates user interaction via REST, UI, or CLI (kubectl).
  3. Scheduler: Handles resource management, assigning Pods to worker nodes while complying with resource restrictions and constraints.

Data Plane — Manages resources, networking, and storage so that container workloads can run.

Namespace — Provides a logical separation of Kubernetes objects for scoping access and dividing the cluster. Every resource scope is either namespaced or cluster-wide.

Node — This can be a virtual or physical machine. Multiple machines or VMs constitute the backbone compute resources of a cluster. Nodes are managed by the Control Plane and host Pod objects. Their networks are configured by Service objects. Default components include:

  1. Kubelet: A Control Plane agent.
  2. Container Runtime: Responsible for running Pod containers.
  3. Kube Proxy: Acts as a networking proxy within the cluster.

Pod — The most basic deployable object in Kubernetes, resembling a service or microservice. Pods run one or more containers with shared storage and network resources. Types of containers in a Pod include:

  1. init-container: Runs before the main container, usually to perform setup tasks.
  2. main container: Hosts the application process running in the container.
  3. sidecar: Runs alongside the main container and is loosely coupled to it.

Pods are rarely created directly; they are usually created via controller resources such as Deployments, DaemonSets, Jobs, or StatefulSets.

ReplicaSet — Maintains a stable set of replica Pods running at any given time. It is generally not deployed on its own but managed by a Deployment object.

ConfigMap — Used for storing non-confidential key-value configurations. Pods can use these as file mounts or as environment variables accessible by the containers within a Pod.

Role-based Access Control (RBAC) Resources

  1. ServiceAccount: Provides an identity for all the processes running in a Pod.
  2. ClusterRole/Role: Contains rules that represent a set of permissions. These must be associated with a ServiceAccount via a ClusterRoleBinding or RoleBinding to take effect.
  3. ClusterRoleBinding/RoleBinding: Grants the permissions defined in a ClusterRole/Role to a given ServiceAccount.

Deployment — Acts as a controller for Pods and any objects associated with them, such as ReplicaSets and ConfigMaps. It continuously reconciles the state declared in the manifest and manages rollouts to ReplicaSets. It can be configured to execute Canary deployments, which come with garbage collection features.

HorizontalPodAutoscaler — Automatically scales workload resources such as Deployments or StatefulSets based on metrics like memory and CPU usage. It can also use custom or external metrics for scaling, such as those from Prometheus.

StorageClass — Describes an abstract class of storage with properties such as storage type, provider, and reclamation policies. It is used by PersistentVolume.

PersistentVolumeClaim — A user request for storage for a specific resource and privilege level.

PersistentVolume — Represents a piece of storage that can be attached to Pods and has a lifecycle independent of Pods.

Service — Serves as an abstraction for the network exposure of an application running on a set of Pods. It provides load balancing and makes Pods accessible from other Pods within the cluster.

StatefulSet — A controller for managing stateful applications. Unlike Deployment resources, it maintains a sticky identity for each of its Pods and associates each Pod with a unique instance of persistent storage. Deleting or scaling down a StatefulSet does not delete the associated volumes.

Job — Suitable for applications that run tasks ending in successful completion. It deploys one or more Pods and retries until a specified number of Pods have terminated, signalling the task's conclusion.

CronJob — Similar to a Kubernetes Job but operates on a set schedule.

Ingress — Guides the cluster's external traffic to Pods via Services. It requires an Ingress Controller (such as ingress-nginx) to fulfil the Ingress rules and can include features like external load balancing, SSL termination, and name-based virtual hosting within the cluster.

CustomResourceDefinition — Extends Kubernetes resource types by defining custom resource properties and schema.

CustomResource — An instance of a defined custom resource. CustomResources can be subscribed to by a custom controller or an operator and must have an associated CustomResourceDefinition.

These are my notes on learning Kubernetes. If you are preparing for the CKAD, CKA, or CKS exams, feel free to reach out with any questions. Happy learning!