Modern API Management
When assessing prominent topics across DZone — and the software engineering space more broadly — it simply felt incomplete to conduct research on the larger impacts of data and the cloud without talking about such a crucial component of modern software architectures: APIs. Communication is key in an era when applications and data capabilities are growing increasingly complex. Therefore, we set our sights on investigating the emerging ways in which data that would otherwise be isolated can better integrate with and work alongside other app components and across systems.

For DZone's 2024 Modern API Management Trend Report, we focused our research specifically on APIs' growing influence across domains, prevalent paradigms and implementation techniques, security strategies, AI, and automation. Alongside observations from our original research, practicing tech professionals from the DZone Community contributed articles addressing key topics in the API space, including automated API generation via no- and low-code tools; communication architecture design among systems, APIs, and microservices; GraphQL vs. REST; and the role of APIs in the modern cloud-native landscape.
Snowflake is a leading cloud-based data storage and analytics service that provides solutions for data warehousing, data engineering, AI/ML modeling, and other related services. It has many features and functionalities; one powerful data recovery feature is Time Travel, which allows users to access historical data from the past. It is beneficial when a user comes across any of the following scenarios: retrieving the previous row or column value before the current DML operation; recovering the last state of data for backup or redundancy; recovering from records updated or deleted by mistake; or restoring the previous state of a table, schema, or database. Snowflake's Continuous Data Protection Life Cycle allows time travel within a window of 1 to 90 days; up to 90 days of retention is available in the Enterprise edition.

Time Travel SQL Extensions
Time Travel can be achieved using the OFFSET, TIMESTAMP, and STATEMENT keywords in combination with the AT or BEFORE clause.

Offset
If a user wants to retrieve past data or recover a table from an older state using a time offset, they can use the queries below, where the offset is defined in seconds.

SQL
SELECT * FROM any_table AT(OFFSET => -60*5); -- For 5 minutes
CREATE TABLE recovered_table CLONE any_table AT(OFFSET => -3600); -- For 1 hour

Timestamp
Suppose a user wants to query past data or recover a schema as of a specific timestamp. Then, the user can utilize the queries below.

SQL
SELECT * FROM any_table AT(TIMESTAMP => 'Sun, 05 May 2024 16:20:00 -0700'::timestamp_tz);
CREATE SCHEMA recovered_schema CLONE any_schema AT(TIMESTAMP => 'Wed, 01 May 2024 01:01:00 +0300'::timestamp_tz);

Statement
Users can also use any unique query ID to get the data as it was before that statement ran.

SQL
SELECT * FROM any_table BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');
CREATE DATABASE recovered_db CLONE any_db BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');

The commands below set the data retention time, which can be increased or decreased.

SQL
CREATE TABLE any_table(id NUMERIC, name VARCHAR, created_date DATE) DATA_RETENTION_TIME_IN_DAYS=90;
ALTER TABLE any_table SET DATA_RETENTION_TIME_IN_DAYS=30;

If data retention is not required, we can also use SET DATA_RETENTION_TIME_IN_DAYS=0;. Objects that do not have an explicitly defined retention period inherit the retention from the upper object level. For instance, tables that do not have a specified retention period inherit it from the schema, and schemas without a defined retention period inherit it from the database level. The account level is the highest level of the hierarchy and should be set up with 0 days for data retention. Now consider a case where a table, schema, or database is accidentally dropped, causing all the data to be lost. When any data object gets dropped, it is kept in Snowflake's back end until the data retention period expires. For such cases, Snowflake has a similarly great feature that brings those objects back with the SQL below.

SQL
UNDROP TABLE any_table;
UNDROP SCHEMA any_schema;
UNDROP DATABASE any_database;

If a user creates a table with the same name as the dropped table, Snowflake creates a new table rather than restoring the old one. When the user runs the UNDROP command above, Snowflake restores the old object. Also, the user needs the appropriate permission or ownership to restore the object.
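The same Time Travel queries can also be run programmatically. Below is a minimal sketch assuming the snowflake-connector-python package and placeholder connection parameters; the table name any_table mirrors the SQL examples above.

Python
# Minimal sketch: running a Time Travel query through the Snowflake Python connector.
# Assumes `pip install snowflake-connector-python`; all connection values are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_warehouse",
    database="your_db",
    schema="your_schema",
)
try:
    cur = conn.cursor()
    # Read the table as it looked 5 minutes ago (same as the SQL example above)
    cur.execute("SELECT * FROM any_table AT(OFFSET => -60*5)")
    for row in cur.fetchmany(10):
        print(row)
finally:
    conn.close()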
If the object isn't retrieved within the data retention period, it is transferred after the Time Travel window to Snowflake Fail-safe, where users can no longer query it. The only way to recover it then is with Snowflake's assistance, and Fail-safe stores the data for a maximum of 7 days.

Challenges
Time Travel, though useful, has a few challenges, as listed below.
Time Travel has a default retention of one day for transient and temporary tables in Snowflake.
Objects other than tables, schemas, and databases, such as views, UDFs, and stored procedures, are not supported.
If a table is recreated with the same name, referring to the older version of that name requires renaming the current table because, by default, Time Travel refers to the latest version.

Conclusion
The Time Travel feature is quick, easy, and powerful. It's always handy and gives users more comfort while operating on production-sensitive data. The great thing is that users can run these queries themselves without having to involve admins. With a maximum retention of 90 days, users have more than enough time to query back in time or fix any incorrectly updated data. In my opinion, it is Snowflake's strongest feature.

Reference
Understanding & Using Time Travel
In this tutorial, I'll explore how to set up and utilize docTR, the open-source OCR (Optical Character Recognition) solution of the document parsing API startup Mindee. I’ll go through what you need to install docTR on Ubuntu. It accepts PDFs, images, and even a website URL as an input. In this example, I will parse a grocery store receipt. Let’s get started. Setting Up docTR on Ubuntu docTR is compatible with any Linux distribution, macOS, and Windows. It is also available as a Docker image. I will use Ubuntu 22.04 LTS (Jammy Jellyfish) for this tutorial. Hardware-wise, you don’t need anything specific, but if you want to do extensive testing, I recommend using a GPU instance; OVHcloud offers affordable options, with servers starting at less than a dollar per hour. Let’s start by installing Python. At the time of writing, docTR requires Python 3.8 (or higher). Shell sudo apt install -y python3 To avoid messing with system libraries, let’s use a virtual environment. Shell sudo apt install -y python3.10-venv python3 -m venv testing-Mindee-docTR Then we install the OpenGL Mesa 3D Graphics Library, used for the computer vision part of docTR. Shell sudo apt install -y libgl1-mesa-glx We install pango, which is a text layout engine library. Shell sudo apt-get install -y libpango-1.0-0 libpangoft2-1.0-0 Then, we install pip so that we can install docTR. Shell sudo apt install -y python3-pip Finally, we install docTR within our virtual environment. This version is specifically for PyTorch. If you choose to use TensorFlow, change the command accordingly. Shell testing-Mindee-docTR/bin/pip3 install "python-doctr[torch]" Using docTR Now that docTR is installed, let’s start playing with it. In this example, I will test it with a grocery store receipt. You can download the receipt using the command below. Shell wget "https://media.istockphoto.com/id/889405434/vector/realistic-paper-shop-receipt-vector-cashier-bill-on-white-background.jpg?s=612x612&w=0&k=20&c=M2GxEKh9YJX2W3q76ugKW23JRVrm0aZ5ZwCZwUMBgAg=" -O receipt.jpeg Create a testing-docTR.py file and insert the following code into it. Python from doctr.io import DocumentFile from doctr.models import ocr_predictor # Load the grocery receipt doc = DocumentFile.from_images("receipt.jpeg") # Load the OCR model model = ocr_predictor(pretrained=True) # Perform OCR result = model(doc) # Display the OCR result print(result.export()) Note that docTR uses a two-stage approach: First, it performs text detection to localize words. Then, it conducts text recognition to identify all characters in the word. The ocr_predictor function accepts additional parameters to select the text detection and recognition architecture. For simplicity, I used the default ones in this example. You can find information about other models on the docTR documentation. 
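If the default models do not perform well on your documents, ocr_predictor lets you pick the architectures explicitly. The snippet below is a minimal sketch assuming the db_resnet50 detection and crnn_vgg16_bn recognition architectures; check the docTR documentation for the models actually available in your version.

Python
# Minimal sketch: selecting explicit detection and recognition architectures in docTR.
# The architecture names below are assumptions; see the docTR docs for the full list.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_images("receipt.jpeg")

model = ocr_predictor(
    det_arch="db_resnet50",      # text detection model
    reco_arch="crnn_vgg16_bn",   # text recognition model
    pretrained=True,
)

result = model(doc)
print(result.export())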
Reading a Receipt Using docTR
Now you just need to run your Python script:

Shell
testing-Mindee-docTR/bin/python3 testing-docTR.py

You will get an output such as the one below:

JSON
{"pages": [{"page_idx": 0, "dimensions": [612, 612], "orientation": {"value": null, "confidence": null}, "language": {"value": null, "confidence": null}, "blocks": [{"geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]], "lines": [{"geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]], "words": [{"value": "RECEIPT", "confidence": 0.9695481061935425, "geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]]}]}], "artefacts": []}]}]}

Note that I drastically shortened the JSON output for readability and only kept the part showing the "RECEIPT" word. Here is the JSON structure you'd be looking at without truncating the result. I have expanded the part of the tree that I kept in the JSON output. docTR provides a lot of information about the document, but the important part is how it breaks the document down into lines and, for each line, provides an array of the words it detected along with a confidence score. Here, we can see it spotted the word RECEIPT with a confidence of 96%. docTR offers an efficient OCR solution that simplifies text recognition processes. Depending on the document type, you may need to change the text detection and text recognition architectures to improve accuracy. Comprehensive docTR documentation is available here.

Considerations When Using docTR
Deploying docTR entails certain complexities. First, you must create a dataset and train docTR to achieve satisfactory accuracy. This means dealing with data annotation on many images. Since OCR systems typically serve as backend services for other apps, it may be necessary to integrate docTR via an API and scale it according to the app's needs. docTR does not provide this out of the box, but there are many open-source technologies that can help facilitate this step.

Conclusion
Document processing technologies have come a long way since the advent of OCR tools, which are limited to character recognition. Intelligent Document Processing (IDP) platforms represent the next step; they utilize OCR (such as docTR) along with additional layers of intelligence, like table reconstruction, document classification, and natural language understanding, to achieve better accuracy and precision. Additionally, for those seeking a scalable IDP solution without the complexities of data collection and model training, I recommend trying out Mindee's latest solution, docTI. This training-free IDP solution leverages Large Language Models (LLMs) to eliminate the need for data collection, annotation, and the model training process. You can use the free-tier plan, configure an instance, and start querying the API in minutes.
Unit testing is a software testing methodology where individual units or components of software are tested in isolation to check whether they function as expected. In Java, it is an essential practice for verifying code correctness and improving code quality. It ensures that the code works as intended and that changes do not break existing functionality. Test-Driven Development (TDD) is a test-first approach to software development in short iterations. It is a practice where a test is written before the real source code is written. The aim is to write code that passes predefined tests and is, therefore, well-designed, clean, and free of bugs.

Key Concepts of Unit Testing
Test automation: Use tools for automatic test running, such as JUnit.
Asserts: Statements that confirm an expected result within a test.
Test coverage: The percentage of code executed by the tests.
Test suites: Collections of test cases.
Mocks and stubs: Dummy objects that simulate real dependencies.

Unit Testing Frameworks in Java: JUnit
JUnit is a simple, open-source, and widely used unit testing framework, and it is one of the most popular Java frameworks for unit testing. It comes with the annotations, assertions, and tools required to write and run tests.

Core Components of JUnit
1. Annotations
JUnit uses annotations to define tests and lifecycle methods. These are some of the key annotations:
@Test: Marks a method as a test method.
@BeforeEach: Denotes that the annotated method should be executed before each @Test method in the current class.
@AfterEach: Denotes that the annotated method should be executed after each @Test method in the current class.
@BeforeAll: Denotes that the annotated method should be executed once before any of the @Test methods in the current class.
@AfterAll: Denotes that the annotated method should be executed once after all of the @Test methods in the current class.
@Disabled: Used to disable a test method or class temporarily.

2. Assertions
Assertions are used to test the expected outcomes:
assertEquals(expected, actual): Asserts that two values are equal. If they are not, an AssertionError is thrown.
assertTrue(boolean condition): Asserts that a condition is true.
assertFalse(boolean condition): Asserts that a condition is false.
assertNotNull(Object obj): Asserts that an object is not null.
assertThrows(Class<T> expectedType, Executable executable): Asserts that the execution of the executable throws an exception of the specified type.

3. Assumptions
Assumptions are similar to assertions but are used in a different context:
assumeTrue(boolean condition): If the condition is false, the test is aborted and reported as skipped rather than failed.
assumeFalse(boolean condition): The inverse of assumeTrue.

4. Test Lifecycle
The lifecycle of a JUnit test runs from initialization to cleanup:
@BeforeAll → @BeforeEach → @Test → @AfterEach → @AfterAll
This allows for proper setup and teardown operations, ensuring that tests run in a clean state.
Example of a Basic JUnit Test
Here's a simple example of a JUnit test class testing a basic calculator:

Java
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.AfterEach;
import static org.junit.jupiter.api.Assertions.*;

class CalculatorTest {
    private Calculator calculator;

    @BeforeEach
    void setUp() {
        calculator = new Calculator();
    }

    @Test
    void testAddition() {
        assertEquals(5, calculator.add(2, 3), "2 + 3 should equal 5");
    }

    @Test
    void testMultiplication() {
        assertAll(
            () -> assertEquals(6, calculator.multiply(2, 3), "2 * 3 should equal 6"),
            () -> assertEquals(0, calculator.multiply(0, 5), "0 * 5 should equal 0")
        );
    }

    @AfterEach
    void tearDown() {
        // Clean up resources, if necessary
        calculator = null;
    }
}

Dynamic Tests in JUnit 5
JUnit 5 introduced a powerful feature called dynamic tests. Unlike static tests, which are defined at compile time using the @Test annotation, dynamic tests are created at runtime. This allows for more flexibility and dynamism in test creation.

Why Use Dynamic Tests?
Parameterized testing: This allows you to create a set of tests that execute the same code but with different parameters.
Dynamic data sources: Create tests based on data that may not be available at compile time (e.g., data from external sources).
Adaptive testing: Tests can be generated based on the environment or system conditions.

Creating Dynamic Tests
JUnit provides the DynamicTest class for creating dynamic tests. You also need to use the @TestFactory annotation to mark the method that returns the dynamic tests.

Example of Dynamic Tests

Java
import org.junit.jupiter.api.DynamicTest;
import org.junit.jupiter.api.TestFactory;

import java.util.Arrays;
import java.util.Collection;
import java.util.stream.Stream;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.DynamicTest.dynamicTest;

class DynamicTestsExample {

    @TestFactory
    Stream<DynamicTest> dynamicTestsFromStream() {
        return Stream.of("apple", "mango", "lemon")
            .map(fruit -> dynamicTest("Test for " + fruit, () -> {
                assertEquals(5, fruit.length());
            }));
    }

    @TestFactory
    Collection<DynamicTest> dynamicTestsFromCollection() {
        return Arrays.asList(
            dynamicTest("Positive Test", () -> assertEquals(2, 1 + 1)),
            dynamicTest("Negative Test", () -> assertEquals(-2, -1 + -1))
        );
    }
}

Creating Parameterized Tests
In JUnit 5, you can create parameterized tests using the @ParameterizedTest annotation. You'll need to use a specific source annotation to supply the parameters. Here's an overview of the commonly used sources:
@ValueSource: Supplies a single array of literal values.
@CsvSource: Supplies data in CSV format.
@MethodSource: Supplies data from a factory method.
@EnumSource: Supplies data from an Enum.
Example of Parameterized Tests
Using @ValueSource

Java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import static org.junit.jupiter.api.Assertions.assertTrue;

class ValueSourceTest {

    @ParameterizedTest
    @ValueSource(strings = {"apple", "banana", "orange"})
    void testWithValueSource(String fruit) {
        assertTrue(fruit.length() > 4);
    }
}

Using @CsvSource

Java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

import static org.junit.jupiter.api.Assertions.assertEquals;

class CsvSourceTest {

    @ParameterizedTest
    @CsvSource({
        "test,4",
        "hello,5",
        "JUnit,5"
    })
    void testWithCsvSource(String word, int expectedLength) {
        assertEquals(expectedLength, word.length());
    }
}

Using @MethodSource

Java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.stream.Stream;

import static org.junit.jupiter.api.Assertions.assertTrue;

class MethodSourceTest {

    @ParameterizedTest
    @MethodSource("stringProvider")
    void testWithMethodSource(String word) {
        assertTrue(word.length() > 4);
    }

    static Stream<String> stringProvider() {
        return Stream.of("apple", "banana", "orange");
    }
}

Best Practices for Parameterized Tests
Use descriptive test names: Leverage @DisplayName for clarity.
Limit parameter count: Keep the number of parameters manageable to ensure readability.
Reuse methods for data providers: For @MethodSource, use static methods that provide the data sets.
Combine data sources: Use multiple source annotations for comprehensive test coverage.

Tagging in JUnit 5
Another salient feature in JUnit 5 is tagging, which allows developers to assign their own custom tags to tests. Tags therefore provide a way to group tests and later execute those groups selectively by tag. This is very useful for managing large test suites.

Key Features of Tagging
Flexible grouping: Multiple tags can be applied to a single test method or class, so flexible grouping strategies can be defined.
Selective execution: Only the desired group of tests can be executed by specifying tags.
Improved organization: Provides an organized way to set up tests for improved clarity and maintainability.

Using Tags in JUnit 5
To use tags, you annotate your test methods or test classes with the @Tag annotation, followed by a string representing the tag name.

Example Usage of @Tag

Java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

@Tag("fast")
class FastTests {

    @Test
    @Tag("unit")
    void fastUnitTest() {
        // Test logic for a fast unit test
    }

    @Test
    void fastIntegrationTest() {
        // Test logic for a fast integration test
    }
}

@Tag("slow")
class SlowTests {

    @Test
    @Tag("integration")
    void slowIntegrationTest() {
        // Test logic for a slow integration test
    }
}

Running Tagged Tests
You can run tests with specific tags using:
Command line: Run the tests by passing the -t (or --tags) argument to specify which tags to include or exclude, for example: mvn test -Dgroups="fast"
IDE: Most modern IDEs like IntelliJ IDEA and Eclipse allow selecting specific tags through their graphical user interfaces.
Build tools: Maven and Gradle support specifying tags to include or exclude during the build and test phases.

Best Practices for Tagging
Consistent tag names: Use a consistent naming convention across your test suite for tags, such as "unit", "integration", or "slow".
Layered tagging: Apply broader tags at the class level (e.g., "integration") and more specific tags at the method level (e.g., "slow"). Avoid over-tagging: Do not add too many tags to a single test, which can reduce clarity and effectiveness. JUnit 5 Extensions The JUnit 5 extension model allows developers to extend and otherwise customize test behavior. They provide a mechanism for extending tests with additional functionality, modifying the test execution lifecycle, and adding new features to your tests. Key Features of JUnit 5 Extensions Customization: Modify the behavior of test execution or lifecycle methods. Reusability: Create reusable components that can be applied to different tests or projects. Integration: Integrate with other frameworks or external systems to add functionality like logging, database initialization, etc. Types of Extensions Test Lifecycle Callbacks BeforeAllCallback, BeforeEachCallback, AfterAllCallback, AfterEachCallback. Allow custom actions before and after test methods or test classes. Parameter Resolvers ParameterResolver. Inject custom parameters into test methods, such as mock objects, database connections, etc. Test Execution Condition ExecutionCondition. Enable or disable tests based on custom conditions (e.g., environment variables, OS type). Exception Handlers TestExecutionExceptionHandler. Handle exceptions thrown during test execution. Others TestInstancePostProcessor, TestTemplateInvocationContextProvider, etc. Customize test instance creation, template invocation, etc. Implementing Custom Extensions To create a custom extension, you need to implement one or more of the above interfaces and annotate the class with @ExtendWith. Example: Custom Parameter Resolver A simple parameter resolver that injects a string into the test method: Java import org.junit.jupiter.api.extension.*; public class CustomParameterResolver implements ParameterResolver { @Override public boolean supportsParameter(ParameterContext parameterContext, ExtensionContext extensionContext) { return parameterContext.getParameter().getType().equals(String.class); } @Override public Object resolveParameter(ParameterContext parameterContext, ExtensionContext extensionContext) { return "Injected String"; } } Using the Custom Extension in Tests Java import org.junit.jupiter.api.Test; import org.junit.jupiter.api.extension.ExtendWith; @ExtendWith(CustomParameterResolver.class) class CustomParameterTest { @Test void testWithCustomParameter(String injectedString) { System.out.println(injectedString); // Output: Injected String } } Best Practices for Extensions Separation of concerns: Extensions should have a single, well-defined responsibility. Reusability: Design extensions to be reusable across different projects. Documentation: Document how the extension works and its intended use cases. Unit testing and Test-Driven Development (TDD) offer significant benefits that positively impact software development processes and outcomes. Benefits of Unit Testing Improved Code Quality Detection of bugs: Unit tests detect bugs early in the development cycle, making them easier and cheaper to fix. Code integrity: Tests verify that code changes don't break existing functionality, ensuring continuous code integrity. Simplifies Refactoring Tests serve as a safety net during code refactoring. If all tests pass after refactoring, developers can be confident that the refactoring did not break existing functionality. 
Documentation Tests serve as live documentation that illustrates how the code is supposed to be used. They provide examples of the intended behavior of methods, which can be especially useful for new team members. Modularity and Reusability Writing testable code encourages modular design. Code that is easily testable is generally also more reusable and easier to understand. Reduces Fear of Changes A comprehensive test suite helps developers make changes confidently, knowing they will be notified if anything breaks. Regression Testing Unit tests can catch regressions, where previously working code stops functioning correctly due to new changes. Encourages Best Practices Developers tend to write cleaner, well-structured, and decoupled code when unit tests are a priority. Benefits of Test-Driven Development (TDD) Ensures test coverage: TDD ensures that every line of production code is covered by at least one test. This provides comprehensive coverage and verification. Focus on requirements: Writing tests before writing code forces developers to think critically about requirements and expected behavior before implementation. Improved design: The incremental approach of TDD often leads to better system design. Code is written with testing in mind, resulting in loosely coupled and modular systems. Reduces debugging time: Since tests are written before the code, bugs are caught early in the development cycle, reducing the amount of time spent debugging. Simplifies maintenance: Well-tested code is easier to maintain because the tests provide instant feedback when changes are introduced. Boosts developer confidence: Developers are more confident in their changes knowing that tests have already validated the behavior of their code. Facilitates collaboration: A comprehensive test suite enables multiple developers to work on the same codebase, reducing integration issues and conflicts. Helps identify edge cases: Thinking through edge cases while writing tests helps to identify unusual conditions that could be overlooked otherwise. Reduces overall development time: Although TDD may initially seem to slow development due to the time spent writing tests, it often reduces the total development time by preventing bugs and reducing the time spent on debugging and refactoring. Conclusion By leveraging unit testing and TDD in Java with JUnit, developers can produce high-quality software that's easier to maintain and extend over time. These practices are essential for any professional software development workflow, fostering confidence and stability in your application's codebase.
The first step in a cloud adoption journey for any enterprise is application portfolio analysis. During this assessment, we see custom in-house (bespoke) applications, Commercial-Off-The-Shelf (COTS) applications, Software-as-a-Service (SaaS) applications, etc. The constitution of these applications in the portfolio varies between enterprises and industries. As an outcome of the assessment, the applications are dispositioned into one of the seven common migration strategies (the 7 R's of migration: Retire, Retain, Refactor, Replatform, Repurchase, Rehost, and Relocate) to arrive at a roadmap for cloud migration. While COTS applications are generally perceived as low-hanging fruit during cloud migrations, they come with their own challenges: for example, the currency of the technical stack (operating system, product versions, frameworks, databases, etc.), managing licenses, and adherence to security requirements in the cloud. Understanding these challenges is critical to arriving at migration strategies. Let us dive a bit deeper and uncover them.

Challenges and Best Practices
From our experiences in cloud migrations, we have observed some common challenges and best practices to mitigate them. These have helped us in successful migrations for many clients.

Disposition
COTS products are used by enterprises as a ready-made solution for their business needs. However, in many cases, these applications trail other evolving applications and become outdated or difficult to integrate with other modern applications. Some applications do not undergo any upgrades or enhancements and outlive the usual application lifecycle. It becomes challenging to migrate such applications, and doing so requires due diligence, stakeholder concurrence, etc.

Best Practices
Perform a comprehensive assessment, considering the business needs, performance, compatibility, risks, and a cost-benefit study. Involving all stakeholders, such as business, IT, and the Independent Software Vendor (ISV), is important for the best outcome of this assessment. Based on the analysis, choose one of the following strategies: Rehost the application and database (lift and shift the virtual machine as-is) to an Infrastructure-as-a-Service (IaaS) compute instance on the cloud. Rehost the application to IaaS and replatform the database to a Platform-as-a-Service (PaaS) instance on the cloud. Replatform to an ISV-managed SaaS solution; this solution can be from the same ISV or from another ISV that provides an easy migration path to their solution. The following challenges will also contribute to the dispositioning of the COTS applications.

ISV Support
Pure COTS applications are usually straightforward to migrate, provided the vendor supports and certifies the application to run on the target cloud platform. Some ISVs provide customized versions of their product to suit the business needs of an enterprise. For such applications, the vendor has their share of ownership and responsibility to maintain them on the client's premises. While these applications are managed by an application team, the knowledge of the application, its intrinsic details, and the application roadmap lie with the vendor. The ISVs have their own timelines and schedules for their releases, and they would require sufficient notice to have their personnel engaged to support the migration. This impacts the migration plan of the COTS application and its dependencies.

Best Practices
Start engaging with the ISV early in the migration journey, preferably in the planning stage.
Some ISVs would require a new professional services contract to be engaged for providing support. Understand the rules of engagement and ensure the contract clearly details the roles and responsibilities of the ISV. Some COTS products require certification by the ISV to run on the cloud; understand the requirements and clearly document them, as this has cost implications. There are also scenarios where enterprises decide to migrate without vendor support because of cost, relying on in-house expertise with the product.

Target Cloud Support
Most enterprises incorrectly assume the COTS product is compatible with any target platform. However, there are instances where they realize after the migration that the products do not work as intended and the vendor refuses to support the chosen platform/technology stack. Some examples are: The ISV might not have an immediate plan for supporting the target operating system or the target database in the cloud. One of the objectives of moving to the cloud is to provide for high availability and disaster recovery, and some COTS products may not support front-ending by a load balancer or provide support for clustering.

Best Practices
During assessment, determine the ISV's support for the target technology stack. Some applications would require modifications or upgrades to be supported. It's good to ask specific questions regarding the support, such as: Does the COTS product support Windows 2022 in AWS Cloud? If yes, does it require modifications to support it? Does the COTS product, which currently uses SQL Server on-premises, support migration to RDS SQL Server in AWS Cloud? It's also good to have the ISV involved in the design of the target architecture, and getting the target architecture approved by the ISV is a good practice.

Security Requirements
One of the common issues encountered during migrations is that COTS products do not adhere to the security policies defined by the Information Security (InfoSec) team. For example, privileged credentials used by COTS applications are often embedded in cleartext within application configuration files, database tables, scripts, etc. These credentials are not changed frequently, and the same credentials may also be used across multiple environments or in other applications. This is perceived as a security vulnerability by the InfoSec team.

Best Practices
Understand from the ISV whether a direct integration is available to a secure vault like Azure Key Vault, AWS Secrets Manager, HashiCorp Vault, etc. Alternatively, the ISV must provide a patch to encrypt the password stored locally on the server or database. Provide for rotation of credentials based on policies such as the criticality of the application or data sensitivity requirements. Some COTS products do not provide support for integration with vaults due to their legacy software stack, and the application cannot be modified. In such cases, an exception from the InfoSec team is sought with a remediation plan, for example, a product upgrade or enhancement, say within 6 months after the migration, as it involves cost, potential changes in integrations with other applications, etc.

Licenses
ISVs use different types of licensing models for their products. Some offer one-time, perpetual licenses, while others require enterprises to renew the licenses (subscription based). Similarly, some licenses are tied to the server's metadata (IP address, MAC address, or hostname) while others are portable.
During migrations, another common issue observed is licensing conflict. Inadequate licenses prevent the application from functioning on the cloud while the license is still tied to the running on-premises instance, or vice versa. There could also be a change in the licensing model when the COTS product is moved to the cloud. For example, moving from an on-premises deployment to a SaaS model would require a move from a perpetual licensing model to a subscription-based one.

Best Practices
Understand the current licensing model and the number of licenses available, and request additional licenses if needed. Some ISVs provide temporary licenses that allow the application to run simultaneously on-premises and on the target cloud platform. Understand whether the installed software performs license checks. Some COTS applications send internet egress traffic for license validation; knowing this helps in planning firewall rules during migration.

Team Organization and Coordination
In the case of business-owned applications, the IT presence will be limited to providing platform support. Also, for customized COTS products, the ISV is a key participant and contributor in the migration. Involving them late is a common mistake; it causes delays, requires expedited engagement, and in turn proves to be an expensive affair. Along with identifying the contributors from IT (infrastructure and database support), business (testers), etc. to support the migrations, placing the right ownership of actions on the ISV is also important.

Best Practices
The ISV team should have tasks assigned in the project plan, and it is necessary to communicate the tasks and the timeline to them as early as possible. The ISV should have clearly defined responsibilities. For example, during the installation of the COTS software in the cloud, the application team might perform the installation with support from the ISV, or the ISV themselves might perform the installation. These activities should be listed in the runbook with the ISV listed as the task owner. ISVs might also require access to the cloud environment during migration or for later support; access requirements for the ISV can be evaluated during migration and provided for.

Integrations
While most modern COTS products support enhanced security controls, you will come across a few products that use non-secure ports or integration mechanisms for communication, for example, HTTP ports (80, 8080) or FTP (21). In the cloud, one of the security controls enforced is the encryption of data in transit. Additionally, other applications having an affinity with the COTS product may take a modernization path involving a change in the framework (Struts to Spring Boot), data models (XML to JSON), etc. This may require some changes to the COTS product. Application remediation for such enhancements would require a considerable amount of time and testing.

Best Practices
Start the identification of these integrations much earlier and work with the ISV on the changes to the COTS application. Factor in efforts for comprehensive testing. Even when such requirements are identified early in the migration, it is possible that changes cannot be made to the COTS product due to timelines and various other factors. In such cases, it's normal to get an exception approval from InfoSec for allowing these ports. We can also understand from the ISV if there are automation possibilities, like enabling CI/CD pipelines, configuration management, etc.,
that will reduce the manual effort and errors in deployment. Automation can also assist in enabling faster recovery in case of an outage.

Data Migration
Ensuring data remains secure during and after the migration is a significant challenge, as enterprises must consider data encryption, access controls, and compliance requirements such as GDPR, HIPAA, or PCI DSS. COTS applications often have large volumes of data stored in various formats and structures. Migrating this data to the cloud while maintaining its integrity and consistency can be complex, especially if the data is spread across multiple sources or databases.

Best Practices
Evaluate the options to encrypt the data in transit and at rest with the ISV and other stakeholders, as this may require changes to the product. Understand the complexity of the data and its structure by doing a thorough analysis. Work with the ISV to find out whether there are proprietary tools for migrating the data, and obtain clearance for using the tool from the InfoSec organization; this is a longer process, so it is essential to address it very early in the migration life cycle. Plan for incremental data migration to ensure data integrity during cutover to the target.

Other Challenges
Containerization
Containerizing a COTS product is a popular solution, as the application can benefit from isolation, portability, scalability, and efficient utilization of resources. While the benefits are huge, this migration path is usually tricky because the ISVs may not have container images, and even if they agree to build a container image, they may not have the resources to maintain images on a continuous basis. So, it's necessary to understand these intricacies before proceeding with containerization.

Refactor
Refactoring the COTS application into a custom application is usually perceived as a project by itself, driven by a strong business case, and it involves considerable cost, time, and manpower. This could lead to build-vs-buy decisions or even buying and building (customizations). It's advised to take this route only when you have in-house knowledge of the application.

Conclusion
Every migration provides a lot of learnings and insights to carry forward into successive migrations. Based on our migration experiences for various enterprises across industries, we have shared the challenges and the mitigations that have helped us overcome them. As mentioned earlier, while the number of COTS applications varies across enterprises and industries, the challenges are similar. Addressing these challenges early in the migration cycle will help in the cloud journey while ensuring maximum benefits from the cloud for your application portfolio.
The Advantages of Elastic APM for Observing the Tested Environment
My first use of the Elastic Application Performance Monitoring (Elastic APM) solution dates back to 2019, on microservices-based projects for which I was responsible for performance testing. At that time, the first versions of Elastic APM were released. I was attracted by the easy installation of agents, the numerous protocols supported by the Java agent (see Elastic supported technologies), including the Apache HttpClient used in JMeter, the other supported languages (Go, .NET, Node.js, PHP, Python, Ruby), and the quality of the APM dashboards in Kibana. I found the information displayed in the Kibana APM dashboards to be relevant and not too verbose. The Java agent monitoring is simple but displays essential information on the machine's OS and JVM. The open-source aspect and the free access to the tool's main functions were also decisive. I generalized the use of the Elastic APM solution in performance environments for all projects. With Elastic APM, I have the timelines of the different calls and exchanges between web services, the SQL queries executed, the JMS message exchanges, and monitoring. I also have quick access to errors or exceptions thrown in Java applications.

Why Integrate Elastic APM in Apache JMeter
By adding Java APM agents to web applications, we get the timelines of the called services in the Kibana dashboards. However, we remain mainly at the level of individual REST API calls, because we do not have the notion of a page. For example, page PAGE01 will make the following API calls:
/rest/service1
/rest/service2
/rest/service3
On another page, PAGE02 will make the following calls:
/rest/service2
/rest/service4
/rest/service5
/rest/service6
The third page, PAGE03, will make the following calls:
/rest/service1
/rest/service2
/rest/service4
In this example, service2 is called on 3 different pages and service4 on 2 pages. If we look in the Kibana dashboard for service2, we will find the union of the calls coming from the 3 pages, but we don't have the notion of a page. We cannot answer "On this page, what is the breakdown of time across the different REST calls?", yet for a user of the application, the notion of page response time is important. The goal of the jmeter-elastic-apm tool is to carry the notion of a page that already exists in JMeter, in the Transaction Controller, over to Elastic APM. This starts in JMeter by creating an APM transaction and then propagating this transaction identifier (traceparent) with the Elastic agent to the HTTP REST requests to web services, because the APM agent recognizes the Apache HttpClient library and can instrument it. In the HTTP request, the APM agent adds the identifier of the APM transaction to the header of the HTTP request. The headers added are traceparent and elastic-apm-traceparent. We start from the notion of the page in JMeter (Transaction Controller) and go down to the HTTP calls of the web application (gestdoc) hosted in Tomcat. In the case of an application composed of multiple web services, we will see in the timeline the different web services called over HTTP(s) or JMS and the time spent in each web service. This is an example of a technical architecture for a performance test with Apache JMeter and the Elastic APM Agent to test a web application hosted in Apache Tomcat.

How the jmeter-elastic-apm Tool Works
jmeter-elastic-apm adds Groovy code before a JMeter Transaction Controller to create an APM transaction before a page.
In the JMeter Transaction Controller, we find HTTP samplers that make REST HTTP(s) calls to the services. The Elastic APM Agent automatically adds a new traceparent header containing the identifier of the APM transaction because it recognizes the Apache HttpClient of the HTTP sampler. The Groovy code then terminates the APM transaction to indicate the end of the page. The jmeter-elastic-apm tool automates the addition of Groovy code before and after the JMeter Transaction Controller. The jmeter-elastic-apm tool is open source on GitHub (see the link in the Conclusion section of this article). This JMeter script is simple, with 3 pages in 3 JMeter Transaction Controllers. After launching the jmeter-elastic-apm tool's ADD action, the JMeter Transaction Controllers are surrounded by Groovy code that creates an APM transaction before the JMeter Transaction Controller and closes the APM transaction after the JMeter Transaction Controller. In the "groovy begin transaction apm" sampler, the Groovy code calls the Elastic APM API (simplified version):

Groovy
Transaction transaction = ElasticApm.startTransaction();
Scope scope = transaction.activate();
transaction.setName(transactionName); // contains JMeter Transaction Controller Name

In the "groovy end transaction apm" sampler, the Groovy code calls the Elastic APM API (simplified version):

Groovy
transaction.end();

Configuring Apache JMeter With the Elastic APM Agent and the APM Library
Start Apache JMeter With the Elastic APM Agent and the Elastic APM API Library
1. Declare the Elastic APM Agent (URL to find the APM Agent): Add the Elastic APM Agent somewhere in the filesystem (it could be in <JMETER_HOME>\lib, but this is not mandatory). In <JMETER_HOME>\bin, modify jmeter.bat or setenv.bat and add the Elastic APM configuration like so:

Shell
set APM_SERVICE_NAME=yourServiceName
set APM_ENVIRONMENT=yourEnvironment
set APM_SERVER_URL=http://apm_host:8200
set JVM_ARGS=-javaagent:<PATH_TO_AGENT_APM_JAR>\elastic-apm-agent-<version>.jar -Delastic.apm.service_name=%APM_SERVICE_NAME% -Delastic.apm.environment=%APM_ENVIRONMENT% -Delastic.apm.server_urls=%APM_SERVER_URL%

2. Add the Elastic APM library: Add the Elastic APM API library as <JMETER_HOME>\lib\apm-agent-api-<version>.jar. This library is used by the JSR223 Groovy code. Use this URL to find the APM library.

Recommendations on the Impact of Adding Elastic APM in JMeter
The APM Agent will intercept and modify all HTTP sampler calls, and this information will be stored in Elasticsearch. It is preferable to voluntarily disable the HTTP requests for static elements (images, CSS, JavaScript, fonts, etc.), which can generate a large number of requests but are not very useful in analyzing the timeline. In the case of heavy load testing, it's recommended to change the elastic.apm.transaction_sample_rate parameter to sample only a portion of the calls so as not to saturate the APM Server and Elasticsearch. This elastic.apm.transaction_sample_rate parameter can be declared in <JMETER_HOME>\bin\jmeter.bat or setenv.bat, but also in a JSR223 sampler with a short piece of Groovy code in a setUp thread group. The following Groovy code records only 50% of samples:

Groovy
import co.elastic.apm.api.ElasticApm;
// update elastic.apm.transaction_sample_rate
ElasticApm.setConfig("transaction_sample_rate","0.5");

Conclusion
The jmeter-elastic-apm tool allows you to easily integrate the Elastic APM solution into JMeter and add the notion of a page in the timelines of Kibana APM dashboards.
Elastic APM + Apache JMeter is an excellent solution for understanding how the environment behaves during a performance test, with simple monitoring, quality dashboards, time-breakdown timelines across the different distributed application layers, and the display of exceptions in web services. Over time, the Elastic APM solution only gets better. I strongly recommend it, of course, in a performance testing context, but it also has many advantages in development environments used by developers or in integration environments used by functional or technical testers.

Links
Command Line Tool: jmeter-elastic-apm
JMeter plugin: elastic-apm-jmeter-plugin
Elastic APM Guides: APM Guide or Application performance monitoring (APM)
Motivation and Background
Why is it important to build interpretable AI models? The future of AI is in enabling humans and machines to work together to solve complex problems. Organizations are attempting to improve process efficiency and transparency by combining AI/ML technology with human review. In recent years, with the advancement of AI, AI-specific regulations have emerged, for example, Good Machine Learning Practices (GMLP) in pharma and Model Risk Management (MRM) in finance, along with other broad-spectrum regulations addressing data privacy, such as the EU's GDPR and California's CCPA. Similarly, internal compliance teams may also want to interpret a model's behavior when validating decisions based on model predictions. For instance, underwriters want to learn why a specific loan application was tagged as suspicious by an ML model.

Overview
What is interpretability? In the ML context, interpretability refers to tracing back which factors contributed to an ML model making a certain prediction. As shown in the graph below, simpler models are easier to interpret but often produce lower accuracy compared to complex models like deep learning and transformer-based models, which can capture non-linear relations in the data and often have high accuracy. Loosely defined, there are two types of explanations:
Global explanation: explains, at the overall model level, which features have contributed the most to the output. For example, in a finance setting where the use case is to build an ML model to identify customers who are most likely to default, some of the most influential features for making that decision are the customer's credit score, total number of credit cards, revolving balance, etc.
Local explanation: enables you to zoom in on a particular data point and observe the behavior of the model in that neighborhood. For example, in a sentiment classification of a movie review use case, certain words in the review may have a higher impact on the outcome than others, as in "I have never watched something as bad."

What is a transformer model? A transformer model is a neural network that tracks relationships in sequential input, such as the words in a sentence, to learn context and subsequent meaning. Transformer models use an evolving set of mathematical approaches, called attention or self-attention, to find minute relationships between even distant data elements in a series. Refer to Google's publication for more information.

Integrated Gradients
Integrated Gradients (IG) is an explainable AI technique introduced in the paper Axiomatic Attribution for Deep Networks. In this paper, an attempt is made to assign an attribution value to each input feature. This tells how much each input feature contributed to the final prediction. IG is a local method that is a popular interpretability technique due to its broad applicability to any differentiable model (e.g., text, image, structured data), ease of implementation, computational efficiency relative to alternative approaches, and theoretical justifications. Integrated gradients represent the integral of gradients with respect to the inputs along the path from a given baseline to the input; the integral can be approximated using a Riemann sum or the Gauss-Legendre quadrature rule. Formally, the integrated gradient along the i-th dimension of an input $x$ with baseline $x'$ is

$\mathrm{IntegratedGrads}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i}\, d\alpha$

where $F$ is the model and $\alpha$ is the scaling coefficient. The equation is from the original paper.
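To make the formula concrete, here is a small numerical sketch of the Riemann-sum approximation on a toy differentiable function (not the BERT model used later); the function, baseline, and step count are illustrative assumptions.

Python
# Minimal sketch: midpoint Riemann-sum approximation of Integrated Gradients
# for a toy model F(x) = x0^2 + 3*x1, whose gradient is known analytically.
import numpy as np

def grad_F(x):
    # Analytic gradient of F(x) = x[0]**2 + 3*x[1]
    return np.array([2.0 * x[0], 3.0])

def integrated_gradients(x, baseline, steps=50):
    # Average the gradients at points interpolated between the baseline and the input,
    # then scale by (x - baseline), following the IG formula above.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_F(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([2.0, 1.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline)
print(attributions)        # approximately [4.0, 3.0]
# Completeness check: the attributions sum to F(x) - F(baseline) = 7
print(attributions.sum())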
The cornerstones of this approach are two fundamental axioms, namely sensitivity and implementation invariance. More information can be found in the original paper.

Use Case
Now let's see in action how the Integrated Gradients method can be applied using the Captum package. We will be fine-tuning a question-answering BERT (Bidirectional Encoder Representations from Transformers) model on the SQuAD dataset using the transformers library from HuggingFace; review the notebook for a detailed walkthrough.

Steps
Load the tokenizer and pre-trained BERT model, in this case, bert-base-uncased.
Next, compute attributions with respect to the BertEmbeddings layer. To do so, define baselines/references and numericalize both the baselines and the inputs.

Python
def construct_whole_bert_embeddings(input_ids, ref_input_ids, \
                                    token_type_ids=None, ref_token_type_ids=None, \
                                    position_ids=None, ref_position_ids=None):
    input_embeddings = model.bert.embeddings(input_ids, token_type_ids=token_type_ids, position_ids=position_ids)
    ref_input_embeddings = model.bert.embeddings(ref_input_ids, token_type_ids=ref_token_type_ids, position_ids=ref_position_ids)
    return input_embeddings, ref_input_embeddings

Now, let's define the question-answer pair as an input to our BERT model:
question = "What is important to us?"
text = "It is important to us to include, empower and support humans of all kinds."
Generate the corresponding baselines/references for the question-answer pair.
The next step is to make predictions. One option is to use LayerIntegratedGradients and compute the attributions with respect to BertEmbeddings. LayerIntegratedGradients represents the integral of gradients with respect to the layer inputs/outputs along the straight-line path from the layer activations at the given baseline to the layer activations at the input.

Python
start_scores, end_scores = predict(input_ids, \
                                   token_type_ids=token_type_ids, \
                                   position_ids=position_ids, \
                                   attention_mask=attention_mask)

print('Question: ', question)
print('Predicted Answer: ', ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]))

lig = LayerIntegratedGradients(squad_pos_forward_func, model.bert.embeddings)

Output:
Question: What is important to us?
Predicted Answer: to include , em ##power and support humans of all kinds

Visualize the attributions for each word token in the input sequence using a helper function:

Python
# storing a couple of samples in an array for visualization purposes
start_position_vis = viz.VisualizationDataRecord(
    attributions_start_sum,
    torch.max(torch.softmax(start_scores[0], dim=0)),
    torch.argmax(start_scores),
    torch.argmax(start_scores),
    str(ground_truth_start_ind),
    attributions_start_sum.sum(),
    all_tokens,
    delta_start)

print('\033[1m', 'Visualizations For Start Position', '\033[0m')
viz.visualize_text([start_position_vis])

print('\033[1m', 'Visualizations For End Position', '\033[0m')
viz.visualize_text([end_position_vis])

From the results above, we can tell that for predicting the start position, our model focuses more on the question side, more specifically on the tokens 'what' and 'important'. It also has a slight focus on the token sequence 'to us' on the text side. In contrast, for predicting the end position, our model focuses more on the text side and has relatively high attribution on the last end-position token 'kinds'.
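For completeness, the attribution tensors referenced above (for example, attributions_start_sum) are produced by calling attribute() on the LayerIntegratedGradients instance and then summarizing across the embedding dimension. The following is a minimal sketch; the input, baseline, and additional forward arguments are assumed to come from the notebook's earlier setup.

Python
# Minimal sketch: computing and summarizing attributions with LayerIntegratedGradients.
# Assumes input_ids, ref_input_ids, token_type_ids, position_ids, attention_mask,
# squad_pos_forward_func, and lig are already defined as in the notebook.
import torch

attributions_start, delta_start = lig.attribute(
    inputs=input_ids,
    baselines=ref_input_ids,
    additional_forward_args=(token_type_ids, position_ids, attention_mask, 0),  # 0 = start position (assumed)
    return_convergence_delta=True,
)

def summarize_attributions(attributions):
    # Sum over the embedding dimension and normalize to unit L2 norm
    attributions = attributions.sum(dim=-1).squeeze(0)
    return attributions / torch.norm(attributions)

attributions_start_sum = summarize_attributions(attributions_start)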
Conclusion This blog describes how explainable AI techniques like Integrated Gradients can be used to make a deep learning NLP model interpretable by highlighting positive and negative word influences on the outcome of the model. References Axiomatic Attribution for Deep Networks Model Interpretability for PyTorch Towards Better Understanding of Gradient-Based Attribution Methods for Deep Neural Networks
The world of telecom is evolving at a rapid pace, and it is crucial for operators to stay ahead of the game. As 5G technology becomes the norm, transitioning seamlessly from 4G technology (which operates on an OpenStack cloud) to 5G technology (which uses Kubernetes) is a strategic imperative. In the current scenario, operators invest in multiple vendor-specific monitoring tools, leading to higher costs and less efficient operations. However, in the upcoming 5G world, operators can adopt a unified monitoring and alert system for all their products. A single system that monitors network equipment, customer devices, and service platforms offers a holistic view of the entire system, thereby reducing complexity and enhancing efficiency. By adopting a Prometheus-based monitoring and alert system, operators can streamline operations, reduce costs, and enhance customer experience. With a single monitoring system, operators can monitor their entire 5G system seamlessly, ensuring optimal performance and avoiding disruptions. This practical solution eliminates the need for a complete overhaul and offers a cost-effective transition. Let's dive deep.

Prometheus, Grafana, and Alert Manager
Prometheus is a monitoring and alerting tool that uses a pull-based model. It scrapes, collects, and stores Key Performance Indicators (KPIs) with labels and timestamps, collecting metrics from targets, which in the 5G telecom world are the network functions' namespaces.
Grafana is a dynamic web application that offers a wide range of functionality. It visualizes data, allowing operators to build the charts, graphs, and dashboards that a 5G telecom operator wants to see. Its primary feature is support for multiple graphing and dashboarding modes through a GUI (graphical user interface). Grafana can seamlessly integrate data collected by Prometheus, making it an indispensable tool for telecom operators. It supports the integration of different data sources into one dashboard, enabling continuous monitoring. This versatility improves response rates by alerting the telecom operator's team when an incident emerges, ensuring minimal 5G network function downtime.
The Alert Manager is a crucial component that manages alerts from the Prometheus server via alerting rules. It handles the received alerts, including silencing and inhibiting them and sending out notifications via email or chat. The Alert Manager also de-duplicates and groups alerts and routes them to a centralized webhook receiver, making it a must-have tool for any telecom operator.

Architectural Diagram
Prometheus
Components of Prometheus (Specific to a 5G Telecom Operator)
Core component: The Prometheus server scrapes HTTP endpoints and stores the data as time series. The Prometheus server, a crucial component in the 5G telecom world, collects metrics from the Prometheus targets; in our context, these targets are the Kubernetes clusters that house the 5G network functions.
Time series database (TSDB): Prometheus stores telecom metrics as time series data.
HTTP server: API to query data stored in the TSDB; the Grafana dashboard can query this data for visualization.
Telecom operator-specific client libraries (5G) for instrumenting application code, as in the sketch below.
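As an illustration of such instrumentation, here is a minimal sketch using the prometheus_client Python library; the metric names, labels, port, and values are illustrative assumptions, not an actual 5G network function.

Python
# Minimal sketch: exposing KPIs from a hypothetical network function with prometheus_client.
# Metric names, labels, port, and values are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

ACTIVE_SESSIONS = Gauge(
    "smf_active_sessions", "Currently active PDU sessions", ["namespace"]
)
REQUESTS_TOTAL = Counter(
    "smf_requests_total", "Total session requests handled", ["namespace", "result"]
)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000 for Prometheus to scrape
    while True:
        ACTIVE_SESSIONS.labels(namespace="smf-ns").set(random.randint(100, 200))
        REQUESTS_TOTAL.labels(namespace="smf-ns", result="success").inc()
        time.sleep(5)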
Push gateway: A scrape target for short-lived jobs. Service Discovery: In the world of 5G, network function pods are constantly being added or deleted by Telecom operators to scale up or down. Prometheus's service discovery component tracks this ever-changing list of pods. Web UI: The Prometheus Web UI, accessible on port 9090, lets users view and analyze Prometheus data in an interactive way, enhancing the monitoring capabilities of 5G telecom operators. Alert Manager: The Alert Manager handles alerts. When alerting conditions are met, Prometheus fires alerts to the Alert Manager, which sends them through channels such as email or messenger, ensuring timely and effective communication of critical issues. Grafana: Used for dashboard visualization (the actual graphs). With these components, a Telecom operator's 5G network functions are monitored closely: resource utilization, performance, availability errors, and more. Prometheus provides the tools needed to keep the network running smoothly and efficiently. Prometheus Features A multi-dimensional data model in which time series are identified by metric name and key/value labels, PromQL (Prometheus Query Language) as the query language, and an HTTP pull model. Telecom operators can discover 5G network functions through service discovery or static configuration, and multiple modes of dashboard and GUI support provide a comprehensive, customizable experience. Prometheus Remote Write to Central Prometheus from Network Functions 5G Operators will have multiple network functions from various vendors, such as SMF (Session Management Function), UPF (User Plane Function), AMF (Access and Mobility Management Function), PCF (Policy Control Function), and UDM (Unified Data Management). Maintaining separate Prometheus/Grafana dashboards for each network function makes monitoring complex and inefficient. To address this, it is recommended that the data/metrics from each individual Prometheus be consolidated into a single Central Prometheus, simplifying the monitoring process and enhancing efficiency. The 5G network operator can then monitor all the data in one centralized location, with a comprehensive view of the network's performance. Grafana Grafana Features Panels: Panels let operators visualize Telecom 5G data in many ways, including histograms, graphs, maps, and KPIs, through a versatile and adaptable interface for data representation. Plugins: Plugins render Telecom 5G data in real time via user-friendly APIs (Application Programming Interfaces), so operators always have accurate, up-to-date data at their fingertips; operators can also create data source plugins to retrieve metrics from any API. Transformations: Transformations let you adapt, summarize, combine, and calculate KPI metrics across 5G network function data sources, providing the tools to manipulate and analyze your data effectively.
Annotations: Rich events from different Telecom 5G network function data sources are used to annotate metrics-based graphs. Panel editor: A reliable and consistent graphical user interface for configuring and customizing 5G telecom metrics panels. Grafana Sample Dashboard GUI for 5G Alert Manager Alert Manager Components The Ingester ingests all alerts, while the Grouper groups them into categories. The De-duplicator prevents repetitive alerts, ensuring you're not bombarded with notifications. The Silencer mutes alerts based on a label, and the Throttler regulates the frequency of alerts. Finally, the Notifier ensures that third parties are notified promptly. Alert Manager Functionalities Grouping: Grouping collects similar alerts into a single notification. This is helpful during larger outages when many 5G network functions fail and their alerts fire at the same time; the telecom operator gets only a single page while still being able to see the exact service instances affected. Inhibition: Inhibition suppresses the notification for specific low-priority alerts if certain major/critical alerts are already firing. For example, when a critical alert fires, indicating that an entire 5G SMF (Session Management Function) cluster is not reachable, the Alert Manager can mute all other minor/warning alerts concerning this cluster. Silences: Silences simply mute alerts for a given time. Incoming alerts are checked against the matchers (including regular expressions) of each active silence; if they match, no notifications are sent out for that alert. High availability: Telecom operators will not load balance traffic between Prometheus and all its Alert Managers; instead, they will point Prometheus to a list of all Alert Managers. Dashboard Visualization The Grafana dashboard visualizes the Alert Manager webhook traffic notifications. Configuration YAMLs (Yet Another Markup Language) Telecom Operators can install and run Prometheus using the configuration below: YAML prometheus: enabled: true route: enabled: {} nameOverride: Prometheus tls: enabled: true certificatesSecret: backstage-prometheus-certs certFilename: tls.crt certKeyFilename: tls.key volumePermissions: enabled: true initdbScriptsSecret: backstage-prometheus-initdb prometheusSpec: retention: 3d replicas: 2 prometheusExternalLabelName: prometheus_cluster image: repository: <5G operator image repository for Prometheus> tag: <Version example v2.39.1> sha: "" podAntiAffinity: "hard" securityContext: null resources: limits: cpu: 1 memory: 2Gi requests: cpu: 500m memory: 1Gi serviceMonitorNamespaceSelector: matchExpressions: - {key: namespace, operator: In, values: [<Network function 1 namespace>, <Network function 2 namespace>]} serviceMonitorSelectorNilUsesHelmValues: false podMonitorSelectorNilUsesHelmValues: false ruleSelectorNilUsesHelmValues: false Next is the configuration to segregate scraped data by namespace and route it to the Central Prometheus. Note: The below configuration can be appended to the Prometheus section in the above installation YAML.
YAML remoteWrite: - url: <Central Prometheus URL for namespace 1 by 5G operator> basicAuth: username: name: <secret username for namespace 1> key: username password: name: <secret password for namespace 1> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 1> action: keep - url: <Central Prometheus URL for namespace 2 by 5G operator> basicAuth: username: name: <secret username for namespace 2> key: username password: name: <secret password for namespace 2> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 2> action: keep Telecom Operators can install and run Grafana using the configuration below. YAML grafana: replicas: 2 affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: "app.kubernetes.io/name" operator: In values: - Grafana topologyKey: "kubernetes.io/hostname" securityContext: false rbac: pspEnabled: false # Must be disabled due to tenant permissions namespaced: true adminPassword: admin image: repository: <artifactory>/Grafana tag: <version> sha: "" pullPolicy: IfNotPresent persistence: enabled: false initChownData: enabled: false sidecar: image: repository: <artifactory>/k8s-sidecar tag: <version> sha: "" imagePullPolicy: IfNotPresent resources: limits: cpu: 100m memory: 100Mi requests: cpu: 50m memory: 50Mi dashboards: enabled: true label: grafana_dashboard labelValue: "Vendor name" datasources: enabled: true defaultDatasourceEnabled: false additionalDataSources: - name: Prometheus type: Prometheus url: http://<prometheus-operated>:9090 access: proxy isDefault: true jsonData: timeInterval: 30s resources: limits: cpu: 400m memory: 512Mi requests: cpu: 50m memory: 206Mi extraContainers: - name: oauth-proxy image: <artifactory>/origin-oauth-proxy:<version> imagePullPolicy: IfNotPresent ports: - name: proxy-web containerPort: 4181 args: - --https-address=:4181 - --provider=openshift # Service account name here must be "<Helm Release name>-grafana" - --openshift-service-account=monitoring-grafana - --upstream=http://localhost:3000 - --tls-cert=/etc/tls/private/tls.crt - --tls-key=/etc/tls/private/tls.key - --cookie-secret=SECRET - --pass-basic-auth=false resources: limits: cpu: 100m memory: 256Mi requests: cpu: 50m memory: 128Mi volumeMounts: - mountPath: /etc/tls/private name: grafana-tls extraContainerVolumes: - name: grafana-tls secret: secretName: grafana-tls serviceAccount: annotations: "serviceaccounts.openshift.io/oauth-redirecturi.first": https://[SPK exposed IP for Grafana] service: targetPort: 4181 annotations: service.alpha.openshift.io/serving-cert-secret-name: <secret> Telecom Operators can install and run Alert Manager using the configuration below. YAML alertmanager: enabled: true alertmanagerSpec: image: repository: prometheus/alertmanager tag: <version> replicas: 2 podAntiAffinity: hard securityContext: null resources: requests: cpu: 25m memory: 200Mi limits: cpu: 100m memory: 400Mi containers: - name: config-reloader resources: requests: cpu: 10m memory: 10Mi limits: cpu: 25m memory: 50Mi Configuration to route Prometheus Alert Manager data to the Operator's centralized webhook receiver. Note: The below configuration can be appended to the Alert Manager mentioned in the above installation YAML. 
YAML config: global: resolve_timeout: 5m route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 12h receiver: 'null' routes: - receiver: '<Network function 1>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 1>" - receiver: '<Network function 2>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 2>" Conclusion The open-source OAM (Operation and Maintenance) tools Prometheus, Grafana, and Alert Manager can benefit 5G Telecom operators. Prometheus periodically captures the status of all monitored 5G Telecom network functions over the HTTP protocol, and any component can be connected to the monitoring as long as the 5G Telecom operator provides the corresponding HTTP interface. Prometheus and Grafana Agent give the 5G Telecom operator control over the metrics the operator wants to report; once the data is in Grafana, it can be stored in a Grafana database as extra data redundancy. In conclusion, Prometheus allows 5G Telecom operators to improve their operations and offer better customer service. Adopting a unified monitoring and alert system like Prometheus is one way to achieve this.
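As a closing illustration, here is the kind of response such an HTTP metrics interface typically returns: plain-text samples in the Prometheus exposition format, which the Prometheus server scrapes and stores. The metric and label names below are purely illustrative assumptions, not taken from any specific vendor's network function.
Plain Text
# HELP smf_pdu_sessions_active Number of active PDU sessions reported by the SMF
# TYPE smf_pdu_sessions_active gauge
smf_pdu_sessions_active{namespace="smf-ns",pod="smf-0"} 1243
# HELP upf_forwarded_bytes_total Total bytes forwarded by the UPF
# TYPE upf_forwarded_bytes_total counter
upf_forwarded_bytes_total{namespace="upf-ns",pod="upf-0"} 987654321
Any network function that exposes an endpoint in this format can be added as a scrape target through a ServiceMonitor or static configuration, and its samples flow into the pipeline described above.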
When ChatGPT became a global phenomenon, countless books, papers, and articles about AI (Artificial Intelligence) appeared, but most of them were heavy on theory and mathematics. The series of articles "Introduction to Artificial Intelligence with Code" is a compilation of the most fundamental aspects of AI for beginners, presented with a combination of theory and code (C#) to help readers gain a better understanding of the concepts and ideas discussed in these articles. In the first article of the series, we will introduce propositional logic. Theory: An Introduction to Propositional Logic The rules of logic provide precise meanings for propositions. These rules are used to distinguish between valid and invalid mathematical arguments. Alongside its significance in understanding mathematical reasoning, logic also has many applications in computer science, such as designing computer networks, programming, checking program correctness, and so on. Propositions are the building blocks of the logical edifice of propositional logic. A proposition is a statement that is either true or false but cannot be both true and false simultaneously. The truth value of a proposition (in propositional logic) is referred to as its logical value (true or false). Letters are used to symbolize propositions, much as letters represent variables in programming; the commonly used conventions for these letters are p, q, r, s, and so on. Many mathematical propositions are created by combining one or more propositions we already have. These new propositions are called compound propositions (denoted temporarily as F), and they are formed from existing propositions using logical operators. Some basic logical operators are AND, OR, and NOT. A classical application of logical operators in computer science is the design of logic gates. To check the truth value of a compound proposition, we need to apply the rules of logic and consider the truth values of the individual propositions along with the logical operators used. Coding: Checking the Truth Value of a Compound Proposition (F) We'll create a set of classes, all related by inheritance, that will allow us to obtain the output of any F from inputs defined a priori. Here is the first class: C# public abstract class F { public abstract bool Check(); public abstract IEnumerable<Prop> Props(); } The abstract F class states that all its descendants must implement a Boolean method Check() and an IEnumerable<Prop> method Props(). The former returns the evaluation of the compound proposition, and the latter returns the propositions contained within it. Because logical operators share some features, we'll create an abstract class to group these features and create a more concise, logical inheritance design. The Op class, which can be seen in the code below, will contain the similarities that every logical operator shares: C# public abstract class Op : F { public F P { get; set; } public F Q { get; set; } public Op(F p, F q) { P = p; Q = q; } public override IEnumerable<Prop> Props() { return P.Props().Concat(Q.Props()); } } The first logical operator, AND, is shown below: C# public class AND : Op { public AND(F p, F q) : base(p, q) { } public override bool Check() { return P.Check() && Q.Check(); } } The implementation of the AND class is pretty simple. It receives two arguments that it passes to its parent constructor, and the Check method merely returns the logical AND that is built into C#.
Very similar are the OR, NOT, and Prop classes, which are shown below: C# // OR class public class OR : Op { public OR(F p, F q) : base(p, q) { } public override bool Check() { return P.Check() || Q.Check(); } } // NOT class public class NOT : F { public F P { get; set; } public NOT(F p) { P = p; } public override bool Check() { return !P.Check(); } public override IEnumerable<Prop> Props() { return new List<Prop>(P.Props()); } } The Prop class is the one we use for representing propositions in compound propositions. It includes a truthValue property, which holds the truth value given to the proposition (true or false), and when the Props() method is called it returns a List<Prop> whose single element is itself: C# public class Prop : F { public bool truthValue { get; set; } public Prop(bool truthvalue) { truthValue = truthvalue; } public override bool Check() { return truthValue; } public override IEnumerable<Prop> Props() { return new List<Prop>() { this }; } } Creating and checking F = NOT(p) OR q: C# var p = new Prop(false); var q = new Prop(false); var f = new OR(new NOT(p), q); Console.WriteLine(f.Check()); p.truthValue = true; Console.WriteLine(f.Check()); The first check prints True (with p and q both false, NOT(p) OR q evaluates to true); after setting p to true, the second check prints False. Summary In this article, we introduced a basic logic, propositional logic, and described C# code for representing compound propositions (propositions, logical operators, and so on). In the next article, we'll introduce a very important logic that extends propositional logic: first-order logic.
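As a quick aside that goes beyond the original article, the Props() method makes it easy to enumerate every assignment of truth values and print a truth table for any compound proposition. The sketch below is an illustrative addition that assumes the classes defined above (plus using System and using System.Linq):
C#
public static class TruthTable
{
    // Prints every combination of truth values for the propositions in f,
    // together with the resulting evaluation of f.
    public static void Print(F f)
    {
        var props = f.Props().Distinct().ToList(); // unique Prop instances (by reference)
        int rows = 1 << props.Count;               // 2^n possible assignments
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < props.Count; j++)
                props[j].truthValue = (i & (1 << j)) != 0; // assign each proposition a value
            Console.WriteLine(string.Join(" ", props.Select(p => p.truthValue)) + " => " + f.Check());
        }
    }
}
For example, TruthTable.Print(new OR(new NOT(p), q)); prints all four rows for p and q, which is a handy way to confirm that NOT(p) OR q behaves like the implication "if p then q".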
Microsoft's SQL Server is a powerful RDBMS used across diverse industries for data storage, retrieval, and analysis. This article aims to help beginners understand SQL Server, from fundamental principles to advanced techniques, using real-world examples derived from nProbe data. nProbe is a well-known network traffic monitoring tool that offers comprehensive insights into network traffic patterns. Getting Started With SQL Server 1. Introduction to SQL Server SQL Server provides a comprehensive database management platform that integrates advanced analytics, robust security features, and extensive reporting capabilities. It offers support for a wide range of data types and functions, enabling efficient data management and analysis. 2. Installation Begin by installing SQL Server. Microsoft offers different editions, including Express, Standard, and Enterprise, to cater to varying needs. The Express edition is free and suitable for learning and small applications. Microsoft's documentation provides a step-by-step guide to installing SQL Server. 3. Basic SQL Operations Learn the fundamentals of SQL, including creating databases and tables and writing basic queries: Create database: `CREATE DATABASE TrafficData;` Create table: Define a table structure to store nProbe data: MS SQL CREATE TABLE NetworkTraffic ( ID INT IDENTITY(1,1) PRIMARY KEY, SourceIP VARCHAR(15), DestinationIP VARCHAR(15), Packets INT, Bytes BIGINT, Timestamp DATETIME ); Intermediate SQL Techniques 4. Data Manipulation Inserting Data To insert data into the `NetworkTraffic` table, you might collect information from various sources, such as network sensors or logs. MS SQL INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES ('10.0.0.1', '192.168.1.1', 150, 2048, '2023-10-01T14:30:00'); Batch insert to minimize the impact on database performance: MS SQL INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES ('10.0.0.2', '192.168.1.2', 50, 1024, '2023-10-01T15:00:00'), ('10.0.0.3', '192.168.1.3', 100, 1536, '2023-10-01T15:05:00'), ('10.0.0.4', '192.168.1.4', 200, 4096, '2023-10-01T15:10:00'); Updating Data You may need to update records as new data becomes available or corrections are necessary. For instance, updating the byte count for a particular traffic record: MS SQL UPDATE NetworkTraffic SET Bytes = 3072 WHERE ID = 1; Update multiple fields at once: MS SQL UPDATE NetworkTraffic SET Packets = 180, Bytes = 3072 WHERE SourceIP = '10.0.0.1' AND Timestamp = '2023-10-01T14:30:00'; Deleting Data Removing data is straightforward but should be handled with caution to avoid accidental data loss. MS SQL DELETE FROM NetworkTraffic WHERE Timestamp < '2023-01-01'; Conditional delete based on network traffic analysis: MS SQL DELETE FROM NetworkTraffic WHERE Bytes < 500 AND Timestamp BETWEEN '2023-01-01' AND '2023-06-01'; Querying Data Simple Queries: Retrieve basic information from your data set. MS SQL SELECT * FROM NetworkTraffic WHERE SourceIP = '10.0.0.1'; Select specific columns: MS SQL SELECT SourceIP, DestinationIP, Bytes FROM NetworkTraffic WHERE Bytes > 2000; Aggregate Functions Useful for summarizing or analyzing large data sets.
MS SQL SELECT AVG(Bytes), MAX(Bytes), MIN(Bytes) FROM NetworkTraffic WHERE Timestamp > '2023-01-01'; Grouping data for more detailed analysis: MS SQL SELECT SourceIP, AVG(Bytes) AS AvgBytes FROM NetworkTraffic GROUP BY SourceIP HAVING AVG(Bytes) > 1500; Join Operations In scenarios where you have multiple tables, joins are essential. Assume another table `IPDetails` that stores additional information about each IP. MS SQL SELECT n.SourceIP, n.DestinationIP, n.Bytes, i.Location FROM NetworkTraffic n JOIN IPDetails i ON n.SourceIP = i.IPAddress WHERE n.Bytes > 1000; Complex Queries Combining multiple SQL operations to extract in-depth insights. MS SQL SELECT SourceIP, SUM(Bytes) AS TotalBytes FROM NetworkTraffic WHERE Timestamp BETWEEN '2023-01-01' AND '2023-02-01' GROUP BY SourceIP ORDER BY TotalBytes DESC; Advanced SQL Server Features 5. Indexing for Performance Optimizing SQL Server performance through indexing and leveraging stored procedures for automation is critical for managing large databases efficiently. Here’s an in-depth look at both topics, with practical examples, particularly focusing on enhancing operations within a network traffic database like the one collected from nProbe. Why Indexing Matters Indexing is a strategy to speed up the retrieval of records from a database by reducing the number of disk accesses required when a query is processed. It is especially vital in databases with large volumes of data, where search operations can become increasingly slow. Types of Indexes Clustered indexes: Change the way records are stored in the database as they sort and store the data rows in the table based on their key values. Tables can have only one clustered index. Non-clustered indexes: Do not alter the physical order of the data, but create a logical ordering of the data rows and use pointers to physical rows; each table can have multiple non-clustered indexes. Example: Creating an Index on Network Traffic Data Suppose you frequently query the `NetworkTraffic` table to fetch records based on `SourceIP` and `Timestamp`. You can create a non-clustered index to speed up these queries: MS SQL CREATE NONCLUSTERED INDEX idx_networktraffic_sourceip ON NetworkTraffic (SourceIP, Timestamp); This index would particularly improve performance for queries that look up records by `SourceIP` and filter on `Timestamp`, as the index helps locate data quickly without scanning the entire table. Below are additional instructions on utilizing indexing effectively. 6. Stored Procedures and Automation Benefits of Using Stored Procedures Stored procedures help in encapsulating SQL code for reuse and automating routine operations. They enhance security, reduce network traffic, and improve performance by minimizing the amount of information sent to the server. Example: Creating a Stored Procedure Imagine you often need to insert new records into the `NetworkTraffic` table. 
A stored procedure that encapsulates the insert operation can simplify the addition of new records: MS SQL CREATE PROCEDURE AddNetworkTraffic @SourceIP VARCHAR(15), @DestinationIP VARCHAR(15), @Packets INT, @Bytes BIGINT, @Timestamp DATETIME AS BEGIN INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES (@SourceIP, @DestinationIP, @Packets, @Bytes, @Timestamp); END; Using the Stored Procedure To insert a new record, instead of writing a full insert query, you simply execute the stored procedure: MS SQL EXEC AddNetworkTraffic @SourceIP = '192.168.1.1', @DestinationIP = '10.0.0.1', @Packets = 100, @Bytes = 2048, @Timestamp = '2024-04-12T14:30:00'; Automation Example: Scheduled Tasks SQL Server Agent can be used to schedule the execution of stored procedures. For instance, you might want to run a procedure that cleans up old records every night: MS SQL CREATE PROCEDURE CleanupOldRecords AS BEGIN DELETE FROM NetworkTraffic WHERE Timestamp < DATEADD(month, -1, GETDATE()); END; You can schedule this procedure to run automatically at midnight every day using SQL Server Agent, ensuring that your database does not retain outdated records beyond a certain period. By implementing proper indexing strategies and utilizing stored procedures, you can significantly enhance the performance and maintainability of your SQL Server databases. These practices are particularly beneficial in environments where data volumes are large and efficiency is paramount, such as in managing network traffic data for IFC systems. 7. Performance Tuning and Optimization Performance tuning and optimization in SQL Server are critical aspects that involve a systematic review of database and system settings to improve the efficiency of your operations. Proper tuning not only enhances the speed and responsiveness of your database but also helps in managing resources more effectively, leading to cost savings and improved user satisfaction. Key Areas for Performance Tuning and Optimization 1. Query Optimization Optimize queries: The first step in performance tuning is to ensure that the queries are as efficient as possible. This includes selecting the appropriate columns, avoiding unnecessary calculations, and using joins effectively. Query profiling: SQL Server provides tools like SQL Server Profiler and Query Store that help identify slow-running queries and bottlenecks in your SQL statements. Example: Here’s how you can use the Query Store to find performance issues: MS SQL SELECT TOP 10 qt.query_sql_text, rs.avg_duration FROM sys.query_store_query_text AS qt JOIN sys.query_store_plan AS qp ON qt.query_text_id = qp.query_text_id JOIN sys.query_store_runtime_stats AS rs ON qp.plan_id = rs.plan_id ORDER BY rs.avg_duration DESC; 2. Index Management Review and adjust indexes: Regularly reviewing the usage and effectiveness of indexes is vital. Unused indexes should be dropped, and missing indexes should be added where significant performance gains can be made. Index maintenance: Rebuilding and reorganizing indexes can help in maintaining performance, especially in databases with heavy write operations. Example: Rebuild an index using T-SQL: MS SQL ALTER INDEX ALL ON dbo.YourTable REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = OFF); 3. Database Configuration and Maintenance Database settings: Adjust database settings such as recovery model, file configuration, and buffer management to optimize performance. 
Routine maintenance: Implement regular maintenance plans that include updating statistics, checking database integrity, and cleaning up old data. Example: Set up a maintenance plan in SQL Server Management Studio (SSMS) using the Maintenance Plan Wizard. 4. Hardware and Resource Optimization Hardware upgrades: Sometimes, the best way to achieve performance gains is through hardware upgrades, such as increasing memory, adding faster disks, or upgrading CPUs. Resource allocation: Ensure that the SQL Server has enough memory and CPU resources allocated, particularly in environments where the server hosts multiple applications. Example: Configure maximum server memory: MS SQL EXEC sp_configure 'max server memory', 4096; RECONFIGURE; 5. Monitoring and Alerts System monitoring: Continuous monitoring of system performance metrics is crucial. Tools like System Monitor (PerfMon) and Dynamic Management Views (DMVs) in SQL Server provide real-time data about system health. Alerts setup: Configure alerts for critical conditions, such as low disk space, high CPU usage, or blocking issues, to ensure that timely actions are taken. Example: Set up an alert in SQL Server Agent: MS SQL USE msdb ; GO EXEC dbo.sp_add_alert @name = N'High CPU Alert', @message_id = 0, @severity = 0, @enabled = 1, @delay_between_responses = 0, @include_event_description_in = 1, @notification_message = N'SQL Server CPU usage is high.', @performance_condition = N'SQLServer:SQL Statistics|Batch Requests/sec|_Total|>|1000', @job_id = N'00000000-1111-2222-3333-444444444444'; GO Performance tuning and optimization is an ongoing process, requiring regular adjustments and monitoring. By systematically addressing these key areas, you can ensure that your SQL Server environment is running efficiently, effectively supporting your organizational needs. Conclusion Mastering SQL Server is a journey that evolves with practice and experience. Starting from basic operations to leveraging advanced features, SQL Server provides a powerful toolset for managing and analyzing data. As your skills progress, you can handle larger datasets like those from nProbe, extracting valuable insights and improving your network's performance and security. For those looking to dive deeper, Microsoft offers extensive documentation and a community rich with resources to explore more complex SQL Server capabilities. Useful References nProbe SQL Server SQL server performance tuning
Failures in software systems are inevitable. How these failures are handled can significantly impact system performance, reliability, and the business's bottom line. In this post, I want to discuss the upside of failure: why you should seek failure, why failure is good, and why avoiding failure can reduce the reliability of your application. We will start with a discussion of fail-fast vs. fail-safe, which will take us to a second discussion about failures in general. As a side note, if you like the content of this and the other posts in this series, check out my Debugging book that covers this subject. If you have friends who are learning to code, I'd appreciate a reference to my Java Basics book. If you want to get back to Java after a while, check out my Java 8 to 21 book. Fail-Fast Fail-fast systems are designed to stop functioning immediately upon encountering an unexpected condition. This immediate failure helps catch errors early and makes debugging more straightforward. For example, in the world of programming languages, Java embodies this approach by throwing a NullPointerException instantly when a null value is dereferenced, stopping execution and making the error clear. This immediate response helps developers identify and address issues quickly, preventing them from becoming more serious. By catching and stopping errors early, fail-fast systems reduce the risk of cascading failures, where one error leads to others. This makes it easier to contain and resolve issues before they spread through the system, preserving overall stability. It is also easy to write unit and integration tests for fail-fast systems. This advantage is even more pronounced when we need to understand a test failure: fail-fast systems usually point directly at the problem in the error stack trace. However, fail-fast systems carry their own risks, particularly in production environments: Production disruptions: If a bug reaches production, it can cause immediate and significant disruptions, potentially impacting both system performance and the business's operations. Risk appetite: Fail-fast systems require a level of risk tolerance from both engineers and executives. They need to be prepared to handle and address failures quickly, often balancing this with potential business impacts. Fail-Safe Fail-safe systems take a different approach, aiming to recover and continue even in the face of unexpected conditions. This makes them particularly suited for uncertain or volatile environments. Microservices are a prime example of fail-safe systems, embracing resiliency through their architecture. Circuit breakers, both physical and software-based, disconnect failing functionality to prevent cascading failures, helping the system continue operating. Fail-safe design helps systems survive even harsh production environments, reducing the risk of catastrophic failure. This makes it particularly suited for mission-critical applications, such as hardware devices or aerospace systems, where smooth recovery from errors is crucial. However, fail-safe systems have downsides: Hidden errors: By attempting to recover from errors, fail-safe systems can delay the detection of issues, making them harder to trace and potentially leading to more severe cascading failures. Debugging challenges: The delayed surfacing of errors can complicate debugging, requiring more time and effort to find and resolve issues.
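To make the contrast concrete, here is a small illustrative Java sketch; the class and method names are hypothetical and not taken from any specific system. The same lookup is written twice: fail-fast, which surfaces the problem immediately, and fail-safe, which keeps serving at the cost of hiding it.
Java
import java.util.Map;
import java.util.Optional;

public class RateLookup {
    private final Map<String, Double> rates;

    public RateLookup(Map<String, Double> rates) {
        this.rates = rates;
    }

    // Fail-fast: an unknown plan is treated as a bug and surfaces immediately.
    public double rateOrThrow(String plan) {
        Double rate = rates.get(plan);
        if (rate == null) {
            throw new IllegalArgumentException("Unknown plan: " + plan);
        }
        return rate;
    }

    // Fail-safe: keep serving by falling back to a default value,
    // at the cost of hiding the underlying problem from the caller.
    public double rateOrDefault(String plan, double fallback) {
        return Optional.ofNullable(rates.get(plan)).orElse(fallback);
    }
}
Which variant is "right" depends on the layer: the fail-fast version is easier to test and debug, while the fail-safe version keeps the system running through a volatile dependency.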
Choosing Between Fail-Fast and Fail-Safe It's challenging to determine which approach is better, as both have their merits. Fail-fast systems offer immediate debugging, a lower risk of cascading failures, and quicker detection and resolution of bugs. This helps catch and fix issues early, preventing them from spreading. Fail-safe systems handle errors gracefully, making them better suited for mission-critical systems and volatile environments, where catastrophic failures can be devastating. Balancing Both To leverage the strengths of each approach, a balanced strategy can be effective: Fail-fast for local services: When invoking local services like databases, fail-fast can catch errors early, preventing cascading failures. Fail-safe for remote resources: When relying on remote resources, such as external web services, fail-safe can prevent disruptions from external failures. A balanced approach also requires clear and consistent implementation throughout coding, reviews, tooling, and testing processes, ensuring it is integrated seamlessly. Fail-fast integrates well with orchestration and observability; effectively, this moves the fail-safe aspect to the OPS layer instead of the developer layer. Consistent Layer Behavior This is where things get interesting. It isn't about choosing between fail-safe and fail-fast; it's about choosing the right layer for them. For example, if an error is handled in a deep layer using a fail-safe approach, it won't be noticed. This might be OK, but if that error has an adverse impact (performance, garbage data, corruption, security, etc.), then we will have a problem later on and won't have a clue where it came from. The right solution is to handle all errors in a single layer; in modern systems, the top layer is the OPS layer, and it makes the most sense. It can report the error to the engineers who are most qualified to deal with it, and it can also provide immediate mitigation such as restarting a service, allocating additional resources, or reverting a version. Retries Are Not Fail-Safe Recently, I was at a lecture where the speakers presented their updated cloud architecture. They chose to take a shortcut to microservices by using a framework that allows them to retry in the case of failure. Unfortunately, failure doesn't behave the way we would like; you can't eliminate it completely through testing alone. Retry isn't fail-safe. In fact, it can mean catastrophe. They tested their system and "it works," even in production. But if a catastrophic situation does occur, their retry mechanism can act as a denial-of-service attack against their own servers. The number of ways in which ad hoc architectures such as this can fail is mind-boggling. This is especially important once we redefine failures. Redefining Failure Failures in software systems aren't just about crashes. A crash can be seen as a simple and immediate failure, but there are more complex issues to consider. In fact, crashes in the age of containers are probably the best kind of failure: the system restarts seamlessly with barely an interruption. Data Corruption Data corruption is far more severe and insidious than a crash. It carries long-term consequences. Corrupted data can lead to security and reliability problems that are challenging to fix, requiring extensive reworking and potentially unrecoverable data.
Cloud computing has led to defensive programming techniques, like circuit breakers and retries, emphasizing comprehensive testing and logging to catch and handle failures gracefully. In a way, this environment has sent us backward in terms of quality. A fail-fast approach at the data level could stop such corruption from happening in the first place. Addressing a bug goes beyond a simple fix. It requires understanding its root cause and preventing recurrence, extending into comprehensive logging, testing, and process improvements. This ensures that the bug is fully addressed, reducing the chances of it recurring. Don't Fix the Bug If it's a bug in production, you should probably revert. Instantly reverting production should always be possible; if it isn't, that is something you should work on. Failures must be fully understood before a fix is undertaken. In my own companies, I often skipped that step due to pressure; in a small startup, that is forgivable. In larger companies, we need to understand the root cause. A culture of debriefing for bugs and production issues is essential. The fix should also include process mitigation that prevents similar issues from reaching production. Debugging Failure Fail-fast systems are much easier to debug. They have an inherently simpler architecture, and it is easier to pinpoint an issue to a specific area. It is crucial to throw exceptions even for minor violations (e.g., validations). This prevents the cascading types of bugs that prevail in loose systems. This should be further enforced by unit tests that verify the limits we define and verify that proper exceptions are thrown. Retries should be avoided in the code, as they make debugging exceptionally difficult; their proper place is in the OPS layer. To facilitate that further, timeouts should be short by default. Avoiding Cascading Failure Failure isn't something we can avoid, predict, or fully test against. The only thing we can do is soften the blow when a failure occurs. Often this "softening" is achieved by using long-running tests meant to replicate extreme conditions as much as possible, with the goal of finding our application's weak spots. This is rarely enough; robust systems need these tests to be revised often based on real production failures. A great example of a fail-safe would be a cache of REST responses that lets us keep working even when a service is down. Unfortunately, this can lead to complex niche issues such as cache poisoning, or a situation in which a banned user still has access because of the cache. Hybrid in Production Fail-safe is best applied only in production/staging and in the OPS layer. This reduces the number of changes between production and dev; we want them to be as similar as possible, yet it's still a change that can negatively impact production. However, the benefits are tremendous, as observability can get a clear picture of system failures. The discussion here is a bit colored by my more recent experience of building observable cloud architectures, but the same principle applies to any type of software, whether embedded or in the cloud. In such cases, we often choose to implement fail-safe in the code; in that case, I would suggest implementing it consistently and consciously in a specific layer. There's also a special case of libraries/frameworks that often provide inconsistent and badly documented behaviors in these situations. I myself am guilty of such inconsistency in some of my work. It's an easy mistake to make.
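To illustrate the fail-fast validation and unit-test enforcement described above, here is a minimal Java sketch; the class, field, and test names are hypothetical.
Java
public final class SessionRequest {
    private final int packetCount;

    public SessionRequest(int packetCount) {
        if (packetCount < 0) {
            // Fail fast: reject invalid input before it can propagate as garbage data.
            throw new IllegalArgumentException("packetCount must be >= 0, got " + packetCount);
        }
        this.packetCount = packetCount;
    }

    public int packetCount() {
        return packetCount;
    }
}
And a unit test (JUnit 5) that pins the limit we defined and verifies the proper exception is thrown:
Java
import static org.junit.jupiter.api.Assertions.assertThrows;
import org.junit.jupiter.api.Test;

class SessionRequestTest {
    @Test
    void rejectsNegativePacketCount() {
        assertThrows(IllegalArgumentException.class, () -> new SessionRequest(-1));
    }
}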
Final Word This is my last post in the theory-of-debugging series that's part of my book/course on debugging. We often think of debugging as the action we take when something fails; it isn't. Debugging starts the moment we write the first line of code. We make decisions that will impact the debugging process as we code; often, we're just unaware of these decisions until we get a failure. I hope this post and series will help you write code that is prepared for the unknown. Debugging, by its nature, deals with the unexpected. Tests can't help with that. But as I illustrated in my previous posts, there are many simple practices we can undertake that make it easier to prepare. This isn't a one-time process; it's an iterative one that requires re-evaluating the decisions we made as we encounter failures.