Organizations Look to Adopt Integrated IT GRC Solutions to Ward Off Cyberattacks, Survey Finds

IT Risk & Cyber Risk
29 July 21
BLOG ADMIN

3 min read

Introduction

Cyber risk has undoubtedly moved up the priority list and taken center stage in boardroom discussions, driven by the rapid pace of digital transformation and by amplified data dependency and interconnectedness. The COVID-19 pandemic and the resulting remote working environment have only aggravated the challenges for security teams as the entire workforce moved home, beyond the reach of the office firewall. In these unprecedented times, ensuring robust cyber defense infrastructure to protect critical assets is of paramount importance.

We recently conducted a survey to take the pulse of the current state of IT and cyber risk management programs at organizations. Here are the key takeaways from the survey:

Nearly half of the survey respondents (45%) identified a lack of visibility into cyber risks across the enterprise as the major challenge faced by their organization.
A majority of organizations (83%) still depend on basic office productivity software, knowledge management software, and point solutions for their cyber risk management requirements. The implementation of an integrated IT GRC solution is at a low level across industries.
Only 28% of the respondents said that their organization's cyber risk and compliance program is fully aligned with the broader enterprise risk and compliance management programs.
Nearly half of the respondents (45%) said that they changed their plans and approaches to cyber risk and compliance management and reprioritized their activities to contend with the pandemic-driven new operational landscape, while 33% of the participants said they deployed new tools and systems to enhance their efficiency.
41% of the respondents said that they are going to implement specific solutions in FY 2021 to ensure compliance with regulatory requirements and standards.
30% of the respondents said that they are interested in implementing a centralized cyber risk and compliance solution.

It is encouraging to see that switching to digitized and centralized GRC solutions is among the top priorities of organizations this year. These solutions can help improve risk visibility and foresight, facilitate continuous monitoring of IT and cyber controls, and streamline overall cyber risk and compliance management. Innovative features, such as support for mobility, real-time reporting, advanced risk analytics, regulatory notifications, and more, further assist executive management and board in quick and efficient decision-making.

“The ultimate goal isn’t to avoid cyber risk but rather transforming it into strategic advantage—because things can and will inevitably go wrong at some point. But if organizations build their cyber resilience—the ability to not just prevent cyberattacks but also minimizing the impact of security incidents and ensuring continued business operations in the aftermath of attacks—that’s when they can truly thrive and create business value,” an excerpt from the report reads.

Cyber Risk to Dominate Risk Strategies

Our flagship event, GRC Summit, was held recently and brought together the best in the industry to share risk management strategies and best practices, and how to build better governed, more risk-aware, compliant, and resilient enterprises that thrive on risk.

Unsurprisingly, cyber risk has emerged as one of the top risks faced by organizations today, and risk leaders believe that it will continue to dominate the risk strategies going forward. To that end, security experts discussed some of the key considerations for ensuring a robust cybersecurity program:

Aligning cyber strategy with business goals and objectives.
Positioning the CISO and security leaders at the right level so that they can better focus on their core responsibilities.
Ensuring that CISOs and the security team provide frequent updates on the cybersecurity posture to the board so that there is no communication gap.
Quantifying cyber risks for prioritizing risks and controls and determining how much to spend on each control.
Increasing transparency in employee communications so that employees have clarity not only on corporate policies but also on how a crisis is being managed.

The best-prepared organizations in the world today are those that use risk as their competitive advantage. Quantifying cyber risks in a manner that makes sense to the executive board and helps them make sound cybersecurity investment decisions is critical for organizations to thrive in today’s digital world. The Cyber Risk Quantification capability of MetricStream IT and Cyber Risk Management can make it considerably easier for organizations to quantify cyber risks in monetary terms, which can then be easily communicated to the top management and board.

To download the report, click here. To watch the summit, click here.

BLOG ADMIN

Read more about the latest happenings in the GRC universe. MetricStream experts share their valuable insights on how organizations can turn risk into a strategic advantage and thrive on risk.

Improving Platform Resiliency - Error Handling Approaches for Kafka-based Systems

Technology
16 July 21
Zishan Shaikh

11 min read

Introduction

Kafka is an open-source, real-time streaming messaging system built around the publish-subscribe model. In a service-oriented architecture, instead of subsystems establishing direct connections with each other, the producer subsystem communicates information via a distributed server. This server brokers the information, helps move an enormous number of messages with low latency and fault tolerance, and allows one or more consumers to consume these messages concurrently.

Kafka does an excellent job of providing fault tolerance and of ensuring that delivered messages are not lost, by partitioning, replicating, and distributing the data across multiple brokers.

In distributed systems, failures are inevitable, whether a DB connection failure, a network call failure, or an outage in a downstream dependency, especially in a microservices ecosystem.

Failure in Consumers

Multiple issues can occur on the consumer side. When implementing a Kafka consumer, the following scenarios need special handling:

Downstream Service or Data Store Failure

The consumer is not able to process the message because a downstream microservice API is unavailable or returns an error, or a DB it's trying to connect to is down or unresponsive.

This blog post discusses some of the error handling mechanisms that we implemented as a part of the MetricStream Platform to improve the robustness and resiliency of the Platform.

Data Format Changes or Event Version Incompatibility

The consumer expects the message payload to be in a certain format, but the producer has changed the format of the message (e.g., a required field has been removed), so the consumer is unable to deserialize the message sent by the producer.

Producer Failures

Unable to reach Kafka cluster

The producer may fail to push a message to a topic due to a network partition or unavailability of the Kafka cluster. In such cases there is a high chance of messages being lost, so we need a retry mechanism to avoid loss of data. The approach we take here is to store the message in a temporary secondary store (DB/cache) and then retry writing the messages from the secondary store to the main topic.

Problem with Simple Retries

Clogged processing

When we are required to process many messages in real time, repeatedly failed messages can clog processing. The worst offenders consistently exceed the retry limit, which also means that they take the longest and use the most resources. Without a success response, the Kafka consumer will not commit a new offset and the batches with these bad messages would be blocked, as they are re-consumed again and again.

Difficulty retrieving retry metadata

It can be cumbersome to obtain metadata on the retries, such as timestamps and the retry number. If requests continue to fail retry after retry, we want to collect these failures in a dead letter queue (DLQ) for visibility and diagnosis. A DLQ should allow listing (viewing the contents of the queue), purging (clearing those contents), and merging (reprocessing the dead-lettered messages), allowing comprehensive resolution for all failures affected by a shared issue.

[Figure: Improving Platform Resiliency 1]

Processing Retry Records in Separate Topics

To address the problem of blocked batches, we set up a distinct retry queue using a separately defined Kafka topic. Under this paradigm, when a consumer handler returns a failed response for a given message after a certain number of retries, the consumer publishes that message to its corresponding retry topic. The handler then returns true to the original consumer, which commits its offset.
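
The routing described above can be sketched as follows. This is a minimal in-memory illustration, not the platform's implementation: `InMemoryTopic`, `RetryRoutingConsumer`, and the `maxInlineAttempts` parameter are hypothetical names, and a plain list stands in for a real Kafka topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for a Kafka topic; not part of any Kafka API.
final class InMemoryTopic {
    final String name;
    final List<String> messages = new ArrayList<>();

    InMemoryTopic(String name) { this.name = name; }

    void publish(String message) { messages.add(message); }
}

final class RetryRoutingConsumer {
    private final Predicate<String> handler;   // returns true when processing succeeds
    private final InMemoryTopic retryTopic;
    private final int maxInlineAttempts;       // attempts made before handing off

    RetryRoutingConsumer(Predicate<String> handler, InMemoryTopic retryTopic, int maxInlineAttempts) {
        this.handler = handler;
        this.retryTopic = retryTopic;
        this.maxInlineAttempts = maxInlineAttempts;
    }

    /** Returns true when the original consumer may commit its offset. */
    boolean consume(String message) {
        for (int attempt = 1; attempt <= maxInlineAttempts; attempt++) {
            if (safeHandle(message)) {
                return true;                   // processed successfully
            }
        }
        // Still failing: park the message on the retry topic so the batch is not blocked.
        retryTopic.publish(message);
        return true;                           // offset can be committed either way
    }

    private boolean safeHandle(String message) {
        try {
            return handler.test(message);
        } catch (RuntimeException e) {
            return false;                      // treat an exception as a failed attempt
        }
    }
}
```

Because `consume` always returns true, a repeatedly failing message never holds up the offset of the main topic; it only adds one entry to the retry topic.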

Error Handling on The Producer Side

[Figure: Improving Platform Resiliency 2 (producer-side error handling diagram)]

Here are some scenarios in which the Producer API may be unable to send a message:

Kafka cluster itself is down and unavailable.
The Kafka producer configuration “acks” is set to “all” and some brokers are unavailable.
The topic (or broker) configuration “min.insync.replicas” is specified as 2 and only one broker is available. Used together, min.insync.replicas and acks allow you to enforce greater durability guarantees. A typical scenario would be to create a topic with a replication factor of 3, set min.insync.replicas to 2, and produce with acks of “all”. This will ensure that the producer raises an exception if a majority of replicas do not receive a write.
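
As a rough illustration of the typical scenario above, the settings can be written down as plain key/value properties. This sketch uses only `java.util.Properties`; the class name `DurabilitySettings`, the `writeAccepted` helper, and the `localhost:9092` address are assumptions for illustration. Note that `acks` is a producer setting, while `min.insync.replicas` is applied to the topic (or as a broker default).

```java
import java.util.Properties;

final class DurabilitySettings {
    static final short REPLICATION_FACTOR = 3;   // set when the topic is created

    /** Producer-side settings. */
    static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        p.setProperty("acks", "all");                         // wait for all in-sync replicas
        return p;
    }

    /** Topic-level settings (can also be set as a broker default). */
    static Properties topicProps() {
        Properties p = new Properties();
        p.setProperty("min.insync.replicas", "2");
        return p;
    }

    /**
     * Simplified model of the rule: with acks=all, a write is accepted only
     * while at least min.insync.replicas replicas are in sync.
     */
    static boolean writeAccepted(int inSyncReplicas) {
        int min = Integer.parseInt(topicProps().getProperty("min.insync.replicas"));
        return inSyncReplicas >= min;
    }
}
```

With these values, losing one of the three replicas still allows writes, while dropping to a single in-sync replica makes the producer's send fail, which is the case the retry mechanism below has to handle.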

Detailed Explanation of Producer Retry Mechanism

The approach to recover from the above errors involves building a retry mechanism within the producer to ensure that there is an auto-retry process to try and re-deliver messages and a dead-letter store to save messages that were undeliverable even after the auto-retry process.

The steps involved are (see diagram above):

1. Client invokes the Kafka client's producer API to push a message to the main topic (configured in the producer API).

2. If there is an exception thrown by Kafka while pushing the message to the topic, then we need a way of handling the error and managing the message in a way that we don't lose it (prevent data loss).

3. When there is an exception returned by Kafka, then the message will be written to a secondary store.

4. The retry policy defines three key things:

Number of retries: A positive integer that defines how many times the handler will try to send the message to the main topic. If the number of attempted retries exceeds this value, the message is pushed to the dead letter store, where it must then be processed manually, for example by a consumer added by the developer.
Back-off period: The delay between retries. This can be a fixed delay or a variable delay that grows after every retry. It is important for slowing down the rate of error processing, so that we don't spend too many resources on failed messages and healthy messages continue to be processed.
Recovery callback: If the developer wants to implement additional recovery logic, push the message to a persistent store, or just log the error, they can provide a recovery callback, which is called when all the retries are exhausted.

5. Based on the retry policy, the message stays in the secondary store (DB) and is retried until the max retry limit is reached. Once the max retry count is reached, the message is pushed to the dead letter store.

6. The retry consumer implemented internally as part of the framework will read the messages from the retry store and invoke the producer API to push the message to the main topic.

7. The retry will be done by a separate set of threads from a dedicated retry thread pool, which will not interfere with the main threads pushing the data to Kafka topic or consuming data from Kafka topics.

8. The error handling is controlled through a flag, which the producer can set at the API level, because certain messages (e.g., log messages) may not be as important as others, and we can tolerate losing them.

9. There will be a flag "enableRetry" which will be enabled by default; this can be set at the producer API level to enable/disable error handling.
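
The steps above can be sketched in a few dozen lines. This is a simplified, single-threaded illustration under stated assumptions: `RetryingProducer`, `Sender`, `retryPass`, and `pendingCount` are hypothetical names, in-memory collections stand in for the secondary store and dead letter store, and the dedicated retry thread pool from step 7 is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

final class RetryingProducer {
    /** Hypothetical abstraction over the real producer call; throws on failure. */
    interface Sender { void send(String message) throws Exception; }

    private static final class Pending {
        final String message;
        int attempts;
        Pending(String message) { this.message = message; }
    }

    private final Sender sender;
    private final int retryCount;                      // number of retries
    private final long backoffMs;                      // fixed back-off period
    private final Consumer<String> recoveryCallback;   // called when retries are exhausted
    private final Deque<Pending> secondaryStore = new ArrayDeque<>();
    final List<String> deadLetterStore = new ArrayList<>();

    RetryingProducer(Sender sender, int retryCount, long backoffMs, Consumer<String> recoveryCallback) {
        this.sender = sender;
        this.retryCount = retryCount;
        this.backoffMs = backoffMs;
        this.recoveryCallback = recoveryCallback;
    }

    /** Steps 1-3: try the main topic; on failure, park the message unless retry is disabled. */
    void send(String message, boolean enableRetry) {
        try {
            sender.send(message);
        } catch (Exception e) {
            if (enableRetry) {
                secondaryStore.add(new Pending(message));
            }
            // With enableRetry=false the message is dropped (acceptable for e.g. log messages).
        }
    }

    /** Steps 5-6: one pass of the retry consumer over the secondary store. */
    void retryPass() {
        int batch = secondaryStore.size();
        for (int i = 0; i < batch; i++) {
            Pending p = secondaryStore.poll();
            pause(backoffMs);
            p.attempts++;
            try {
                sender.send(p.message);
            } catch (Exception e) {
                if (p.attempts >= retryCount) {
                    deadLetterStore.add(p.message);       // retries exhausted
                    recoveryCallback.accept(p.message);
                } else {
                    secondaryStore.add(p);                // keep for the next pass
                }
            }
        }
    }

    int pendingCount() { return secondaryStore.size(); }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real deployment `retryPass` would be scheduled on its own thread pool so that retries never compete with the threads producing to or consuming from the main topic.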

Error Handling on The Consumer Side

[Figure: Improving Platform Resiliency 3 (consumer-side error handling diagram)]

Some of the scenarios where the consumer process could run into errors are:

Errors may occur in the consumer while processing the record received from the topic this consumer is listening on. Exceptions could be of any type e.g., IO Exception due to DB connectivity failure or error writing to a file.
The error handling/retry mechanism provided in the diagram above will prevent every consumer from implementing their own business logic for handling errors and will provide a standard retry mechanism for every consumer extending from the framework.

The following steps, shown in the figure above, can be followed to retry after a failure on the consumer end and improve resiliency:

1. Kafka consumer listener in Service A tries to consume an event/message from the main topic.

2. The consumer in Service A has a dependency on another service e.g., Service B or a data store to complete the processing, e.g., it may try to invoke another API on a microservice to fetch or update some data.

3. An exception is thrown while making a call to the microservice (i.e., Service B), either due to a network failure or because the service itself fails internally (e.g., Internal Server Error or Service Unavailable).

3.1. If retry is enabled, then on exception, the event is pushed to a retry topic.
3.2. The retry consumer will consume the event from the retry topic and try to re-process the event after some delay.
3.3. If the retry consumer is unable to process the event after repeated retries and reaches the max retries, then it will push the event to the dead letter topic.

Detailed Explanation of the Consumer Retry Mechanism

The consumer can fail while processing an invalid record or due to some runtime error, which could occur due to a failed connection to DB or failed network call to another microservice.
The framework shown in the above diagram provides a configurable error handling and retry mechanism, so that the consumers don't have to handle errors explicitly.
The consumer will be able to configure the exception or set of exceptions for which it wants to retry the message delivery.
The consumer can configure the retry policy i.e., the number of times the message processing should be retried, along with back-off period, which will add a delay to every retry of fixed or variable interval based on the configuration.
When there is an exception in the consumer service, the Kafka consumer handler will trigger a call to retry the message delivery in a separate retry thread, which is part of a separate thread pool, not interfering with the main thread pool.
The retry thread will try sending the message by invoking the producer API at certain intervals based on the retry policy.
It is important not to simply re-attempt failed requests immediately one after the other; doing so will amplify the number of calls, spamming bad requests. Rather, each subsequent level of retry consumers can enforce a processing delay, in other words, a timeout that increases as a message steps down through each retry topic.
The delay/backoff is applied while consuming the message from the retry topic: the consumer delays processing of the message, so the record remains in the retry topic and is not held in memory. Once the delay interval elapses, the consumer reads the message from the topic and invokes the producer API.
Once the total number of allowed retries is exhausted, the message will be pushed to the dead letter topic.
The retry count is maintained in the message itself, as part of the message header. For every retry, the count is updated in the header, which is how we know the exact retry count. Similarly, the backoff period can be maintained in the message header to track the backoff of each event when using a strategy such as exponential backoff, in which the backoff increases with every retry.
The retry handler will not try to send the message again once the number of retries is exhausted.
It may optionally implement a circuit breaker as well, i.e., if the consumer keeps failing for an extended period, it can stop the main thread consuming from the main topic.
Optionally, we may provide a recovery call-back handler to allow the developer to implement any specific business logic for error handling e.g., storing the errors to the DB or logging the errors.
Separating the main consumer from the retry thread is designed to avoid blocking the main consumer thread, so that it can keep processing regular, valid events. Otherwise, the main thread would spend its time on retry processing, preventing other valid messages from being processed until the retries complete. Some events may be erroneous and fail on every retry, which would starve the main consumer threads whose job is to process valid events. The retry mechanism therefore pushes the invalid or failed event to another topic (the retry topic), and a separate retry consumer thread is created whose only job is to process failed events. The main thread and the retry thread can thus run in parallel and process records without starving each other.
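
To make the header-based bookkeeping concrete, here is a minimal sketch of how a failed message could be routed and delayed. The names are assumptions for illustration only: `ConsumerRetryRouter`, the `Message` shape, and the `retry-count` and `retry-not-before-ms` header names are not defined by Kafka or by the framework described here, and the clock is passed in as a parameter to keep the logic easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

final class ConsumerRetryRouter {
    // Header names are assumptions for this sketch, not Kafka-defined headers.
    static final String RETRY_COUNT = "retry-count";
    static final String NOT_BEFORE = "retry-not-before-ms";

    /** Hypothetical message shape: a payload plus string headers. */
    static final class Message {
        final String payload;
        final Map<String, String> headers = new HashMap<>();
        Message(String payload) { this.payload = payload; }
    }

    private final String retryTopic;
    private final String deadLetterTopic;
    private final int maxRetries;
    private final long backoffMs;

    ConsumerRetryRouter(String retryTopic, String deadLetterTopic, int maxRetries, long backoffMs) {
        this.retryTopic = retryTopic;
        this.deadLetterTopic = deadLetterTopic;
        this.maxRetries = maxRetries;
        this.backoffMs = backoffMs;
    }

    /** Reads the retry count carried in the message header (0 if absent). */
    static int retryCount(Message m) {
        return Integer.parseInt(m.headers.getOrDefault(RETRY_COUNT, "0"));
    }

    /** Called when processing fails: returns the topic the message should be published to. */
    String routeAfterFailure(Message m, long nowMs) {
        int count = retryCount(m);
        if (count >= maxRetries) {
            return deadLetterTopic;                        // retries exhausted
        }
        m.headers.put(RETRY_COUNT, String.valueOf(count + 1));
        m.headers.put(NOT_BEFORE, String.valueOf(nowMs + backoffMs));
        return retryTopic;
    }

    /** The retry consumer leaves the record on the topic until its delay has elapsed. */
    static boolean isDue(Message m, long nowMs) {
        return nowMs >= Long.parseLong(m.headers.getOrDefault(NOT_BEFORE, "0"));
    }
}
```

Keeping the count and the earliest retry time in the headers means any retry consumer instance can pick up the message and make the same decision without shared state.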

Dead Letter Queue/Topic

If a consumer of the retry topic still does not return success after completing the configured number of retries, then it will publish that message to the dead letter topic.
From there, several techniques can be employed for listing, purging, and merging from the topic, such as creating a command-line tool backed by its own consumer that uses offset tracking.
Dead letter messages are merged to re-enter processing by being published back into the first retry topic. This way, they remain separate from, and are unable to impede, live traffic.
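
The three operations can be illustrated with a small in-memory model. This is only a sketch of the semantics: `DeadLetterQueue` is a hypothetical class, and plain lists stand in for the dead letter topic and the first retry topic rather than a consumer with offset tracking.

```java
import java.util.ArrayList;
import java.util.List;

final class DeadLetterQueue {
    private final List<String> deadLetters = new ArrayList<>();
    private final List<String> firstRetryTopic;   // stand-in for the first retry topic

    DeadLetterQueue(List<String> firstRetryTopic) {
        this.firstRetryTopic = firstRetryTopic;
    }

    void add(String message) { deadLetters.add(message); }

    /** Listing: view the current contents without changing them. */
    List<String> list() { return new ArrayList<>(deadLetters); }

    /** Purging: discard all dead-lettered messages; returns how many were removed. */
    int purge() {
        int removed = deadLetters.size();
        deadLetters.clear();
        return removed;
    }

    /** Merging: republish everything to the first retry topic, away from live traffic. */
    int merge() {
        int merged = deadLetters.size();
        firstRetryTopic.addAll(deadLetters);
        deadLetters.clear();
        return merged;
    }
}
```

Merging into the retry topic rather than the main topic is what keeps reprocessed dead letters from impeding live traffic.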

Simple Flow Diagram: Explaining Consumer Retry Mechanism

[Figure: Improving Platform Resiliency 4 (flow diagram of the consumer retry mechanism)]

Additional Considerations

Naming Convention for topics

Valid Characters for Kafka topics:

ASCII alphanumerics, ‘.’, ‘_’, and ‘-’ (a-z, A-Z, 0-9, . (dot), _ (underscore), and - (dash))

Max Allowed Topic Name Length:

The topic name can be up to 249 characters in length

Main topics convention:

<namespace/organisation prefix>.<product/module/package>.<event-type>
<namespace/organisation prefix>.<product/module/package>.<data-type>.<event-type>

Example: Topic Naming Convention for Orders Created. The topic name is constructed as follows:

Namespace/Organisation Prefix – org
Product – orders
Event type – created

Resulting Topic Name: org.orders.created
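
The naming rules and the convention can be checked with a small helper. This is a sketch, with `TopicNames`, `isValid`, and `build` as hypothetical names; the length cap of 249 characters and the rejection of the names "." and ".." reflect the limits enforced by current Kafka releases.

```java
import java.util.regex.Pattern;

final class TopicNames {
    // ASCII alphanumerics plus dot, underscore, and dash.
    private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9._-]+");
    static final int MAX_LENGTH = 249;

    static boolean isValid(String name) {
        return name != null
                && !name.equals(".")
                && !name.equals("..")
                && name.length() <= MAX_LENGTH
                && LEGAL.matcher(name).matches();
    }

    /** Builds <prefix>.<product>.<event-type> and validates the result. */
    static String build(String prefix, String product, String eventType) {
        String name = String.join(".", prefix, product, eventType);
        if (!isValid(name)) {
            throw new IllegalArgumentException("Invalid topic name: " + name);
        }
        return name;
    }
}
```

For the example above, `TopicNames.build("org", "orders", "created")` yields `org.orders.created`, while a name containing a slash or a space is rejected.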

Retry Policy Parameters

Back-off period (in milliseconds; the fixed delay between retries) - Default: 3000 ms
Retry count (numeric value; the number of times a message should be retried before retries are exhausted, after which it is moved to the dead letter topic) - Default: 3

Example of retry policy configuration in KafkaListener annotation:

@KafkaListener(name = "workflow", topics = "forms", group = "workflow", retryPolicy = @SimpleRetryPolicy(retryBackoffMs = 5000, retryCount = 2, exceptions = {KafkaConsumerException.class, IOException.class}))

Other considerations (other retry policy types)

The following mechanisms can optionally be added to the producer/consumer retry policy.

Circuit Breaker Retry Policy: Trips circuit open after a given number of failures and stays open until a set timeout elapses.
Exponential Backoff Policy: Increases back off period exponentially. The initial interval and multiplier are configurable.
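
Both optional policies are small enough to sketch directly. The code below is an illustrative approximation, not the framework's API: `BackoffAndBreaker`, `exponentialBackoffMs`, and `CircuitBreaker` are hypothetical names, the breaker counts consecutive failures, and the current time is passed in explicitly instead of being read from the system clock.

```java
final class BackoffAndBreaker {

    /** Delay before retry number `attempt` (1-based): initial * multiplier^(attempt - 1). */
    static long exponentialBackoffMs(long initialMs, double multiplier, int attempt) {
        return (long) (initialMs * Math.pow(multiplier, attempt - 1));
    }

    /** Opens after a given number of consecutive failures; stays open until the timeout elapses. */
    static final class CircuitBreaker {
        private final int failureThreshold;
        private final long openTimeoutMs;
        private int consecutiveFailures;
        private long openedAtMs = -1;          // -1 means the circuit is closed

        CircuitBreaker(int failureThreshold, long openTimeoutMs) {
            this.failureThreshold = failureThreshold;
            this.openTimeoutMs = openTimeoutMs;
        }

        boolean allowRequest(long nowMs) {
            if (openedAtMs < 0) {
                return true;                   // closed: let calls through
            }
            if (nowMs - openedAtMs >= openTimeoutMs) {
                openedAtMs = -1;               // timeout elapsed: close again
                consecutiveFailures = 0;
                return true;
            }
            return false;                      // still open: reject the call
        }

        void recordFailure(long nowMs) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAtMs = nowMs;            // trip the circuit open
            }
        }

        void recordSuccess() {
            consecutiveFailures = 0;
        }
    }
}
```

With an initial interval of 3000 ms and a multiplier of 2, the first three retries wait 3000, 6000, and 12000 ms, which slows error processing progressively instead of hammering a failing dependency.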

Zishan Shaikh Principal Engineer, MetricStream

Key Compliance Areas to Focus On: A 360-degree View

GRC
08 February 21
BLOG ADMIN

4 min read

Introduction

As the pandemic continues to batter right through into 2021 and businesses return to the next normal with vaccines making their way into our lives, staying on course with compliance becomes even more critical. Why so?

Regulatory and Corporate compliance, closely tied to brand image and reputation, tops any organization’s priority today to steer clear of penalties, work stoppages or lawsuits in an environment where regulatory complexities are growing. Chief Compliance Officers (CCO) recognize that the cost of non-compliance is too high to bear in a world that is still facing the scourge of COVID-19 crisis. CCOs, tasked with guaranteeing adherence while pre-empting risks, understand the value of putting together a risk-based, integrated compliance strategy.

So, let’s look at what makes for a comprehensive compliance strategy. Starting with a risk-based and federated approach, it entails tracking regulatory engagements, keeping policies in sync with new regulations, while not taking the eye off integrity and culture needs.

A federated approach to compliance makes room for a holistic view, where departments across the board collaborate and share compliance information and technology, but also ensure that the unique compliance needs of each department are kept in place. This is the sign of a truly mature organization because it weeds out duplication of effort, breaks data silos, and offers an opportunity to create a common compliance data architecture.

A Risk-based Approach – Winner All the Way

To put together a tightly-knit compliance strategy, organizations must adopt a risk-based approach. The need of the hour, especially post the pandemic, is a risk-based approach that is customized to suit the needs of each industry type. With the COVID-19 crisis, organizations have woken up to the reality that not only are there record-high regulatory fines to deal with in case of non-compliance, but also that not all risks need the same level of protection.

Informed decision making in an evolving landscape requires creating best practices for managing compliance risk. The three key steps organizations can take to carve out a robust compliance risk management program are:

Assess and Prioritize Risks
Determine the Right Controls
Report Findings Early and in Real Time

The pandemic especially requires organizations to reassess and rearchitect their compliance risk profiles, both from a quantitative and a qualitative perspective. What is a good way to acquire a contextual view of risk? It is by putting in place an integrated compliance data model that ensures a link with other risks as well as regulations, policies, processes, controls, objectives, etc. Risks must be linked to their appropriate owners. And, risk computations make it easier for organizations to rank and prioritize compliance risks.

The next step is choosing the appropriate controls so as to prevent or detect risks better. Well-executed controls stem risks. Compliance management software tools, especially Robotic Process Automation (RPA) tools, have a key role to play here as they help accelerate control assessments by automating and streamlining processes. Compliance management software can help document potential risks and make room for systematic issue investigation and remediation.

Organizations that operate across geographies have their own share of risk reporting complexities to deal with. Real-time, timely reporting is feasible with the use of advanced reporting tools such as graphical dashboards that help view historical as well as real-time data. Organizations are also exploring the use of advanced analytics and machine learning in detecting and predicting compliance risks so that compliance managers stay clued in to ground realities.

Risk mitigation may be the primary responsibility of compliance experts, but all three lines of defense must work in tandem on this. The stronger the business ownership of risk, the better positioned an organization is. An integrated and holistic compliance strategy and program puts workflows around policies, cases, compliance assessments, and other processes on the fast track. And while this happens, organizations must not lose sight of integrity and culture. Compliance and integrity are two sides of the same coin. Be it the management, board, or the frontline, each has a role to play to help the organization imbibe the culture of compliance. While the top management leads from the front by articulating the organization’s core values in an unambiguous and consistent manner, the middle and lower management are the eyes and ears of the organization. The top managers can lean on tools such as employee reviews and customer surveys, while they help employees gauge the importance of accountability, transparency, and desired behaviors. The Board of Directors, on the other hand, can institute formal processes and structures to monitor progress and gaps in compliance to integrity and take corrective actions where necessary.

Keep Policies Aligned with Changing Regulations

The COVID-19 crisis has brought with it a changing compliance landscape. As of May 2020, more than 100 countries issued over 350 regulatory notifications to deal with the COVID-19 crisis. The key challenge for organizations is to ensure compliance without disrupting operational efficiencies. To keep policies in sync with recently updated regulations at both the global and the federal level, organizations can take a few steps, which are outlined in the graphic below:

[Graphic: Key Compliance Areas to Focus On]

Build credibility with regulators with effective regulatory engagement
Organizations need an agile and well-coordinated strategy to effectively track regulatory engagements. To strengthen their regulatory relationships, organizations can:

Be more strategic
Create an internal regulatory engagement community
Keep senior management, board and business in the loop
Enable secure access to regulatory engagement information
Leverage good quality data and automation
Create repeatable processes

MetricStream – A Partner to Lean On

Organizations that Perform with Integrity™ enjoy brand loyalty of customers, partners as well as employees. MetricStream helps customers build more risk-aware and compliant cultures through a range of governance, risk and compliance (GRC) products and solutions built on an integrated risk platform. Our M7 Regulatory Compliance and Corporate Compliance solutions help organizations strengthen compliance by adopting an integrated approach.

As the pressure on compliance and regulatory engagement management teams grows, our solutions will help you:
