Apache SkyWalking – Profiling

Blog: How AI Changed the Economics of Architecture

Fri, 13 Mar 2026 00:00:00 +0000

SkyWalking GraalVM Distro: A case study in turning runnable PoCs into a repeatable migration pipeline.

The most important lesson from this project is not that AI can generate a large amount of code. It is that AI changes the economics of architecture. When runnable PoCs become cheap to build, compare, discard, and rebuild, architects can push further toward the design they actually want instead of stopping early at a compromise they can afford to implement.

That shift matters a lot in mature open source systems. Apache SkyWalking OAP has long been a powerful and production-proven observability backend, but it also carries all the realities of a large Java platform: runtime bytecode generation, reflection-heavy initialization, classpath scanning, SPI-based module wiring, and dynamic DSL execution that are friendly to extensibility but hostile to GraalVM native image.

SkyWalking GraalVM Distro is the result of treating that challenge as a design-system problem instead of a one-off porting exercise. The goal was not only to make OAP run as a native binary, but to turn GraalVM migration itself into a repeatable automation pipeline that can stay aligned with upstream evolution.

For the full technical design, benchmark data, and getting-started guide, see the companion post: SkyWalking GraalVM Distro: Design and Benchmarks.

From Paused Idea to Runnable System

This journey actually began years ago. Shortly after this repository was created, yswdqz spent several months exploring the transition. The project proved much harder in practice than the individual GraalVM limitations sounded on paper, and the work eventually paused for years.

That pause is important. The missing ingredient was not ideas. Mature maintainers usually have more ideas than time. The real constraint was implementation economics. Even when the architect can see several promising directions, limited developer resources force an earlier trade-off: choose the path that is cheapest to implement, not necessarily the path that is cleanest, most reusable, or most future-proof.

This is a very common reality, not an exceptional one. In open source communities, much of the work depends on volunteers or limited company sponsorship. In commercial products, the pressure is different but the constraint is still real: roadmap commitments, staffing limits, and delivery deadlines keep engineering resources tight. In both worlds, good ideas are often abandoned not because they are wrong, but because they are too expensive to validate and implement thoroughly.

There is another constraint that matters just as much: the architect is usually also a very senior engineer, not a full-time implementation machine. That means limited personal coding energy, fragmented time, and a constant need to explain ideas to other senior engineers before the code exists. Traditionally, that explanation happens through diagrams, documents, and conversations. It is slow, lossy, and unpredictable. We all know some version of the Telephone Game: even simple words are easy to misunderstand, and by the time the misunderstanding becomes visible, a lot of time has already passed.

What changed in late 2025 was that AI engineering made multiple runnable ideas affordable. Instead of picking an early compromise because implementation capacity was scarce, we could switch repeatedly between designs, validate them with code, discard weak directions quickly, and keep iterating until the architecture became solid, practical, and efficient enough to hold.

That design freedom was critical. GraalVM documentation gives clear guidance on isolated limitations, but a mature OSS platform hits them as a connected system. Fixing only one dynamic mechanism is not enough. To make native image practical, we had to turn whole categories of runtime behavior into build-time artifacts and automated metadata generation.

There was also a very concrete mountain in front of us in the early history of this distro. In the first several commits of the repository, upstream SkyWalking still relied heavily on Groovy for LAL, MAL, and Hierarchy scripts. In theory, that was just one more unsupported runtime-heavy component. In practice, Groovy was the biggest obstacle in the whole path. It represented not only script execution, but a whole dynamic model that was deeply convenient on the JVM side and deeply unfriendly to native image.

To bridge the gap, we re-architected the core engines of OAP around an AOT-first model. Earlier experiments had to confront Groovy-era runtime behavior directly and explore alternative script-compilation approaches to get around it. The finalized direction went further: align with the upstream compiler pipeline, move dynamic generation to build time, and add automation so the migration stays controllable as upstream keeps moving. Concretely, that meant turning OAL, MAL, LAL, and Hierarchy generation into build-time precompiler outputs instead of leaving them as startup-time dynamic behavior.

AI Speed Changed the Design Loop

The scale of this transformation was not only about coding faster. AI changed the loop between idea, prototype, validation, and redesign. We could build runnable PoCs for different approaches, throw away weak ones quickly, and preserve the promising abstractions until they formed a coherent migration system.

That does not reduce the role of human architecture. It raises the value of it. Human judgment was still required to decide what should become build-time, what should stay configurable, where to introduce same-FQCN replacements, how to keep upstream sync controllable, and which abstractions were worth preserving. But AI speed made it realistic to pursue those better designs instead of settling for a simpler compromise too early.

This is the real change in the economics of architecture. In the past, an architect might already know the cleaner direction, but limited engineering capacity often forced that vision back toward a cheaper compromise. Now the architect can return much closer to being a fast developer again: building code, shaping high-abstraction interfaces, and using design patterns to prove the vision directly in the real world.

That changes communication as much as implementation. In open source, we often say, "Talk is cheap. Show me the code." With AI engineering, showing the code becomes much more straightforward. The design no longer depends so heavily on a slow top-down translation from idea to documents to interpretation to implementation. The code can appear earlier, and it can run earlier.

Other senior engineers benefit from this too. They do not need to reconstruct the whole design only from diagrams, meetings, or long explanations. They can review the actual abstraction, see the behavior in code, run it, challenge it, and refine it from something concrete. That makes architectural collaboration faster, clearer, and less lossy.

This is also where I think the current AI discussion is often noisy. Many projects are fun, surprising, and worth exploring, but advanced engineering work is not improved merely by attaching an agent to a codebase. The important question is not which demo looks most magical. The important question is which engineering capabilities are actually being accelerated without losing the discipline of software development itself.

For architects and senior engineers, the capabilities that mattered most here were:

Fast comparative prototyping: Building several runnable approaches in code instead of defending one idea with slides and documents.
Large-scale code comprehension: Reading across many modules quickly enough to keep the whole system in view.
Systematic refactoring: Converting reflection-heavy or runtime-dynamic paths into designs that fit AOT constraints.
Automation construction: When a migration step must be repeated every upstream sync, doing it manually once is already expensive. Doing it manually again next time is even more expensive. AI made it practical to invest in generators, inventories, consistency checks, and drift detectors that turn repeated manual work into repeatable automation.
Review at breadth: Checking edge cases, compatibility boundaries, and repeatability across a large surface area.

Those capabilities were visible in the resulting design. Same-FQCN replacements created a controlled boundary for GraalVM-specific behavior. Reflection metadata was generated from build outputs instead of maintained as a hand-written guess list. Inventories and drift detectors turned upstream sync from a vague maintenance risk into an explicit engineering workflow.

For junior engineers, I think the lesson is equally important. AI does not remove the need to learn architecture, invariants, interfaces, testing, or maintenance. It makes those skills more valuable, because they determine whether accelerated implementation produces a durable system or just more code faster. The leverage comes from engineering judgment, not from novelty.

Claude Code and Gemini AI acted as engineering accelerators throughout this process. In the GraalVM Distro specifically, they helped us:

Explore migration strategies as running code: Instead of debating which approach might work, we built and compared multiple real prototypes, discarded the weak ones, and kept what held up.
Refactor reflection-heavy and dynamic code paths: Replace runtime-hostile patterns with AOT-friendly alternatives across the codebase.
Make upstream sync sustainable: Every time the distro pulls from upstream SkyWalking, metadata scanning, config regeneration, and recompilation must happen again. AI helped build the pipeline so that each sync is a controlled, largely automated process rather than a fresh manual effort that grows longer each time.
Review logic and edge cases at scale: Especially in places where feature parity mattered more than raw implementation speed.

The result was not just a large rewrite. It was a repeatable system: precompilers, manifest-driven loading, reflection-config generation, replacement boundaries, and drift detectors that make upstream migration reviewable and automatable.

For the broader methodology behind this style of development, see Agentic Vibe Coding in a Mature OSS Project. This post is the next step in that story: not only enhancing an active mature codebase, but reviving a paused effort and making it actually runnable.

What Actually Changed

The most important outcome of this project is not a benchmark table. The benchmark results belong to the distro itself, and they matter because they prove the system is real. But for this post, the deeper result is methodological: AI engineering changed how architecture could be explored, validated, and refined.

Instead of treating architecture as a mostly document-driven activity followed by a long and expensive implementation phase, we were able to move much faster between idea, prototype, comparison, and redesign. That made it realistic to pursue higher-abstraction solutions, preserve cleaner boundaries, and build the automation needed to keep the migration maintainable over time.

The technical evidence for that work is the SkyWalking GraalVM Distro itself: not only a runnable system, but a migration pipeline expressed as precompilers, generated reflection metadata, controlled replacement boundaries, and drift checks. The benchmark data matter because they prove the system works in practice, but the architectural result is that the migration became a repeatable system rather than a one-time port. For detailed benchmark methodology, per-pod data, and the full technical design, see SkyWalking GraalVM Distro: Design and Benchmarks.

The project is hosted at apache/skywalking-graalvm-distro. We invite the community to test it, report issues, and help move it toward production readiness.

For me, the deeper takeaway is broader than this distro. AI engineering does not make architecture less important. It makes architecture more worth pursuing. When implementation speed rises enough, we can afford to test more ideas in code, keep the good abstractions, and build systems that would previously have been judged too expensive to finish well.

For senior engineers, that means the bottleneck shifts away from raw typing speed and toward taste, system judgment, and the ability to define stable boundaries. For junior engineers, it means the path forward is not to chase every exciting AI workflow, but to become stronger at the fundamentals that let acceleration compound: understanding requirements, reading unfamiliar systems, questioning assumptions, and recognizing what must remain correct as everything around it changes. AI changed the economics of architecture because it lowered the cost of validating better designs without lowering the bar for engineering judgment.

Blog: SkyWalking GraalVM Distro: Design and Benchmarks

Fri, 13 Mar 2026 00:00:00 +0000

A technical deep-dive into how we migrated Apache SkyWalking OAP to GraalVM Native Image — not as a one-off port, but as a repeatable pipeline that stays aligned with upstream.

For the broader story of how AI engineering made this project economically viable, see How AI Changed the Economics of Architecture.

Why GraalVM Is Not Optional

GraalVM Native Image compiles Java applications Ahead-of-Time (AOT) into standalone executables. For an observability backend like SkyWalking OAP, this is not a performance optimization — it is an operational necessity.

An observability platform must be the most reliable component in the infrastructure. It has to survive the failures it is supposed to observe. In cloud-native environments where workloads scale, migrate, and restart constantly, the backend that watches everything cannot itself be the slow, heavy process that takes seconds to recover and gigabytes to idle.

Our benchmarks make the case concrete:

Startup: ~5 ms vs ~635 ms. In a Kubernetes cluster where an OAP pod gets evicted or rescheduled, a 635 ms gap means lost telemetry — traces, metrics, and logs that arrive during that window are simply dropped. At 5 ms, the new pod is receiving data before most clients even notice the disruption.
Idle memory: ~41 MiB vs ~1.2 GiB. Observability backends run 24/7. In a multi-tenant or edge deployment, a 97% reduction in baseline RSS is the difference between fitting the observability stack on a small node and needing a dedicated one.
Memory under load: ~629 MiB vs ~2.0 GiB at 20 RPS. A 70% reduction at production-like traffic means fewer nodes, lower cloud bills, and more headroom before the backend itself becomes a scaling bottleneck.
No warm-up penalty: Peak throughput is available from the first request. The JVM’s JIT compiler needs minutes of traffic before it optimizes hot paths — during that window, tail latency is worse and data processing lags behind. A native binary has no such phase.
Smaller attack surface: No JDK runtime means fewer CVEs to track and patch. For a component that ingests data from every service in the cluster, that matters.

These are not incremental improvements. They change what deployment topologies are practical. Serverless observability backends, sidecar-model collectors, edge nodes with tight memory budgets — all become realistic when the backend is this light and this fast.

The Challenge: A Mature, Dynamic Java Platform

SkyWalking OAP carries all the realities of a large Java platform: runtime bytecode generation, reflection-heavy initialization, classpath scanning, SPI-based module wiring, and dynamic DSL execution. These patterns are friendly to extensibility but hostile to GraalVM native image.

The documented GraalVM limitations are only the beginning. In a mature OSS platform, those limitations are deeply entangled with years of runtime design decisions. Standard GraalVM native images struggle with runtime class generation, reflection, dynamic discovery, and script execution — all of which had deep roots in SkyWalking OAP.

There was also a very concrete mountain in the early history of this distro. Upstream SkyWalking relied heavily on Groovy for LAL, MAL, and Hierarchy scripts. In theory, that was just one more unsupported runtime-heavy component. In practice, Groovy was the biggest obstacle in the whole path. It represented not only script execution, but a whole dynamic model that was deeply convenient on the JVM side and deeply unfriendly to native image.

The Design Goal: Make Migration Repeatable

The final design is not just “run native-image successfully.” It is a system that keeps migration work repeatable:

Pre-compile runtime-generated assets at build time. OAL, MAL, LAL, Hierarchy rules, and meter-related generated classes are compiled during the build and packaged as artifacts instead of being generated at startup.
Replace dynamic discovery with deterministic loading. Classpath scanning and runtime registration paths are converted into manifest-driven loading.
Reduce runtime reflection and generate native metadata from the build. Reflection configuration is produced from actual manifests and scanned classes instead of being maintained as a hand-written guess list.
Keep the upstream sync boundary explicit. Same-FQCN replacements are intentionally packaged, inventoried, and guarded with staleness checks.
Make drift visible immediately. If upstream providers, rule files, or replaced source files change, tests fail and force explicit review.

That is the architectural shift that matters most. Reusable abstraction and foresight did not become less important in the AI era. They became more important, because they determine whether AI speed produces a maintainable system or just a fast-growing pile of code.

Turning Runtime Dynamism into Build-Time Assets

SkyWalking OAP has several dynamic subsystems that are natural in a JVM world but problematic for native image:

OAL generates classes at runtime.
LAL, MAL, and Hierarchy were historically tied to Groovy-heavy runtime behavior, which became one of the biggest practical blockers in the early distro work.
MAL, LAL, and Hierarchy rules depend on runtime compilation behavior.
Guava-based classpath scanning discovers annotations, dispatchers, decorators, and meter functions.
SPI-based module/provider discovery expects a more dynamic runtime environment.
YAML/config initialization and framework integrations depend on reflective access.

In SkyWalking GraalVM Distro, these are not solved one by one as isolated patches. They are pulled into a build-time pipeline.

The precompiler runs the DSL engines during the build, exports generated classes, writes manifests, serializes config data, and generates native-image metadata. That means startup becomes class loading and registration, not runtime code generation. The runtime path is simpler because the build path became richer.
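The shift from runtime discovery to manifest-driven loading can be illustrated with a minimal sketch. It is written in Python for brevity and is not the distro's actual loader: the manifest format (a JSON list of `module:ClassName` entries) and the function name `load_from_manifest` are assumptions made for this example.

```python
import importlib
import json


def load_from_manifest(manifest_text):
    """Resolve every entry of a build-time manifest to a class object.

    Each entry has the (assumed) form "module.path:ClassName". Nothing is
    scanned or generated here; a missing entry raises instead of being
    silently skipped.
    """
    registry = {}
    for entry in json.loads(manifest_text):
        module_name, _, class_name = entry.partition(":")
        module = importlib.import_module(module_name)
        registry[entry] = getattr(module, class_name)
    return registry


# Hypothetical manifest, standing in for what a precompiler would write at build time.
manifest = json.dumps(["collections:OrderedDict", "decimal:Decimal"])
registry = load_from_manifest(manifest)
print(sorted(registry))
```

The point of the design is that startup only looks up names that the build already decided on, so the set of loaded classes is known before the process starts.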

This is also why the project is more than a performance exercise. The design goal was to move complexity into a place where it is easier to verify, easier to automate, and easier to repeat.

Same-FQCN Replacements as a Controlled Boundary

One of the most practical design choices in this distro is the use of same-FQCN replacement classes. We do not rely on vague startup tricks or undocumented ordering assumptions. Instead, the GraalVM-specific jars are repackaged so the original upstream classes are excluded and the replacement classes occupy the exact same fully-qualified names.

This matters for maintainability. It creates a very clear boundary:

the upstream class still defines the behavior contract,
the GraalVM replacement provides a compatible implementation strategy,
and the packaging makes that swap explicit.

For example, OAL loading changes from runtime compilation into manifest-driven loading of precompiled classes. Similar replacements handle MAL and LAL DSL loading, module wiring, config initialization, and several reflection-sensitive paths. The goal is not to fork everything. The goal is to replace only the places where the runtime model is fundamentally unfriendly to native image.

That boundary is then guarded by tests that hash the upstream source files corresponding to the replacements. When upstream changes one of those files, the build fails and tells us exactly which replacement needs review. This is what turns “keeping up with upstream” from an anxiety problem into a visible engineering task.
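A hash-based staleness guard of this kind can be sketched in a few lines. The version below is a Python illustration under stated assumptions, not the distro's test code: the file name `UpstreamClass.java`, the `recorded` mapping, and the helper names are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_stale(recorded):
    """Return the upstream files whose current hash differs from the recorded one."""
    return sorted(p for p, digest in recorded.items() if sha256_of(p) != digest)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical upstream source file that has a GraalVM-specific replacement.
    upstream = Path(tmp) / "UpstreamClass.java"
    upstream.write_text("class UpstreamClass { /* v1 */ }")

    # Hash recorded when the replacement was last reviewed.
    recorded = {str(upstream): sha256_of(upstream)}
    clean = find_stale(recorded)

    # Simulate an upstream change arriving with the next sync.
    upstream.write_text("class UpstreamClass { /* v2 */ }")
    stale = find_stale(recorded)

print("before sync:", clean)
print("after sync:", stale)
```

A test that asserts `find_stale(...)` is empty fails as soon as upstream touches a watched file, and the failure message names the replacement to review.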

Reflection Config Is Generated, Not Guessed

In many GraalVM migrations, reflect-config.json becomes a manually accumulated artifact. It grows over time, gets stale, and nobody is fully sure whether it is complete or why each entry exists. That approach does not scale well for a large, evolving OSS platform.

In this distro, reflection metadata is generated from the build outputs and scanned classes:

manifests for OAL, MAL, LAL, Hierarchy, and meter-generated classes,
annotation-scanned classes,
Armeria HTTP handlers,
GraphQL resolvers and schema-mapped types,
and accepted ModuleConfig classes.

This is a much healthier model. Instead of asking people to remember every reflective access path, the system derives reflection metadata from the actual migration pipeline. The build becomes the source of truth.
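As a rough sketch of what "generated, not guessed" means, the following Python snippet derives reflection entries from lists of class names. The class names and the helper `build_reflect_config` are hypothetical, and the choice of flags is only one plausible policy; the entry fields themselves (`name`, `allDeclaredConstructors`, `allPublicMethods`) follow the reflect-config.json format used by GraalVM.

```python
import json


def build_reflect_config(*class_lists):
    """Merge class-name lists into deduplicated, sorted reflect-config entries."""
    names = sorted({name for names in class_lists for name in names})
    return [
        {"name": name, "allDeclaredConstructors": True, "allPublicMethods": True}
        for name in names
    ]


# Hypothetical inputs: one list from a precompiler manifest, one from annotation scanning.
oal_manifest = ["org.example.generated.ServiceCpmMetrics"]
scanned = [
    "org.example.handler.TraceHandler",
    "org.example.generated.ServiceCpmMetrics",
]

config = build_reflect_config(oal_manifest, scanned)
print(json.dumps(config, indent=2))
```

Because the output is a pure function of the build inputs, every entry can be traced back to the manifest or scan that produced it.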

Keeping Upstream Sync Practical

If this distro were only a one-time engineering sprint, it would be much less interesting. The real challenge is keeping it alive while upstream SkyWalking continues to evolve.

That is why the repo includes explicit inventories and drift detectors:

provider inventories that force new upstream providers to be categorized,
rule-file inventories that force new DSL inputs to be acknowledged,
SHA watchers for precompiled YAML inputs,
and SHA watchers for upstream source files with GraalVM-specific replacements.
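An inventory check of the first kind can be sketched as a set comparison. This Python illustration assumes a simple mapping from provider name to a category. The names and categories below are hypothetical and are not taken from the repository.

```python
def check_inventory(discovered, inventory):
    """Compare providers found upstream against the categorized inventory.

    Returns (unknown, removed): providers that still need a category, and
    inventory entries that no longer exist upstream.
    """
    known = set(inventory)
    found = set(discovered)
    return sorted(found - known), sorted(known - found)


# Hypothetical inventory: every known provider has an explicit decision attached.
inventory = {
    "storage-banyandb": "supported",
    "storage-elasticsearch": "excluded",
}

# Hypothetical result of listing providers after pulling from upstream.
discovered = ["storage-banyandb", "storage-elasticsearch", "new-receiver"]

unknown, removed = check_inventory(discovered, inventory)
print("needs categorization:", unknown)
print("no longer upstream:", removed)
```

A test that asserts both lists are empty forces a new upstream provider to be categorized explicitly before the build goes green again.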

Good abstraction is not only about elegant code structure. It is about choosing a migration design that can survive contact with future change.

Benchmark Results

We benchmarked the standard JVM OAP against the GraalVM Distro on an Apple M3 Max (macOS, Docker Desktop, 10 CPUs / 62.7 GB), both connecting to BanyanDB.

Boot Test (Docker Compose, no traffic, median of 3 runs)

Metric	JVM OAP	GraalVM OAP	Delta
Cold boot startup	635 ms	5 ms	~127x faster
Warm boot startup	630 ms	5 ms	~126x faster
Idle RSS	~1.2 GiB	~41 MiB	~97% reduction

Boot time is measured from OAP’s first application log timestamp to the listening on 11800 log line (gRPC server ready).

Under Sustained Load (Kind + Istio 1.25.2 + Bookinfo at ~20 RPS, 2 OAP replicas)

30 samples at 10s intervals after 60s warmup.

Metric	JVM OAP	GraalVM OAP	Delta
CPU median (millicores)	101	68	-33%
CPU avg (millicores)	107	67	-37%
Memory median (MiB)	2068	629	-70%
Memory avg (MiB)	2082	624	-70%

Both variants reported identical entry-service CPM, confirming equivalent traffic processing capability.
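The Delta columns in both tables can be reproduced from the raw numbers. The short Python check below uses only the figures quoted above; the helper names are ours, and idle RSS is converted with 1 GiB = 1024 MiB.

```python
def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


def percent_change(baseline, candidate):
    """Relative change from baseline to candidate, in percent."""
    return (candidate - baseline) / baseline * 100


cold = speedup(635, 5)                      # boot time in ms
warm = speedup(630, 5)
idle = percent_change(1.2 * 1024, 41)       # idle RSS in MiB
cpu_median = percent_change(101, 68)        # millicores
cpu_avg = percent_change(107, 67)
mem_median = percent_change(2068, 629)      # MiB
mem_avg = percent_change(2082, 624)

print(f"cold boot: ~{cold:.0f}x, warm boot: ~{warm:.0f}x")
print(f"idle RSS: {idle:.0f}%")
print(f"CPU median: {cpu_median:.0f}%, CPU avg: {cpu_avg:.0f}%")
print(f"memory median: {mem_median:.0f}%, memory avg: {mem_avg:.0f}%")
```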

Service metrics collected every 30s via swctl for all discovered services: service_cpm, service_resp_time, service_sla, service_apdex, service_percentile.

Full benchmark scripts and raw data are in the benchmark/ directory of the distro repository.

Current Status

The project is a runnable experimental distribution, hosted in its own repository: apache/skywalking-graalvm-distro.

The current distro intentionally focuses on a modern, high-performance operating model:

Storage: BanyanDB
Cluster modes: Standalone and Kubernetes
Configuration: none or Kubernetes ConfigMap
Runtime model: fixed module set, precompiled assets, and AOT-friendly wiring

This focus is deliberate. A repeatable migration system starts by making a clear scope runnable, then expanding without losing control.

Getting Started

Because the SkyWalking GraalVM Distro is designed for peak performance, it is optimized to work with BanyanDB as its storage backend. The current published image is available on Docker Hub, and you can boot the stack using the following docker-compose.yml.

version: '3.8'

services:
  banyandb:
    image: ghcr.io/apache/skywalking-banyandb:e1ba421bd624727760c7a69c84c6fe55878fb526
    container_name: banyandb
    restart: always
    ports:
      - "17912:17912"
      - "17913:17913"
    command: standalone --stream-root-path /tmp/stream-data --measure-root-path /tmp/measure-data --measure-metadata-cache-wait-duration 1m --stream-metadata-cache-wait-duration 1m
    healthcheck:
      test: ["CMD", "sh", "-c", "nc -nz 127.0.0.1 17912"]
      interval: 5s
      timeout: 10s
      retries: 120

  oap:
    image: apache/skywalking-graalvm-distro:0.1.1
    container_name: oap
    depends_on:
      banyandb:
        condition: service_healthy
    restart: always
    ports:
      - "11800:11800"
      - "12800:12800"
    environment:
      SW_STORAGE: banyandb
      SW_STORAGE_BANYANDB_TARGETS: banyandb:17912
      SW_HEALTH_CHECKER: default
    healthcheck:
      test: ["CMD-SHELL", "nc -nz 127.0.0.1 11800 || exit 1"]
      interval: 5s
      timeout: 10s
      retries: 120

  ui:
    image: ghcr.io/apache/skywalking/ui:10.3.0
    container_name: ui
    depends_on:
      oap:
        condition: service_healthy
    restart: always
    ports:
      - "8080:8080"
    environment:
      SW_OAP_ADDRESS: http://oap:12800

Simply run:

docker compose up -d
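To wait for OAP from a script on the host, the same kind of TCP probe the compose healthchecks perform with `nc` can be written in Python. This is a minimal sketch: the function names are ours, and the retry defaults simply mirror the `interval` and `retries` values in the compose file.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, retries=120, interval=5.0):
    """Poll host:port until it accepts connections or the retries run out."""
    for _ in range(retries):
        if port_is_open(host, port):
            return True
        time.sleep(interval)
    return False


# Self-contained demo: listen on an ephemeral local port and probe it.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
demo_port = server.getsockname()[1]
open_result = wait_for_port("127.0.0.1", demo_port, retries=1, interval=0)
server.close()
closed_result = port_is_open("127.0.0.1", demo_port, timeout=0.5)
print(open_result, closed_result)
```

Against the stack above, `wait_for_port("127.0.0.1", 11800)` would block until the published gRPC port accepts connections.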

We invite the community to test this new distribution, report issues, and help us move it toward a production-ready state.

Special thanks to the GraalVM team for the technology foundation.

Blog: Profiling Java application with SkyWalking bundled async-profiler

Mon, 09 Dec 2024 00:00:00 +0000

Background

Apache SkyWalking is an open-source Application Performance Management system that helps users gather logs, traces, metrics, and events from various platforms and display them on the UI. In version 10.1.0, Apache SkyWalking can perform CPU analysis through eBPF, which supports multiple languages but not Java. This article discusses how Apache SkyWalking 10.2.0 closes that gap by using async-profiler to collect CPU, memory allocation, and lock samples from Java applications for analysis.

Why use async-profiler

The async-profiler is a low-overhead sampling profiler for Java that does not suffer from the safepoint bias problem. It features HotSpot-specific APIs to collect stack traces and to track memory allocations. The profiler works with OpenJDK and other Java runtimes based on the HotSpot JVM. The async-profiler also officially supports the instruction set architectures commonly used on Linux and Mac platforms, and the sampling data can be stored in the JFR format. Compared with the JFR tool officially provided by the JDK, it also supports older JDK versions (JDK 6).

Architecture diagram

The processes of running a profiling task

A user submits an async-profiler task in the UI.
The Java agent retrieves the task from the OAP server.
The Java agent executes the task and collects profiling samples through async-profiler.
After the profiling is completed, the agent uploads the JFR file to the OAP server.
The server parses the JFR file to generate profiling results and marks the task as completed.
The user can check the performance analysis result in the UI.

Demo

You can set up the SkyWalking showcase locally to preview this feature. In this demo, we deploy only the services, the latest released SkyWalking OAP, and the UI.

export FEATURE_FLAGS=java-agent-injector,single-node,elasticsearch
make deploy.kubernetes

After deployment is complete, run the following command to forward the UI port, then open the SkyWalking UI at http://localhost:8080/.

kubectl port-forward svc/ui 8080:8080 --namespace default

Run the Async Profiling Task Step by Step

After the deployment is complete, users can navigate to the service page where the Java agent is configured. The service page shows the Async Profiling component. Clicking this component opens the page where profiling tasks are created, monitored, and analyzed.

Create a New Task

Clicking New Task on the Async Profiling page will direct you to the following configuration page. The usage of each parameter is explained as follows:

Instance: This parameter allows you to select the instance of the service that will execute the profiling. It supports selecting multiple instances simultaneously for performance analysis.
Duration: Specifies the duration for the task. By default, the maximum duration is conservatively capped at 20 minutes, but this can be adjusted through the Java agent configuration.
Async Profiling Events: The profiling events are categorized into three types of sampling:
- CPU Sampling: CPU, WALL, CTIMER, ITIMER. The differences between these four CPU sampling types are described under Some Details below.
- Memory Allocation Sampling: ALLOC.
- Lock Occupancy Sampling: LOCK.
ExecArgs: Extended parameters for async-profiler. Detailed usage instructions are given under ExecArgs in task creation below.

Check the Progress of the Task

By clicking the task details icon, users can view the task status logs, relevant parameters, as well as instances where data collection has either failed or been successfully completed. Instances that have successfully completed data collection will be available for subsequent performance analysis.

It is important to note that, in containerized deployments where users have not configured volume mounts, there may be cases where JFR files cannot be received. To address this, the OAP Server by default uses memory to receive and parse JFR files. The maximum acceptable size for JFR files is conservatively set to 30MB by default.

Users can change the maximum JFR file size in the OAP configuration and opt to store the files on the filesystem before parsing them, which lets the platform handle larger JFR files and keeps memory allocation smoother.

Currently, the JFR parser requires approximately 1GB of memory to process a 200MB JFR file. (Note that this refers only to memory allocation, not the actual memory required for parsing.) Users can use this as a reference when configuring their OAP Server.
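For rough sizing, the two figures above (a 30MB default limit, and about 1GB of allocation per 200MB of JFR data) can be turned into a quick estimate. The Python sketch below assumes that allocation scales linearly with file size, which the reference point does not guarantee, so treat the result as an order-of-magnitude guide only. The function and constant names are ours.

```python
DEFAULT_MAX_JFR_MB = 30            # default in-memory limit quoted above
ALLOC_MB_PER_JFR_MB = 1024 / 200   # ~1GB allocated per 200MB of JFR (single reference point)


def estimated_allocation_mb(jfr_mb):
    """Estimate parser allocation, ASSUMING it scales linearly with file size."""
    return jfr_mb * ALLOC_MB_PER_JFR_MB


def fits_default_limit(jfr_mb):
    """Check a JFR file size against the default in-memory limit."""
    return jfr_mb <= DEFAULT_MAX_JFR_MB


for size_mb in (30, 100, 200):
    print(
        f"{size_mb}MB JFR: ~{estimated_allocation_mb(size_mb):.0f}MB allocated, "
        f"within default limit: {fits_default_limit(size_mb)}"
    )
```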

Performance Analysis

Users can select a task and choose the instances they wish to analyze for performance (multiple instances can be selected for aggregated flame graph analysis). After selecting the desired JFR event type for analysis, users can click the Analyze button to display the corresponding flame graph.

Some Details

Differences in CPU sampling during task creation

The CPU sampling mechanism supports several modes, each representing a different sampling engine implemented by async-profiler. These modes include CPU, WALL, CTIMER, and ITIMER, and differ primarily in how they collect and generate sampling signals. The following provides a detailed description of each sampling:

CPU: cpu mode relies on perf_events. It generates a signal every N nanoseconds of CPU time, which is achieved by configuring the PMU to generate an interrupt every K CPU cycles.
WALL: Same as CPU sampling, but also samples threads in a non-runnable state, such as sleeping threads.
ITIMER: itimer mode is based on the setitimer(ITIMER_PROF) syscall, which ideally generates a signal every given interval of the CPU time consumed by the process.
CTIMER: ctimer mode aims to address the limitations of perf_events and itimer. It relies on timer_create and combines the benefits of cpu and itimer, except that it does not allow collecting kernel stacks.

For details, please refer to the async-profiler documentation.

ExecArgs in task creation

By default, task parameters are separated by commas. When creating a task, users should refer to the following example format for input: lock=10us,interval=10ms.
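The comma-separated format is simple to split into key/value pairs. The Python sketch below only illustrates the format. It is not the agent's or async-profiler's parser, and the name `parse_exec_args` is ours. Options given without a value are kept with `None`.

```python
def parse_exec_args(text):
    """Split a comma-separated ExecArgs string into an option dictionary."""
    options = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        key, sep, value = item.partition("=")
        options[key] = value if sep else None
    return options


parsed = parse_exec_args("lock=10us,interval=10ms")
print(parsed)
```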

Currently, the following parameters are supported by default:

Option	Description
chunksize=N	approximate size of JFR chunk in bytes (default: 100 MB)
chunktime=N	duration of JFR chunk in seconds (default: 1 hour)
lock[=DURATION]	profile contended locks overflowing the DURATION ns bucket
jstackdepth=N	maximum Java stack depth (default: 2048)
interval=N	sampling interval in ns (default: 10'000'000, i.e. 10 ms)
alloc[=BYTES]	profile allocations with BYTES interval

For other parameters, please refer to async-profiler. These parameters are not covered by default, so you need to test them yourself.
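As a minimal sketch of the comma-separated format described above, the following Python snippet splits an ExecArgs string into option/value pairs. The function name `parse_exec_args` is hypothetical and is not part of SkyWalking or async-profiler; it only illustrates the input format.

```python
def parse_exec_args(exec_args):
    """Split a comma-separated ExecArgs string into an option -> value mapping.

    Options given without a value (for example a bare "alloc") map to None.
    This is an illustration of the format only, not SkyWalking's parser.
    """
    options = {}
    for item in exec_args.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        name, sep, value = item.partition("=")
        options[name] = value if sep else None
    return options


print(parse_exec_args("lock=10us,interval=10ms"))
# prints {'lock': '10us', 'interval': '10ms'}
```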

Comparison table between sampling types and JFR events in task analysis

Task sample type	JFR event type	Description	Unit
CPU WALL ITIMER CTIMER	EXECUTION_SAMPLE	Multiple AsyncProfilerEventType types correspond to the EXECUTION_SAMPLE event. This is primarily because different sampling types employ distinct underlying mechanisms and have varying sampling scopes.	Number of samples. The execution time can be estimated from the sampling interval. For instance, if the number of samples is 10 and the interval is set to 10ms, the total execution time can be estimated as 100ms (the default interval is 10ms)
LOCK	THREAD_PARK JAVA_MONITOR_ENTER	Empty	ns
ALLOC	OBJECT_ALLOCATION_IN_NEW_TLAB OBJECT_ALLOCATION_OUTSIDE_TLAB	Empty	byte
Add `live` option to extended parameters	PROFILER_LIVE_OBJECT	Because it is not in the event parameter of async-profiler, it is not selected separately in the task sampling type of the UI during implementation, but is used as an extended parameter	byte
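The execution-time estimate for EXECUTION_SAMPLE in the table above is a simple multiplication. A minimal sketch follows, where the helper name `estimate_execution_time_ms` is hypothetical and the 10 ms default mirrors the default interval stated above:

```python
def estimate_execution_time_ms(sample_count, interval_ms=10.0):
    """Estimate execution time as (number of samples) x (sampling interval)."""
    return sample_count * interval_ms


# 10 samples at the default 10 ms interval
print(estimate_execution_time_ms(10))  # prints 100.0
```

The result is only an estimate: sampling captures the stack at intervals, so short executions between two samples are not counted.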

Performance expenses

There is no performance overhead when an instance is not receiving an async-profiler task. Performance impact is only introduced once the async-profiler performance analysis is initiated. The extent of this overhead depends on the specific configuration parameters. When using the default settings, the performance impact typically ranges from 0.3% to 10%. For more detailed information, please refer to the issue.

Blog: SkyWalking 10 Release: Service Hierarchy, Kubernetes Network Monitoring by eBPF, BanyanDB, and More

Mon, 13 May 2024 00:00:00 +0000

The Apache SkyWalking team today announced the 10 release. SkyWalking 10 provides a host of groundbreaking features and enhancements. The introduction of Layer and Service Hierarchy streamlines monitoring by organizing services and metrics into distinct layers and providing seamless navigation across them. Leveraging eBPF technology, Kubernetes Network Monitoring delivers granular insights into network traffic, topology, and TCP/HTTP metrics. BanyanDB emerges as a high-performance native storage solution, while expanded monitoring support encompasses Apache RocketMQ, ClickHouse, and Apache ActiveMQ Classic. Support for Multiple Labels Names enhances flexibility in metrics analysis, while enhanced exporting and querying capabilities streamline data dissemination and processing.

This release blog briefly introduces these new features and enhancements as well as some other notable changes.

Layer and Service Hierarchy

The Layer concept was introduced in SkyWalking 9.0.0. It represents an abstract framework in computer science, such as Operating System (OS_LINUX layer) or Kubernetes (k8s layer). It organizes services and metrics into different layers based on their roles and responsibilities in the system. SkyWalking provides a suite of monitoring and diagnostic tools for each layer, but there is a gap between the layers, and data cannot easily be bridged across them.

In SkyWalking 10, SkyWalking provides new abilities to jump and connect across different layers, giving users a seamless monitoring experience.

Layer Jump

In the topology graph, users can click on a service node to jump to the dashboard of the service in another layer. The following figures show the jump from the GENERAL layer service topology to the VIRTUAL_DATABASE service layer dashboard by clicking the topology node.

Figure 1: Layer Jump

Figure 2: Layer jump Dashboard

Service Hierarchy

SkyWalking 10 introduces a new concept called Service Hierarchy, which defines the relationships between logically identical services in various layers. OAP detects the services from different layers and tries to build the connections. Users can click Hierarchy Services on any layer's service topology node or service dashboard to get the Hierarchy Topology. In this topology graph, users can see the relationships between the services in different layers and a summary of the metrics, and can also jump to the service dashboard in each layer. When a service has a performance issue, users can easily analyze the metrics from different layers and track down the root cause.

Examples of Service Hierarchy relationships:

The application song is deployed in the Kubernetes cluster with both the SkyWalking agent and Service Mesh. The application song therefore spans the GENERAL, MESH, MESH_DP and K8S_SERVICE layers, all of which can be monitored by SkyWalking. The Service Hierarchy topology is shown below:

Figure 3: Service Hierarchy Agent With K8s Service And Mesh With K8s Service.

The Service Instance Hierarchy topology is also available to show the status of a single instance across the layers, as below:

Figure 4: Instance Hierarchy Agent With K8s Service(Pod)

The PostgreSQL database psql is deployed in the Kubernetes cluster and used by the application song. The database psql therefore spans the VIRTUAL_DATABASE, POSTGRESQL and K8S_SERVICE layers, all of which can be monitored by SkyWalking. The Service Hierarchy topology is shown below:

Figure 5: Service Hierarchy Agent(Virtual Database) With Real Database And K8s Service

For more supported layers and how the relationships between services in different layers are detected, please refer to the Service Hierarchy. For how to configure the Service Hierarchy in SkyWalking, please refer to the Service Hierarchy Configuration section.

Monitoring Kubernetes Network Traffic by using eBPF

In previous versions, SkyWalking provided Kubernetes (K8s) monitoring from kube-state-metrics and cAdvisor, which can monitor the Kubernetes cluster status and the metrics of the Kubernetes resources.

In SkyWalking 10, by leveraging Apache SkyWalking Rover 0.6+, SkyWalking can monitor Kubernetes network traffic by using eBPF, which collects and maps access logs from applications in Kubernetes environments. From this data, SkyWalking can analyze and provide Service Traffic, Topology, and TCP/HTTP level metrics from the Kubernetes perspective.

The following figures show the Topology and TCP Dashboard of the Kubernetes network traffic:

Figure 6: Kubernetes Network Traffic Topology

Figure 7: Kubernetes Network Traffic TCP Dashboard

For more details about how to monitor Kubernetes network traffic by using eBPF, please refer to Monitoring Kubernetes Network Traffic by using eBPF.

BanyanDB - Native APM Database

BanyanDB 0.6.0 and BanyanDB Java client 0.6.0 are released with SkyWalking 10. As a native storage solution for SkyWalking, BanyanDB is going to be SkyWalking's next-generation storage solution. From 0.6 until 1.0, it is recommended for medium-scale deployments.
It has shown high potential for performance improvement: less than 50% CPU usage and 50% memory usage, with 40% disk volume, compared to Elasticsearch at the same scale.

Apache RocketMQ Server Monitoring

Apache RocketMQ is an open-source distributed messaging and streaming platform, which is widely used in various scenarios including Internet, big data, mobile Internet, IoT, and other fields. SkyWalking provides a basic monitoring dashboard for RocketMQ, which includes the following metrics:

Cluster Metrics: including messages produced/consumed today, total producer/consumer TPS, producer/consumer message size, messages produced/consumed until yesterday, max consumer latency, max commitLog disk ratio, commitLog disk ratio, pull/send threadPool queue head wait time, topic count, and broker count.
Broker Metrics: including produce/consume TPS, producer/consumer message size.
Topic Metrics: including max producer/consumer message size, consumer latency, producer/consumer TPS, producer/consumer offset, producer/consumer message size, consumer group count, and broker count.

The following figure shows the RocketMQ Cluster Metrics dashboard: Figure 8: Apache RocketMQ Server Monitoring

For more metrics and details about the RocketMQ monitoring, please refer to the Apache RocketMQ Server Monitoring.

ClickHouse Server Monitoring

ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time. It is widely used for online analytical processing (OLAP). ClickHouse monitoring covers the metrics, events, and asynchronous metrics of the ClickHouse server, which include the following groups of metrics:

Server Metrics
Query Metrics
Network Metrics
Insert Metrics
Replica Metrics
MergeTree Metrics
ZooKeeper Metrics
Embedded ClickHouse Keeper Metrics

The following figure shows the ClickHouse Cluster Metrics dashboard: Figure 9: ClickHouse Server Monitoring

For more metrics and details about the ClickHouse monitoring, please refer to the ClickHouse Server Monitoring, and here is a blog that can help for a quick start Monitoring ClickHouse through SkyWalking.

Apache ActiveMQ Server Monitoring

Apache ActiveMQ Classic is a popular and powerful open-source messaging and integration pattern server. SkyWalking provides a basic monitoring dashboard for ActiveMQ, which includes the following metrics:

Cluster Metrics: including memory usage, rates of write/read, and average/max duration of write.
Broker Metrics: including node state, number of connections, number of producers/consumers, and rate of write/read under the broker. Depending on the cluster mode, one cluster may include one or more brokers.
Destination Metrics: including number of producers/consumers, messages in different states, queues, and enqueue duration in a queue/topic.

The following figure shows the ActiveMQ Cluster Metrics dashboard: Figure 10: Apache ActiveMQ Server Monitoring

For more metrics and details about the ActiveMQ monitoring, please refer to the Apache ActiveMQ Server Monitoring, and here is a blog that can help for a quick start Monitoring ActiveMQ through SkyWalking.

Support Multiple Labels Names

Before SkyWalking 10, SkyWalking did not store label names in the metrics data. As a result, MQE had to use _ as the generic label name and could not query metrics data with multiple label names.

SkyWalking 10 supports storing the label names in the metrics data, and MQE can query or calculate the metrics data with multiple label names. For example, the k8s_cluster_deployment_status metric has the labels namespace, deployment and status. If we want to query all deployment metric values with namespace=skywalking-showcase and status=true, we can use the following expression:

k8s_cluster_deployment_status{namespace='skywalking-showcase', status='true'}
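To make the effect of matching on several label names concrete, here is a small Python sketch of the selection such an expression describes. It is not how MQE is implemented; the `select_series` helper and the sample series are hypothetical and made up purely for illustration.

```python
def select_series(series, **wanted):
    """Return the series whose labels match every requested label=value pair."""
    return [
        s for s in series
        if all(s["labels"].get(name) == value for name, value in wanted.items())
    ]


# Hypothetical sample data, not real SkyWalking output.
samples = [
    {"labels": {"namespace": "skywalking-showcase", "deployment": "oap", "status": "true"}, "value": 1},
    {"labels": {"namespace": "skywalking-showcase", "deployment": "ui", "status": "false"}, "value": 0},
    {"labels": {"namespace": "default", "deployment": "web", "status": "true"}, "value": 1},
]

matched = select_series(samples, namespace="skywalking-showcase", status="true")
print([s["labels"]["deployment"] for s in matched])  # prints ['oap']
```

Because the deployment label is not constrained, every deployment in the namespace with the matching status is returned.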

Related enhancements:

Since the alarm rule configuration was migrated to MQE in SkyWalking 9.6.0, alarm rules also support multiple label names.
The PromQL service supports queries with multiple label names.

Metrics gRPC exporter

SkyWalking 10 enhances the metrics gRPC exporter, which now supports exporting all types of metrics data to the gRPC server.

SkyWalking Native UI Metrics Query Switch to V3 APIs

The SkyWalking Native UI metrics queries have deprecated the V2 APIs and have all been migrated to the V3 APIs and MQE.

Other Notable Enhancements

Support Java 21 runtime and oap-java21 image for Java 21 runtime.
Remove CLI(swctl) from the image.
More MQE functions and operators supported.
Enhance the native UI and improve the user experience.
Several bugs and CVEs fixed.

Blog: Monitoring Kubernetes network traffic by using eBPF

Mon, 18 Mar 2024 00:00:00 +0000

Background

Apache SkyWalking is an open-source Application Performance Management system that helps users gather logs, traces, metrics, and events from various platforms and display them on the UI. With version 9.7.0, SkyWalking can collect access logs from probes in multiple languages and from Service Mesh, generating corresponding topologies, tracing, and other data. However, it could not initially collect and map access logs from applications in Kubernetes environments. This article explores how the 10.0.0 version of Apache SkyWalking employs eBPF technology to collect and store application access logs, addressing this limitation.

Why eBPF?

To monitor the network traffic in Kubernetes, the following features must be supported:

Cross Language: Applications deployed in Kubernetes may be written in any programming language, making support for diverse languages important.
Non-Intrusiveness: It’s imperative to monitor network traffic without making any modifications to the applications, as direct intervention with applications in Kubernetes is not feasible.
Kernel Metrics Monitoring: Often, diagnosing network issues by analyzing traffic performance at the user-space level is insufficient. A deeper analysis incorporating kernel-space network traffic metrics is frequently necessary.
Support for Various Network Protocols: Applications may communicate using different transport protocols, necessitating support for a range of protocols.

Given these requirements, eBPF emerges as a capable solution. In the next section, we will delve into detailed explanations of how Apache SkyWalking Rover resolves these aspects.

Kernel Monitoring and Protocol Analysis

In previous articles, we’ve discussed how to monitor network traffic from programs written in various languages. This technique remains essential for network traffic monitoring, allowing for the collection of traffic data without language limitations. However, due to the unique aspects of our monitoring trigger mechanism and the specific features of kernel monitoring, these two areas warrant separate explanations.

Kernel Monitoring

Kernel monitoring allows users to gain insights into network traffic performance based on the execution at the kernel level, specifically from Layer 2 (Data Link) to Layer 4 (Transport) of the OSI model.

Network monitoring at the kernel layer differs from the syscall (user-space) layer in terms of the metrics and identifiers used. While the syscall layer can utilize file descriptors to correlate various operations, kernel layer network operations primarily use packets as unique identifiers. This discrepancy necessitates a mapping relationship that SkyWalking Rover can use to bind these two layers together for comprehensive monitoring.

Let’s dive into the details of how data is monitored in both sending and receiving modes.

Observe Sending

When sending data, tracking the status and timing of each packet is crucial for understanding the state of each transmission. Within the kernel, operations progress from Layer 4 (L4) down to Layer 2 (L2), maintaining the same thread ID as during the syscalls layer, which simplifies data correlation.

SkyWalking Rover monitors several key kernel functions to observe packet transmission dynamics, listed from L4 to L2:

kprobe/tcp_sendmsg: Captures the time when a packet enters the L4 protocol stack for sending and the time it finishes processing. This function is essential for tracking the initial handling of packets at the transport layer.
kprobe/tcp_transmit_skb: Records the total number of packet transmissions and the size of each packet sent. This function helps identify how many times a packet or a batch of packets is attempted to be sent, which is critical for understanding network throughput and congestion.
tracepoint/tcp/tcp_retransmit_skb: Notes whether packet retransmission occurs, providing insights into network reliability and connection quality. Retransmissions can significantly impact application performance and user experience.
tracepoint/skb/kfree_skb: Records packet loss during transmission and logs the reason for such occurrences. Understanding packet loss is crucial for diagnosing network issues and ensuring data integrity.
kprobe/__ip_queue_xmit: Records the start and end times of processing by the L3 protocol. This function is vital for understanding the time taken for IP-level operations, including routing decisions.
kprobe/nf_hook_slow: Records the total time and number of occurrences spent in Netfilter hooks, such as iptables rule evaluations. This monitoring point is important for assessing the impact of firewall rules and other filtering mechanisms on packet flow.
kprobe/neigh_resolve_output: If resolving an unknown MAC address is necessary before sending a network request, this function records the occurrences and total time spent on this resolution. MAC address resolution times can affect the initial packet transmission delay.
kprobe/__dev_queue_xmit: Records the start and end times of entering the L2 protocol stack, providing insights into the data link layer’s processing times.
tracepoint/net/net_dev_start_xmit and tracepoint/net/net_dev_xmit: Records the actual time taken to transmit each packet at the network interface card (NIC). These functions are crucial for understanding the hardware-level performance and potential bottlenecks at the point of sending data to the physical network.

By intercepting the functions above, Apache SkyWalking Rover can provide key execution times and metrics for each level when sending network data, from the application layer (Layer 7) to the transport layer (Layer 4), and finally to the data link layer (Layer 2).

Observe Receiving

When receiving data, the focus is often on the time it takes for packets to travel from the network interface card (NIC) to the user space. Unlike the process of sending data, data receiving in the kernel proceeds from the data link layer (Layer 2) up to the transport layer (Layer 4), until the application layer (Layer 7) retrieves the packet's content. SkyWalking Rover monitors the following key system functions to observe this process, listed from L2 to L4:

tracepoint/net/netif_receive_skb: Records the time when a packet is received by the network interface card. This tracepoint is crucial for understanding the initial point of entry for incoming data into the system.
kprobe/ip_rcv: Records the start and end times of packet processing at the network layer (Layer 3). This probe provides insights into how long it takes for the IP layer to handle routing, forwarding, and delivering packets to the correct application.
kprobe/nf_hook_slow: Records the total time and number of occurrences spent in Netfilter hooks, the same as in the sending traffic flow.
kprobe/tcp_v4_rcv: Records the start and end times of packet processing at the transport layer (Layer 4). This probe is key to understanding the efficiency of TCP operations, including connection management, congestion control, and data flow.
tracepoint/skb/skb_copy_datagram_iovec: When application layer protocols use the data, this tracepoint binds the packet to the syscall layer data at Layer 7. This connection is essential for correlating the kernel’s handling of packets with their consumption by user-space applications.

Based on the above methods, network monitoring can help you understand the complete execution process and execution time from when data is received by the network card to when it is used by the program.

Metrics

By intercepting the methods mentioned above, we can gather key metrics that provide insights into network performance and behavior. These metrics include:

Packets: The size of the packets and the frequency of their transmission or reception. This metric offers a fundamental understanding of the network load and the efficiency of data movement between the sender and receiver.
Connections: The number of connections established or accepted between services and the time taken for these connections to be set up. This metric is crucial for analyzing the efficiency of communication and connection management between different services within the network.
L2-L4 Events: The time spent on key events within the Layer 2 to Layer 4 protocols. This metric sheds light on the processing efficiency and potential bottlenecks within the lower layers of the network stack, which are essential for data transmission and reception.

Protocol Analyzing

In previous articles, we have discussed parsing HTTP/1.x protocols. However, with HTTP/2.x, the protocol’s stateful nature and the pre-established connections between services complicate network profiling. This complexity makes it challenging for Apache SkyWalking Rover to fully perceive the connection context, hindering protocol parsing operations.

Transitioning network monitoring to Daemon mode offers a solution to this challenge. By continuously observing service operations around the clock, SkyWalking Rover can begin monitoring as soon as a service starts. This immediate initiation allows for the tracking of the complete execution context, making the observation of stateful protocols like HTTP/2.x feasible.

Probes

To detect when a process is started, monitoring a specific trace point (tracepoint/sched/sched_process_fork) is essential. This approach enables the system to be aware of process initiation events. Given the necessity to filter process traffic based on certain criteria such as the process’s namespace, Apache SkyWalking Rover follows a series of steps to ensure accurate and efficient monitoring. These steps include:

Monitoring Activation: The process is immediately added to a monitoring whitelist upon detection. This step ensures that the process is considered for monitoring from the moment it starts, without delay.
Push to Queue: The process’s PID (Process ID) is pushed into a monitoring confirmation queue. This queue holds the PIDs of newly detected processes that are pending further confirmation from a user-space program. This asynchronous approach allows for the separation of immediate detection and subsequent processing, optimizing the monitoring workflow.
User-Space Program Confirmation: The user-space program retrieves process PIDs from the queue and assesses whether each process should continue to be monitored. If a process is deemed unnecessary for monitoring, it is removed from the whitelist.

This process ensures that SkyWalking Rover can dynamically adapt its monitoring scope based on real-time conditions and configurations, allowing for both comprehensive coverage and efficient resource use.
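The three steps can be sketched as a small user-space simulation in Python. In Rover the first two steps are described as happening at process-fork detection time; this sketch only models the control flow, and the class name `ProcessMonitor` and the `should_monitor` callback are hypothetical names introduced for illustration.

```python
from collections import deque


class ProcessMonitor:
    """Simulates the whitelist, confirmation queue, and user-space check."""

    def __init__(self, should_monitor):
        # should_monitor: hypothetical user-space filter, e.g. a namespace check
        self.should_monitor = should_monitor
        self.whitelist = set()
        self.pending = deque()

    def on_process_fork(self, pid):
        # Monitoring activation: whitelist the process immediately.
        self.whitelist.add(pid)
        # Push to queue: defer the decision to the user-space program.
        self.pending.append(pid)

    def confirm_pending(self):
        # User-space confirmation: drop processes that should not be monitored.
        while self.pending:
            pid = self.pending.popleft()
            if not self.should_monitor(pid):
                self.whitelist.discard(pid)


# Example filter (an assumption for the demo): keep only even PIDs.
monitor = ProcessMonitor(should_monitor=lambda pid: pid % 2 == 0)
monitor.on_process_fork(101)
monitor.on_process_fork(102)
monitor.confirm_pending()
print(sorted(monitor.whitelist))  # prints [102]
```

Whitelisting first and confirming later means no traffic is missed between process start and the user-space decision, at the cost of briefly monitoring processes that are later removed.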

Limitations

The monitoring of stateful protocols like HTTP/2.x currently faces certain limitations:

Inability to Observe Pre-existing Connections: Monitoring the complete request and response cycle requires that monitoring be initiated before any connections are established. This requirement means that connections set up before the start of monitoring cannot be observed.
Challenges with TLS Requests: Observing TLS encrypted traffic is complex because it relies on asynchronously attaching uprobes (user-space attaching) for observation. If new requests are made before these uprobes are successfully attached, it becomes impossible to access the data before encryption or after decryption.

Demo

Next, let’s quickly demonstrate the Kubernetes monitoring feature, so you can understand more specifically what it accomplishes.

Deploy SkyWalking Showcase

SkyWalking Showcase contains a complete set of example services and can be monitored using SkyWalking. For more information, please check the official documentation.

In this demo, we only deploy the services, the latest released SkyWalking OAP, and the UI.

export FEATURE_FLAGS=java-agent-injector,single-node,elasticsearch,rover
make deploy.kubernetes

After deployment is complete, please run the following script to open SkyWalking UI: http://localhost:8080/.

kubectl port-forward svc/ui 8080:8080 --namespace default

Done

Once deployed, Apache SkyWalking Rover automatically begins monitoring traffic within the system upon startup. It then reports this traffic data to SkyWalking OAP, where it is ultimately stored in a database.

In the Service Dashboard within Kubernetes, you can view a list of monitored Kubernetes services. If any of these services have HTTP traffic, this information is displayed alongside them in the dashboard.

Figure 1: Kubernetes Service List

Additionally, within the Topology Tab, you can observe the topology among related services. Relevant TCP and HTTP metrics are displayed for each service and call relationship.

Figure 2: Kubernetes Service Topology

When you select a specific service from the Service list, you can view service metrics at both the TCP and HTTP levels for the chosen service.

Figure 3: Kubernetes Service TCP Metrics

Figure 4: Kubernetes Service HTTP Metrics

Furthermore, by using the Endpoint Tab, you can see which URIs have been accessed for the current service.

Figure 5: Kubernetes Service Endpoint List

Conclusion

In this article, I’ve detailed how to utilize eBPF technology for network monitoring of services within a Kubernetes cluster, a capability that has been implemented in Apache SkyWalking Rover. This approach leverages the power of eBPF to provide deep insights into network traffic and service interactions, enhancing visibility and observability across the cluster.

Blog: Activating Automatical Performance Analysis -- Continuous Profiling

Sun, 25 Jun 2023 00:00:00 +0000

Background

In previous articles, we have discussed how to use SkyWalking and eBPF for performance problem detection within processes and networks. They are good methods to locate issues, but there are still some challenges:

The timing of the task initiation: It is always challenging to identify the processes that require performance monitoring when problems occur. Typically, manual engagement is required to identify the processes and the types of performance analysis needed, which costs extra time during crash recovery. Locating the root cause and recovering from the crash often conflict with each other. In real cases, rebooting is usually the first choice for recovery, but it also destroys the crash site.
Resource consumption of tasks: It is difficult to determine the profiling scope. Wider profiling consumes more resources than it should. We need a method to manage resource consumption and understand which processes require performance analysis.
Engineer capabilities: On-call duty is usually covered by the whole team, which includes both junior and senior engineers. Even senior engineers have limits in their understanding of a complex distributed system, and it is nearly impossible for a single person to understand the whole system.

Continuous Profiling is a newly created mechanism to resolve the above issues.

Automate Profiling

As profiling is resource-intensive and requires considerable experience, how about introducing a method to narrow the scope and automate profiling, driven by policies created by senior SRE engineers? So, in 9.5.0, SkyWalking first introduced preset policy rules for specific services, which the eBPF Agent monitors in a low-overhead manner and profiles automatically when necessary.

Policy

Policy rules specify how to monitor target processes and determine the type of profiling task to initiate when certain threshold conditions are met.

These policy rules primarily consist of the following configuration information:

Monitoring type: This specifies what kind of monitoring should be implemented on the target process.
Threshold determination: This defines how to determine whether the target process requires the initiation of a profiling task.
Trigger task: This specifies what kind of performance analysis task should be initiated.

Monitoring type

The type of monitoring is determined by observing the data values of a specified process to generate corresponding metrics. These metric values can then facilitate subsequent threshold judgment operations. In eBPF observation, we believe the following metrics can most directly reflect the current performance of the program:

Monitor Type	Unit	Description
System Load	Load	System load average over a specified period.
Process CPU	Percentage	The CPU usage of the process as a percentage.
Process Thread Count	Count	The number of threads in the process.
HTTP Error Rate	Percentage	The percentage of HTTP requests that result in error responses (e.g., 4xx or 5xx status codes).
HTTP Avg Response Time	Millisecond	The average response time for HTTP requests.

Monitoring network type metrics is not as simple as obtaining basic process information. It requires the initiation of eBPF programs and attaching them to the target process for observation. This is similar to the principles of network profiling task we introduced in the previous article, except that we no longer collect the full content of the data packets. Instead, we only collect the content of messages that match specified HTTP prefixes.

By using this method, we can significantly reduce the number of times the kernel sends data to the user space, and the user-space program can parse the data content with less system resource usage. This ultimately helps in conserving system resources.

Metrics collector

The eBPF Agent periodically reports the following process metrics to indicate process performance over time.

Name	Unit	Description
process_cpu	(0-100)%	The CPU usage percent
process_thread_count	count	The thread count of process
system_load	count	The average system load for the last minute; every process has the same value
http_error_rate	(0-100)%	The network request error rate percentage
http_avg_response_time	ms	The network average response duration

Threshold determination

For the threshold determination, the judgement is made by the eBPF Agent based on the target monitoring process in its own memory, rather than relying on calculations performed by the SkyWalking backend. The advantage of this approach is that it doesn’t have to wait for the results of complex backend computations, and it reduces potential issues brought about by complicated interactions.

By using this method, the eBPF Agent can swiftly initiate tasks immediately after conditions are met, without any delay.

It includes the following configuration items:

Threshold: Check if the monitoring value meets the specified expectations.
Period: The time period (in seconds) of monitoring data to check, which can also be understood as the most recent duration.
Count: The number of times the threshold is triggered within the detection period, which can also be understood as the total number of times the specified threshold rule is triggered in the most recent duration (in seconds). Once the count check is met, the specified profiling task is started.

Trigger task

When the eBPF Agent detects that the threshold determination in the specified policy meets the rules, it can initiate the corresponding task according to pre-configured rules. For each different target performance task, their task initiation parameters are different:

On/Off CPU Profiling: It automatically performs performance analysis on processes that meet the conditions, defaulting to 10 minutes of monitoring.
Network Profiling: It performs network performance analysis on all processes in the same Service Instance on the current machine, so that the cause of the issue is not missed because too few processes were collected. It defaults to 10 minutes of monitoring.

Once a task is initiated, no new profiling tasks are started for the current process for a certain period. The main reason for this is to prevent frequent task creation due to low threshold settings, which could affect program execution. The default time period is 20 minutes.
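The suppression period can be sketched as a per-process timestamp check. This is a minimal Python illustration, not the eBPF Agent's code: the class name `TaskCooldown` and its method are hypothetical, and only the 20-minute default comes from the text above.

```python
class TaskCooldown:
    """Allows at most one profiling task per process within a cooldown period."""

    def __init__(self, cooldown_seconds=20 * 60):
        self.cooldown_seconds = cooldown_seconds
        self._last_started = {}  # pid -> time (in seconds) the last task started

    def try_start(self, pid, now):
        """Return True and record the start if the process is not cooling down."""
        last = self._last_started.get(pid)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window, skip this trigger
        self._last_started[pid] = now
        return True


cooldown = TaskCooldown()
print(cooldown.try_start(pid=42, now=0))        # prints True
print(cooldown.try_start(pid=42, now=10 * 60))  # prints False
print(cooldown.try_start(pid=42, now=20 * 60))  # prints True
```

Time is passed in explicitly so the behavior is easy to test; a real agent would read a monotonic clock instead.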

Data Flow

Figure 1 illustrates the data flow of the continuous profiling feature:

Figure 1: Data Flow of Continuous Profiling

eBPF Agent with Process

Firstly, we need to ensure that the eBPF Agent and the process to be monitored are deployed on the same host machine, so that we can collect relevant data from the process. When the eBPF Agent detects a threshold validation rule that conforms to the policy, it immediately triggers the profiling task for the target process, thereby reducing any intermediate steps and accelerating the ability to pinpoint performance issues.

Sliding window

The sliding window plays a crucial role in the eBPF Agent’s threshold determination process, as illustrated in Figure 2:

Figure 2: Sliding Window in eBPF Agent

Each element in the array represents the data value for one specific second. When the sliding window needs to check whether a rule is satisfied, it reads the most recent elements, as many as the period parameter specifies. If an element exceeds the threshold, it is marked in red and counted. If the number of red elements satisfies the count parameter, the rule is deemed to trigger a task.

Using a sliding window offers the following two advantages:

Fast retrieval of recent content: With a sliding window, complex calculations are unnecessary. You can know the data by simply reading a certain number of recent array elements.
Solving data spikes issues: Validation through count prevents situations where a data point suddenly spikes and then quickly returns to normal. Verification with multiple values can reveal whether exceeding the threshold is frequent or occasional.
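A minimal sketch of such a sliding window is shown below. The `SlidingWindow` class is hypothetical and only mirrors the threshold, period, and count parameters from the text. Whether the real agent compares with "greater than" or "greater than or equal" is not stated in this article, so the `>=` comparisons here are an assumption.

```python
from collections import deque


class SlidingWindow:
    """Keeps one sample per second and evaluates a threshold/period/count rule."""

    def __init__(self, capacity: int = 60) -> None:
        # A bounded deque drops the oldest sample once capacity is reached.
        self._values = deque(maxlen=capacity)

    def add(self, value: float) -> None:
        """Append the sample for the current second."""
        self._values.append(value)

    def matched(self, threshold: float, period: int, count: int) -> bool:
        """Check whether enough of the last `period` samples reach the threshold."""
        if period <= 0:
            return False
        recent = list(self._values)[-period:]
        # Assumption: a sample "hits" when it is at or above the threshold.
        hits = sum(1 for v in recent if v >= threshold)
        return hits >= count
```

A single spike produces one hit, so a rule with a count of 2 or more ignores it, which is the spike-filtering behavior described above.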

eBPF Agent with SkyWalking Backend

The eBPF Agent communicates periodically with the SkyWalking backend. This involves three key operations:

Policy synchronization: Through periodic policy synchronization, the eBPF Agent can keep processes on the local machine updated with the latest policy rules as much as possible.
Metrics sending: For processes that are already being monitored, the eBPF Agent periodically sends the collected data to the backend program. This facilitates real-time query of current data values by users, who can also compare this data with historical values or thresholds when problems arise.
Profiling task reporting: When the eBPF Agent detects that a certain process has triggered a policy rule, it automatically initiates a profiling task, collects relevant information from the current process, and reports it to the SkyWalking backend. This allows users to see in the UI when, why, and what type of profiling task was triggered.

Demo

Next, let’s quickly demonstrate the continuous profiling feature, so you can understand more specifically what it accomplishes.

Deploy SkyWalking Showcase

SkyWalking Showcase contains a complete set of example services and can be monitored using SkyWalking. For more information, please check the official documentation.

In this demo, we only deploy the services, the latest released SkyWalking OAP, and the UI.

export SW_OAP_IMAGE=apache/skywalking-oap-server:9.5.0
export SW_UI_IMAGE=apache/skywalking-ui:9.5.0
export SW_ROVER_IMAGE=apache/skywalking-rover:0.5.0

export FEATURE_FLAGS=mesh-with-agent,single-node,elasticsearch,rover
make deploy.kubernetes

After deployment is complete, run the following command and then open the SkyWalking UI at http://localhost:8080/.

kubectl port-forward svc/ui 8080:8080 --namespace default

Create Continuous Profiling Policy

Currently, the continuous profiling feature is set up by default at the Service level in the Service Mesh panel.

Figure 3: Continuous Policy Tab

By clicking the edit button next to the Policy List, you can create or update the policies of the current service.

Figure 4: Edit Continuous Profiling Policy

Multiple policies are supported. Every policy has the following configurations.

Target Type: Specifies the type of profiling task to be triggered when the threshold determination is met.
Items: For profiling task of the same target, one or more validation items can be specified. As long as one validation item meets the threshold determination, the corresponding performance analysis task will be launched.
1. Monitor Type: Specifies the type of monitoring to be carried out for the target process.
2. Threshold: Depending on the type of monitoring, you need to fill in the corresponding threshold to complete the verification work.
3. Period: Specifies the number of recent seconds of data you want to monitor.
4. Count: Specifies how many seconds within the recent period must meet the threshold for the task to be triggered.
5. URI Regex/List: This is applicable to HTTP monitoring types, allowing URL filtering.

Done

After clicking the save button, you can see the currently created monitoring rules, as shown in Figure 5:

Figure 5: Continuous Profiling Monitoring Processes

The data can be divided into the following parts:

Policy list: On the left, you can see the rule list you have created.
Monitoring Summary List: Once a rule is selected, you can see which pods and processes would be monitored by this rule. It also summarizes how many profiling tasks have been triggered in the last 48 hours by the current pod or process, as well as the last trigger time. This list is also sorted in descending order by the number of triggers to facilitate your quick review.

When you click on a specific process, a new dashboard appears, listing the metrics and the triggered profiling results.

Figure 6: Continuous Profiling Triggered Tasks

The current figure contains the following data contents:

Task Timeline: It lists all profiling tasks in the past 48 hours. When the mouse hovers over a task, it also displays detailed information:
1. Task start and end time: It indicates when the current profiling task was triggered.
2. Trigger reason: It displays why the current process was profiled and lists the value of the metric that exceeded the threshold when profiling was triggered, so you can quickly understand the reason.
Task Detail: Similar to the CPU Profiling and Network Profiling introduced in previous articles, this would display the flame graph or process topology map of the current task, depending on the profiling type.

Meanwhile, the Metrics tab shows the metrics related to the profiling policies along with their historical trend, which helps explain why profiling was triggered at a given point.

Figure 7: Continuous Profiling Metrics

Conclusion

In this article, I have detailed how the continuous profiling feature in SkyWalking and eBPF works. In general, it involves deploying the eBPF Agent service on the same machine where the process to be monitored resides, and monitoring the target process with low resource consumption. When the threshold conditions are met, it initiates the more complex CPU Profiling and Network Profiling tasks.

In the future, we will offer even more features. Stay tuned!

Twitter: ASFSkyWalking
Slack: Send a mail with the subject “Request to join SkyWalking Slack” to the mailing list (dev@skywalking.apache.org), and we will invite you in.
Subscribe to our Medium list.

Zh: Automated Performance Analysis - Continuous Profiling

Sun, 25 Jun 2023 00:00:00 +0000

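Background

In previous articles, we discussed how to use SkyWalking and eBPF to detect performance issues in processes and networks. These methods locate problems well, but some challenges remain:

Task start time: When performance monitoring is needed, identifying the process that needs it is always a challenge. It usually requires manual involvement to identify the process and the type of profiling required, which costs extra time during crash recovery. Root cause analysis and crash recovery time sometimes conflict. In practice, a restart is often the first choice for recovery, but it also destroys the scene of the crash.
Resource consumption of tasks: It is difficult to determine the profiling scope. Too wide a scope requires more resources. We need a way to manage resource consumption and to know which processes need profiling.
Engineer capability: On-call duty is usually shared by the whole team, which includes both junior and senior engineers. Even senior engineers have limits in understanding a complex distributed system, and a single person can hardly understand the whole system.

Continuous Profiling is a new mechanism that addresses these problems.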

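Automatic Profiling

Because profiling consumes resources and demands considerable experience, we introduce a way to narrow the scope and let senior SRE engineers create policies that trigger profiling automatically. Therefore, in 9.5.0, SkyWalking first introduced preset policy rules that let the eBPF Agent monitor specific services with low overhead and run profiling automatically when necessary.

Policy

Policy rules specify how to monitor target processes and determine which type of profiling task should be started when certain threshold conditions are met.

These policy rules mainly include the following configuration:

Monitoring type: Specifies what kind of monitoring should be applied to the target process.
Threshold determination: Defines how to determine whether the target process needs a profiling task.
Trigger task: Specifies what type of profiling task should be started.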

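Monitoring type

The monitoring type is determined by observing the data values of the specified process to generate corresponding metrics. These metric values support the subsequent threshold determination. In eBPF observation, we believe the following metrics most directly reflect the current performance of a program:

Monitor Type	Unit	Description
System Load	Load	The average system load over a specified period.
Process CPU	Percentage	The CPU usage percentage of the process.
Process Thread Count	Count	The number of threads in the process.
HTTP Error Rate	Percentage	The percentage of HTTP requests that result in an error response (for example, 4xx or 5xx status codes).
HTTP Avg Response Time	Millisecond	The average response time of HTTP requests.

Metrics collector

The eBPF Agent periodically reports the following process metrics to indicate process performance:

Name	Unit	Description
process_cpu	(0-100)%	The CPU usage percentage
process_thread_count	Count	The number of threads in the process
system_load	Count	The average system load over the last minute; the value is the same for every process
http_error_rate	(0-100)%	The error rate percentage of network requests
http_avg_response_time	Millisecond	The average response duration of network requests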

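Threshold determination

For threshold determination, the eBPF Agent makes its judgment based on the data of the monitored target process held in its own memory, rather than relying on computation performed by the SkyWalking backend. The advantage of this approach is that it does not have to wait for the results of complex backend computation, which reduces the potential problems caused by complex interactions.

With this approach, the eBPF Agent can start a task immediately once the conditions are met, without any delay.

It includes the following configuration items:

Threshold: Checks whether the monitored value meets the specified expectation.
Period: The time period (in seconds) of the monitored data, which can also be understood as the most recent duration.
Count: The number of seconds in which the threshold is triggered within the detection period, that is, the total number of times the threshold rule was triggered in the most recent duration. Once the count check is met, the specified profiling task is started.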

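Trigger task

When the eBPF Agent detects that the threshold determination in the specified policy meets the rules, it can start the corresponding task according to the pre-configured rules. Each target profiling task has different start parameters:

On/Off CPU Profiling: It automatically profiles the processes that meet the conditions. By default, monitoring lasts 10 minutes.
Network Profiling: It performs network performance analysis on all processes in the same Service Instance on the current machine, so that the cause of the issue is not missed because too few processes were collected. By default, monitoring lasts 10 minutes.

Once a task is started, no new profiling task is started for the current process for a certain period. This prevents frequent task creation caused by low threshold settings, which could affect program execution. The default period is 20 minutes.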

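Data Flow

Figure 1 illustrates the data flow of the continuous profiling feature:

Figure 1: Data Flow of Continuous Profiling

eBPF Agent with Process

First, we need to ensure that the eBPF Agent and the process to be monitored are deployed on the same host, so that we can collect relevant data from the process. When the eBPF Agent detects that a threshold validation rule of the policy is met, it immediately triggers the profiling task for the target process, which removes intermediate steps and speeds up pinpointing performance issues.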

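Sliding window

The sliding window plays a crucial role in the eBPF Agent’s threshold determination process, as illustrated in Figure 2:

Figure 2: Sliding Window in eBPF Agent

Each element in the array represents the data value for one specific second. When the sliding window needs to check whether a rule is satisfied, it reads the most recent elements, as many as the period parameter specifies. If an element exceeds the threshold, it is marked in red and counted. If the number of red elements exceeds a certain number, the rule is deemed to trigger a task.

Using a sliding window offers the following two advantages:

Fast retrieval of recent content: With a sliding window, complex calculations are unnecessary. You can know the data by simply reading a certain number of recent array elements.
Solving data spike issues: Validation through count filters out situations where a data point suddenly spikes and then quickly returns to normal. Verification with multiple values can reveal whether exceeding the threshold is frequent or occasional.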

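eBPF Agent with OAP Backend

The eBPF Agent communicates periodically with the SkyWalking backend. This involves three key operations:

Policy synchronization: Through periodic policy synchronization, the eBPF Agent keeps the processes on the local machine as up to date as possible with the latest policy rules.
Metrics sending: For processes that are already being monitored, the eBPF Agent periodically sends the collected data to the backend. This lets users query current data values in real time and compare them with historical values or thresholds when problems arise.
Profiling task reporting: When the eBPF Agent detects that a certain process has triggered a policy rule, it automatically starts a profiling task, collects relevant information from the current process, and reports it to the SkyWalking backend. This allows users to see in the UI when, why, and what type of profiling task was triggered.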

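Demo

Next, let’s quickly demonstrate the continuous profiling feature, so you can understand more specifically what it does.

Deploy SkyWalking Showcase

SkyWalking Showcase contains a complete set of example services that can be monitored using SkyWalking. For more information, please check the official documentation.

In this demo, we only deploy the services, the latest released SkyWalking OAP, and the UI.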

export SW_OAP_IMAGE=apache/skywalking-oap-server:9.5.0
export SW_UI_IMAGE=apache/skywalking-ui:9.5.0
export SW_ROVER_IMAGE=apache/skywalking-rover:0.5.0

export FEATURE_FLAGS=mesh-with-agent,single-node,elasticsearch,rover
make deploy.kubernetes

After deployment is complete, run the following command and then open the SkyWalking UI at http://localhost:8080/.

kubectl port-forward svc/ui 8080:8080 --namespace default

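Create Continuous Profiling Policy

Currently, the continuous profiling feature is set up by default at the Service level in the Service Mesh panel.

Figure 3: Continuous Policy Tab

By clicking the edit button next to the Policy List, you can create or update the policies of the current service.

Figure 4: Edit Continuous Profiling Policy

Multiple policies are supported. Every policy has the following configurations.

Target Type: Specifies the type of profiling task to be triggered when the threshold determination is met.
Items: For profiling tasks of the same target, one or more validation items can be specified. As long as one validation item meets the threshold determination, the corresponding profiling task is launched.
1. Monitor Type: Specifies the type of monitoring to be carried out for the target process.
2. Threshold: Depending on the type of monitoring, you need to fill in the corresponding threshold to complete the verification.
3. Period: Specifies the number of recent seconds of data you want to monitor.
4. Count: Specifies how many seconds within the recent period must meet the threshold for the task to be triggered.
5. URI Regex/List: Applies to HTTP monitoring types and allows URL filtering.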

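Done

After clicking the save button, you can see the currently created monitoring rules, as shown in Figure 5:

Figure 5: Continuous Profiling Monitoring Processes

The data can be divided into the following parts:

Policy list: On the left, you can see the list of rules you have created.
Monitoring Summary List: Once a rule is selected, you can see which pods and processes are monitored by this rule. It also summarizes how many profiling tasks have been triggered in the last 48 hours by the current pod or process, as well as the last trigger time. This list is sorted in descending order by the number of triggers for quick review.

When you click on a specific process, a new dashboard appears, listing the metrics and the triggered profiling results.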

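Figure 6: Continuous Profiling Triggered Tasks

The current figure contains the following data:

Task Timeline: It lists all profiling tasks in the past 48 hours. When the mouse hovers over a task, it also displays detailed information:
1. Task start and end time: It indicates when the current profiling task was triggered.
2. Trigger reason: It displays why the current process was profiled and lists the value of the metric that exceeded the threshold when profiling was triggered, so you can quickly understand the reason.
Task Detail: Similar to the CPU Profiling and Network Profiling introduced in previous articles, it displays the flame graph or process topology map of the current task, depending on the profiling type.

Meanwhile, the Metrics tab shows the metrics related to the profiling policies along with their historical trend, which helps explain why profiling was triggered at a given point.

Figure 7: Continuous Profiling Metrics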

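Conclusion

In this article, I have detailed how the continuous profiling feature in SkyWalking and eBPF works. In general, it involves deploying the eBPF Agent service on the same machine where the process to be monitored resides, and monitoring the target process with low resource consumption. When the threshold conditions are met, it starts the more complex CPU Profiling and Network Profiling tasks.

In the future, we will offer even more features. Stay tuned!

Twitter: ASFSkyWalking
Slack: Send a mail with the subject “Request to join SkyWalking Slack” to the mailing list (dev@skywalking.apache.org), and we will invite you in.
Subscribe to our Medium list.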

Blog: eBPF enhanced HTTP observability - L7 metrics and tracing

Thu, 12 Jan 2023 00:00:00 +0000

Background

Apache SkyWalking is an open-source Application Performance Management system that helps users collect and aggregate logs, traces, metrics, and events for display on a UI. In the previous article, we introduced how to use Apache SkyWalking Rover to analyze the network performance issue in the service mesh environment. However, in business scenarios, users often rely on mature layer 7 protocols, such as HTTP, for interactions between systems. In this article, we will discuss how to use eBPF techniques to analyze performance bottlenecks of layer 7 protocols and how to enhance the tracing system using network sampling.

This article will show how to use Apache SkyWalking with eBPF to enhance metrics and traces in HTTP observability.

HTTP Protocol Analysis

HTTP is one of the most common Layer 7 protocols and is usually used to provide services to external parties and for inter-system communication. In the following sections, we will show how to identify and analyze HTTP/1.x protocols.

Protocol Identification

In HTTP/1.x, the client and server communicate through a single file descriptor (FD) on each side. Figure 1 shows the process of communication involving the following steps:

Connect/accept: The client establishes a connection with the HTTP server, or the server accepts a connection from the client.
Read/write (multiple times): The client or server reads and writes HTTP requests and responses. A single request-response pair occurs within the same connection on each side.
Close: The client and server close the connection.

To obtain HTTP content, it’s necessary to read it from the second step of this process. As defined in the RFC, the content is contained within the data of the Layer 4 protocol and can be obtained by parsing the data. The request and response pair can be correlated because they both occur within the same connection on each side.
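As a rough illustration of the identification step, the first line of a captured Layer 4 payload can be checked against the HTTP/1.x start-line format. The sketch below is an assumption about how such a check might look, not the code used by SkyWalking Rover; `classify_http1` and `HTTP_METHODS` are hypothetical names.

```python
HTTP_METHODS = (
    b"GET", b"POST", b"PUT", b"DELETE", b"HEAD",
    b"OPTIONS", b"PATCH", b"CONNECT", b"TRACE",
)


def classify_http1(buf: bytes) -> str:
    """Return "request", "response", or "unknown" for a captured buffer."""
    line, sep, _ = buf.partition(b"\r\n")
    if not sep:
        return "unknown"  # no complete start line in this buffer
    parts = line.split(b" ")
    # Status line: HTTP-version SP status-code SP reason-phrase
    if line.startswith(b"HTTP/1.") and len(parts) >= 2 and parts[1].isdigit():
        return "response"
    # Request line: method SP request-target SP HTTP-version
    if len(parts) == 3 and parts[0] in HTTP_METHODS and parts[2].startswith(b"HTTP/1."):
        return "request"
    return "unknown"
```

Encrypted or binary payloads fall into the "unknown" case because they do not carry a readable start line.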

Figure 1: HTTP communication timeline.

HTTP Pipeline

HTTP pipelining is a feature of HTTP/1.1 that enables multiple HTTP requests to be sent over a single TCP connection without waiting for the corresponding responses. This feature matters for parsing because the server must return its responses in the same order as the requests it received.

Figure 2 illustrates how this works. Consider the following scenario: an HTTP client sends multiple requests to a server, and the server responds by sending the HTTP responses in the same order as the requests. This means that the first request sent by the client will receive the first response from the server, the second request will receive the second response, and so on.

When designing HTTP parsing, we should follow this principle by adding request data to a list and removing the first item when parsing a response. This ensures that the responses are processed in the correct order.
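The list-based matching described above can be sketched with a first-in, first-out queue per connection. `PipelineMatcher` is a hypothetical name used only for illustration, and requests and responses are represented by plain values.

```python
from collections import deque


class PipelineMatcher:
    """Pairs responses with requests on one connection in FIFO order."""

    def __init__(self) -> None:
        self._pending = deque()  # requests still waiting for a response

    def on_request(self, request):
        """Remember a parsed request until its response arrives."""
        self._pending.append(request)

    def on_response(self, response):
        """Pair the response with the oldest pending request, if any."""
        if not self._pending:
            return None  # response seen without a captured request
        return (self._pending.popleft(), response)
```

For example, after two calls to `on_request`, the first `on_response` call is paired with the first request and the second with the second.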

Figure 2: HTTP/1.1 pipeline.

Metrics

Based on the identification of the HTTP content and process topology diagram mentioned in the previous article, we can combine these two to generate process-to-process metrics data.

Figure 3 shows the metrics currently supported for analysis between two processes. Based on the HTTP request and response data, we can analyze the following data:

Metrics Name	Type	Unit	Description
Request CPM (Calls Per Minute)	Counter	count	The HTTP request count
Response Status CPM (Calls Per Minute)	Counter	count	The count per HTTP response status code
Request Package Size	Counter/Histogram	Byte	The request package size
Response Package Size	Counter/Histogram	Byte	The response package size
Client Duration	Counter/Histogram	Millisecond	The duration of a single HTTP response on the client side
Server Duration	Counter/Histogram	Millisecond	The duration of a single HTTP response on the server side

Figure 3: Process-to-process metrics.

HTTP and Trace

During the HTTP process, if we unpack the HTTP requests and responses from raw data, we can use this data to correlate with the existing tracing system.

Trace Context Identification

In order to track the flow of requests between multiple services, the trace system usually creates a trace context when a request enters a service and passes it along to other services during the request-response process. For example, when an HTTP request is sent to another server, the trace context is included in the request header.

Figure 4 displays the raw content of an HTTP request intercepted by Wireshark. The trace context information generated by the Zipkin Tracing system can be identified by the “X-B3” prefix in the header. By using eBPF to intercept the trace context in the HTTP header, we can connect the current request with the trace system.
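To make the header-based identification concrete, the sketch below pulls the Zipkin B3 headers out of raw HTTP request bytes. It runs in user space on an already captured buffer and is only an illustration: `extract_b3` is a hypothetical helper, and the real eBPF-based interception is not shown here.

```python
def extract_b3(raw_request: bytes) -> dict:
    """Collect the X-B3-* headers from a raw HTTP/1.x request."""
    head, _, _ = raw_request.partition(b"\r\n\r\n")  # drop the body
    context = {}
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        key = name.strip().lower()  # header names are case-insensitive
        if sep and key.startswith(b"x-b3-"):
            context[key.decode("ascii", "replace")] = value.strip().decode("ascii", "replace")
    return context
```

A request without any B3 header yields an empty dictionary, which means it cannot be connected to a Zipkin trace.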

Figure 4: View of HTTP headers in Wireshark.

Trace Event

We have added the concept of an event to traces. An event can be attached to a span and consists of start and end times, tags, and summaries, allowing us to attach any desired information to the Trace.

When performing eBPF network profiling, two events can be generated based on the request-response data. Figure 5 illustrates what happens when a service performs an HTTP request with profiling. The trace system generates trace context information and sends it in the request. When the service executes in the kernel, we can generate an event for the corresponding trace span by interacting with the request-response data and execution time in the kernel space.

Previously, we could only observe the execution status in user space. By combining traces and eBPF technologies, we can now also get more information about the current trace in kernel space, with less performance impact on the target service than doing similar work in the tracing SDK or agent.

Figure 5: Logical view of profiling an HTTP request and response.

Sampling

To ensure efficient data storage and minimize unnecessary data sampling, we use a sampling mechanism for traces in our system. This mechanism triggers sampling only when certain conditions are met. We also provide a list of the top N traces, which allows users to quickly access the relevant request information for a specific trace.

To help users easily identify and analyze relevant events, we offer three different sampling rules:

Slow Traces: Sampling is triggered when the response time for a request exceeds a specified threshold.
Response Status [400, 500): Sampling is triggered when the response status code is greater than or equal to 400 and less than 500.
Response Status [500, 600): Sampling is triggered when the response status code is greater than or equal to 500 and less than 600.

In addition, we recognize that not all request or response raw data may be necessary for analysis. For example, users may be more interested in request data when trying to identify performance issues, while they may be more interested in response data when troubleshooting errors. As such, we also provide configuration options for request or response events to allow users to specify which type of data they would like to sample.
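The three sampling rules can be expressed as a small decision function. This is a sketch under assumptions: `sampling_reasons` and the reason labels are hypothetical, and "exceeds" is read as a strict comparison.

```python
from typing import List


def sampling_reasons(duration_ms: float, status: int, slow_threshold_ms: float,
                     sample_4xx: bool = True, sample_5xx: bool = True) -> List[str]:
    """Return which sampling rules a finished request matches."""
    reasons = []
    if duration_ms > slow_threshold_ms:
        reasons.append("slow")  # Slow Traces rule
    if sample_4xx and 400 <= status < 500:
        reasons.append("status_4xx")  # Response Status [400, 500)
    if sample_5xx and 500 <= status < 600:
        reasons.append("status_5xx")  # Response Status [500, 600)
    return reasons
```

A request can match more than one rule, for example a slow request that also ends with a 503 status. An empty list means the request is not sampled.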

Profiling in a Service Mesh

The SkyWalking and SkyWalking Rover projects have already implemented HTTP protocol analysis and trace association. How do they perform when running in a service mesh environment?

Deployment

Figure 6 demonstrates the deployment of SkyWalking and SkyWalking Rover in a service mesh environment. SkyWalking Rover is deployed as a DaemonSet on each machine where a service is located and communicates with the SkyWalking backend cluster. It automatically recognizes the services on the machine and reports metadata information to the SkyWalking backend cluster. When a new network profiling task arises, SkyWalking Rover senses the task and analyzes the designated processes, collecting and aggregating network data before ultimately reporting it back to the SkyWalking backend service.

Figure 6: SkyWalking Rover deployment topology in a service mesh.

Tracing Systems

Starting from version 9.3.0, the SkyWalking backend fully supports all functions in the Zipkin server. Therefore, the SkyWalking backend can collect traces from both the SkyWalking and Zipkin protocols. Similarly, SkyWalking Rover can identify and analyze trace context in both the SkyWalking and Zipkin trace systems. In the following two sections, network analysis results will be displayed in the SkyWalking and Zipkin UI respectively.

SkyWalking

When SkyWalking performs network profiling, similar to the TCP metrics in the previous article, the SkyWalking UI will first display the topology between processes. When you open the dashboard of the line representing the traffic metrics between processes, you can see the metrics of HTTP traffic from the “HTTP/1.x” tab and the sampled HTTP requests with tracing in the “HTTP Requests” tab.

As shown in Figure 7, there are three lists in the tab, each corresponding to a condition in the event sampling rules. Each list displays the traces that meet the pre-specified conditions. When you click on an item in the trace list, you can view the complete trace.

Figure 7: Sampled HTTP requests within tracing context.

In the trace view shown in Figure 8, the span of the current service carries a tag with a number indicating how many HTTP events are related to that span.

Since we are in a service mesh environment, each service involves interacting with Envoy. Therefore, the current span includes Envoy’s request and response information. Additionally, since the current service has both incoming and outgoing requests, there are events in the corresponding span.

Figure 8: Events in the trace detail.

When the span is clicked, the details of the span will be displayed. If there are events in the current span, the relevant event information will be displayed on a time axis. As shown in Figure 9, there are a total of 6 related events in the current Span. Each event represents a data sample of an HTTP request/response. One of the events spans multiple time ranges, indicating a longer system call time. It may be due to a blocked system call, depending on the implementation details of the HTTP request in different languages. This can also help us query the possible causes of errors.

Figure 9: Events in one trace span.

Finally, we can click on a specific event to see its complete information. As shown in Figure 10, it displays the sampling information of a request, including the SkyWalking trace context protocol contained in the request header from the HTTP raw data. The raw request data allows you to quickly replay the request when reproducing and resolving an issue.

Figure 10: The detail of the event.

Zipkin

Zipkin is one of the most widely used distributed tracing systems in the world. SkyWalking can function as an alternative server that provides advanced features for Zipkin users. We use this capability to bring the feature into the Zipkin ecosystem out of the box. The new events are represented as Zipkin tags and annotations.

To add events to a Zipkin span, we need to do the following:

Split the start and end times of each event into two annotations with a canonical name.
Add the sampled HTTP raw data from the event to the Zipkin span tags, using the same event name for corresponding purposes.

Figures 11 and 12 show annotations and tags in the same span. In these figures, we can see that the span includes at least two events with the same event name and sequence suffix (e.g., “Start/Finished HTTP Request/Response Sampling-x” in the figure). Both events have separate timestamps to represent their relative times within the span. In the tags, the data content of the corresponding event is represented by the event name and sequence number, respectively.
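The two steps can be sketched as a conversion from one event to Zipkin annotations and tags. The annotation shape (`timestamp` in epoch microseconds plus `value`) follows the Zipkin v2 span format. The function name and the exact label strings are assumptions patterned on the example names in the text, not the names SkyWalking necessarily emits.

```python
from typing import Dict, List, Tuple


def event_to_zipkin(name: str, seq: int, start_us: int, end_us: int,
                    raw_data: str) -> Tuple[List[dict], Dict[str, str]]:
    """Turn one sampled event into two annotations and one tag."""
    label = f"{name}-{seq}"  # event name plus sequence suffix (assumed format)
    annotations = [
        {"timestamp": start_us, "value": f"Start {label}"},
        {"timestamp": end_us, "value": f"Finished {label}"},
    ]
    tags = {label: raw_data}  # raw HTTP data keyed by the same event name
    return annotations, tags
```

Because the tag key reuses the event label, a reader can match the raw data in the tags to the pair of timestamps in the annotations.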

Figure 11: Event timestamp in the Zipkin span annotation.

Figure 12: Event raw data in the Zipkin span tag.

Demo

In this section, we demonstrate how to perform network profiling in a service mesh and complete metrics collection and HTTP raw data sampling. To follow along, you will need a running Kubernetes environment.

Deploy SkyWalking Showcase

SkyWalking Showcase contains a complete set of example services and can be monitored using SkyWalking. For more information, please check the official documentation.

In this demo, we only deploy the services, the latest released SkyWalking OAP, and the UI.

export SW_OAP_IMAGE=apache/skywalking-oap-server:9.3.0
export SW_UI_IMAGE=apache/skywalking-ui:9.3.0
export SW_ROVER_IMAGE=apache/skywalking-rover:0.4.0

export FEATURE_FLAGS=mesh-with-agent,single-node,elasticsearch,rover
make deploy.kubernetes

After deployment is complete, run the following command and then open the SkyWalking UI at http://localhost:8080/.

kubectl port-forward svc/ui 8080:8080 --namespace default

Start Network Profiling Task

Currently, we can select the specific instances that we wish to monitor by clicking the Data Plane item in the Service Mesh panel and the Service item in the Kubernetes panel.

In Figure 13, we have selected an instance with a list of tasks in the network profiling tab.

Figure 13: Network Profiling tab in the Data Plane.

When we click the Start button, as shown in Figure 14, we need to specify the sampling rules for the profiling task. The sampling rules consist of one or more rules, each of which is distinguished by a different URI regular expression. When the HTTP request URI matches the regular expression, the rule is used. If the URI regular expression is empty, the default rule is used. Using multiple rules can help us make different sampling configurations for different requests.

Each rule has three parameters to determine if sampling is needed:

Minimal Request Duration (ms): requests with a response time exceeding the specified time will be sampled.
Sampling response status code between 400 and 499: all status codes in the range [400, 500) will be sampled.
Sampling response status code between 500 and 599: all status codes in the range [500, 600) will be sampled.

Once the sampling configuration is complete, we can create the task.
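Rule selection by URI can be sketched as follows. The `SamplingRule` fields mirror the three parameters above, but the class, the function, and the "first matching rule wins" order are assumptions made for illustration. The rule with an empty regular expression is treated as the default.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SamplingRule:
    uri_regex: str        # empty string marks the default rule
    min_duration_ms: int  # Minimal Request Duration (ms)
    sample_4xx: bool
    sample_5xx: bool


def select_rule(rules: List[SamplingRule], uri: str) -> Optional[SamplingRule]:
    """Pick the rule whose URI regex matches, else fall back to the default."""
    default = None
    for rule in rules:
        if rule.uri_regex == "":
            default = rule
        elif re.search(rule.uri_regex, uri):
            return rule  # assumption: the first matching rule wins
    return default
```

With one rule for `^/api/` and one default rule, a request to `/api/users` uses the first and a request to `/health` uses the default.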

Figure 14: Create network profiling task page.

Done!

After a few seconds, you will see the process topology appear on the right side of the page.

When you click on the line between processes, you can view the data between the two processes, which is divided into three tabs:

TCP: displays TCP-related metrics.
HTTP/1.x: displays metrics in the HTTP 1 protocol.
HTTP Requests: displays the analyzed request and saves it to a list according to the sampling rule.

Figure 16: TCP metrics in a network profiling task.

Figure 17: HTTP/1.x metrics in a network profiling task.

Figure 18: HTTP sampled requests in a network profiling task.

Conclusion

In this article, we gave an overview of how to analyze the Layer 7 HTTP/1.x protocol in network analysis, and how to associate it with existing trace systems. This allows us to extend the scope of data we can observe from just user space to also include kernel-space data.

In the future, we will delve further into the analysis of kernel data, such as collecting information on TCP packet size, transmission frequency, and the network card, to help enhance distributed tracing from another perspective.

Blog: Diagnose Service Mesh Network Performance with eBPF

Tue, 27 Sep 2022 00:00:00 +0000

Background

This article will show how to use Apache SkyWalking with eBPF to make network troubleshooting easier in a service mesh environment.

Apache SkyWalking is an application performance monitoring tool for distributed systems. It observes metrics, logs, traces, and events in the service mesh environment and uses that data to generate a dependency graph of your pods and services. This dependency graph can provide quick insights into your system, especially when there’s an issue.

However, when troubleshooting network issues in SkyWalking’s service topology, it is not always easy to pinpoint where the error actually is. There are two reasons for the difficulty:

Traffic through the Envoy sidecar is not easy to observe. Data from Envoy’s Access Log Service (ALS) shows traffic between services (sidecar-to-sidecar), but not metrics on communication between the Envoy sidecar and the service it proxies. Without that information, it is more difficult to understand the impact of the sidecar.
There is a lack of data from transport layer (OSI Layer 4) communication. Since services generally use application layer (OSI Layer 7) protocols such as HTTP, observability data is generally restricted to application layer communication. However, the root cause may actually be in the transport layer, which is typically opaque to observability tools.

Access to metrics from Envoy-to-service and transport layer communication can make it easier to diagnose service issues. To this end, SkyWalking needs to collect and analyze transport layer metrics between processes inside Kubernetes pods - a task well suited to eBPF. We investigated using eBPF for this purpose and present our results and a demo below.

Monitoring Kubernetes Networks with eBPF

With its origins as the Extended Berkeley Packet Filter, eBPF is a general-purpose mechanism for injecting and running your own code in the Linux kernel and is an excellent tool for monitoring network traffic in Kubernetes Pods. In the next few sections, we'll provide an overview of how to use eBPF for network monitoring as background for introducing SkyWalking Rover, a metrics collector and profiler powered by eBPF to diagnose CPU and network performance.

How Applications and the Network Interact

Interactions between the application and the network can generally be divided into the following steps from higher to lower levels of abstraction:

User Code: Application code uses high-level network libraries in the application stack to exchange data across the network, like sending and receiving HTTP requests.
Network Library: When the network library receives a network request, it interacts with the language API to send the network data.
Language API: Each language provides an API for operating the network, system, etc. When a request is received, it interacts with the system API. In Linux, this API is called syscalls.
Linux API: When the Linux kernel receives the request through the API, it communicates with the socket to send the data, which is usually closer to an OSI Layer 4 protocol, such as TCP, UDP, etc.
Socket Ops: Sending or receiving the data to/from the NIC.

Our hypothesis is that eBPF can monitor the network. There are two ways to implement the interception: User space (uprobe) or Kernel space (kprobe). The table below summarizes the differences.

	Pros	Cons
uprobe	• Get more application-related contexts, such as whether the current request is HTTP or HTTPS. • Requests and responses can be intercepted by a single method	• Data structures can be unstable, so it is more difficult to get the desired data. • Implementation may differ between language/library versions. • Does not work in applications without symbol tables.
kprobe	• Available for all languages. • The data structure and methods are stable and do not require much adaptation. • Easier correlation with underlying data, such as getting the destination address of TCP, OSI Layer 4 protocol metrics, etc.	• A single request and response may be split into multiple probes. • Contextual information is not easy to get for stateful requests, for example, header compression in HTTP/2.

For general network performance monitoring, we chose to use kprobes (intercepting the syscalls) for the following reasons:

It’s available for applications written in any programming language, and it’s stable, so it saves a lot of development/adaptation costs.
It can be correlated with metrics from the system level, which makes it easier to troubleshoot.
Although a single request and response may be split across multiple probes, the pieces can be correlated afterwards.
Contextual information is mostly needed for OSI Layer 7 protocol analysis. If we only monitor network performance, it can be ignored.

Kprobes and network monitoring

Following the Linux documentation on network syscalls, we can implement network monitoring by intercepting two types of methods: socket operations and send/receive methods.

Socket Operations

When accepting or connecting with another socket, we can get the following information:

Connection information: Includes the remote address from the connection which helps us to understand which pod is connected.
Connection statistics: Includes basic metrics from sockets, such as round-trip time (RTT), lost packet count in TCP, etc.
Socket and file descriptor (FD) mapping: Includes the relationship between the Linux file descriptor and socket object. It is useful when sending and receiving data through a Linux file descriptor.

Send/Receive

The interface related to sending or receiving data is the focus of performance analysis. It mainly contains the following parameters:

Socket file descriptor: The file descriptor of the current operation corresponding to the socket.
Buffer: The data sent or received, passed as a byte array.

Based on the above parameters, we can analyze the following data:

Bytes: The size of the packet in bytes.
Protocol: The protocol analysis according to the buffer data, such as HTTP, MySQL, etc.
Execution Time: The time it takes to send/receive the data.
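To make the three derived values concrete, here is a minimal sketch in Python rather than eBPF. The function names and the returned dictionary layout are hypothetical, and the detection only covers HTTP/1.x; it is not how Rover implements protocol analysis.

```python
from typing import Optional

# HTTP/1.x request methods; a request line looks like "<METHOD> <target> HTTP/1.x".
HTTP_METHODS = (b"GET", b"POST", b"PUT", b"DELETE", b"HEAD",
                b"OPTIONS", b"PATCH", b"CONNECT", b"TRACE")


def detect_protocol(buffer: bytes) -> Optional[str]:
    """Guess the application protocol from the first bytes of a buffer."""
    # An HTTP/1.x response starts with the version, e.g. b"HTTP/1.1 200 OK".
    if buffer.startswith(b"HTTP/1."):
        return "HTTP"
    # An HTTP/1.x request starts with a method, a target, and the version.
    first_line = buffer.split(b"\r\n", 1)[0]
    parts = first_line.split(b" ")
    if len(parts) == 3 and parts[0] in HTTP_METHODS and parts[2].startswith(b"HTTP/1."):
        return "HTTP"
    return None  # unknown, or encrypted data such as a TLS record


def analyze(buffer: bytes, start_ns: int, end_ns: int) -> dict:
    """Derive bytes, protocol, and execution time for one send/receive call."""
    return {
        "bytes": len(buffer),
        "protocol": detect_protocol(buffer),
        "execution_time_ns": end_ns - start_ns,
    }
```

An encrypted buffer falls through to `None`, which is exactly the limitation the TLS section addresses.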

At this point (Figure 1) we can analyze the following steps for the whole lifecycle of the connection:

Connect/Accept: When the connection is created.
Transfer: Sending and receiving data on the connection.
Close: When the connection is closed.

Figure 1

Protocol and TLS

The previous section described how to analyze connections using send or receive buffer data, for example by parsing the buffer according to the HTTP/1.1 message specification. However, this does not work for TLS requests/responses.

Figure 2

When TLS is in use, the data is encrypted in user space before the Linux kernel transmits it. In the figure above, the application usually transmits SSL data through a third-party library (such as OpenSSL). In this case, the Linux API can only see the encrypted data, so it cannot recognize any higher-layer protocol. To get the unencrypted data inside eBPF, we need to follow these steps:

Read unencrypted data through uprobe: To stay compatible with multiple languages, use a uprobe to capture the data before it is encrypted for sending or after it is decrypted on receipt. In this way, we can get the original data.
Associate with socket: We can associate the unencrypted data with the socket.

OpenSSL Use Case

For example, the most common way to send/receive SSL data is to use OpenSSL as a shared library, specifically the SSL_read and SSL_write methods to submit the buffer data with the socket.

Following the documentation, we can intercept these two methods, which are almost identical to the API in Linux. The source code of the SSL structure in OpenSSL shows that the Socket FD exists in the BIO object of the SSL structure, and we can get it by the offset.

In summary, with knowledge of how OpenSSL works, we can read unencrypted data in an eBPF function.

Introducing SkyWalking Rover, an eBPF-based Metrics Collector and Profiler

SkyWalking Rover introduces the eBPF network profiling feature into the SkyWalking ecosystem. It’s currently supported in a Kubernetes environment, so it must be deployed inside a Kubernetes cluster. Once the deployment is complete, SkyWalking Rover can monitor the network for all processes inside a given Pod. Based on the monitoring data, SkyWalking can generate the topology relationship diagram and metrics between processes.

Topology Diagram

The topology diagram can help us understand the network access between processes inside the same Pod, and between a process and the external environment (another Pod or service). Additionally, it can identify the direction of traffic based on the flow direction of the line.

In Figure 3 below, all nodes within the hexagon are the internal processes of a Pod, and nodes outside the hexagon are externally associated services or Pods. Nodes are connected by lines, which indicate the direction of requests or responses between nodes (client or server). The protocol is indicated on the line, and it’s either HTTP(S), TCP, or TCP(TLS). Also, we can see in this figure that the line between Envoy and Python applications is bidirectional because Envoy intercepts all application traffic.

Figure 3

Metrics

Once we recognize the network call relationship between processes through the topology, we can select a specific line and view the TCP metrics between the two processes.

The diagram below (Figure 4) shows the metrics of network monitoring between two processes. There are four metrics in each line. Two on the left side are on the client side, and two on the right side are on the server side. If the remote process is not in the same Pod, only one side of the metrics is displayed.

Figure 4

The following two metric types are available:

Counter: Records the total amount of data in a certain period. Each counter contains the following data:
a. Count: Execution count.
b. Bytes: Packet size in bytes.
c. Execution time: Execution duration.
Histogram: Records the distribution of data in the buckets.
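The two metric types can be pictured as small data structures. The following Python sketch is only an illustration: the class names, field names, and bucket boundaries are assumptions, not the data model that SkyWalking Rover uses.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List


@dataclass
class Counter:
    """Totals accumulated over one reporting period."""
    count: int = 0          # execution count
    total_bytes: int = 0    # packet size in bytes
    exe_time_ns: int = 0    # execution duration

    def record(self, size: int, duration_ns: int) -> None:
        self.count += 1
        self.total_bytes += size
        self.exe_time_ns += duration_ns


@dataclass
class Histogram:
    """Distribution of values over buckets with ascending upper bounds."""
    bounds: List[int]
    buckets: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One bucket per upper bound, plus one overflow bucket at the end.
        self.buckets = [0] * (len(self.bounds) + 1)

    def observe(self, value: int) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.buckets[bisect_left(self.bounds, value)] += 1
```

For example, a `Histogram([1, 5, 10])` places a value of 3 in the second bucket and a value of 11 in the overflow bucket.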

Based on the above data types, the following metrics are exposed:

Name	Type	Unit	Description
Write	Counter and histogram	Millisecond	The socket write counter.
Read	Counter and histogram	Millisecond	The socket read counter.
Write RTT	Counter and histogram	Microsecond	The socket write round trip time (RTT) counter.
Connect	Counter and histogram	Millisecond	The socket connect/accept with another server/client counter.
Close	Counter and histogram	Millisecond	The socket close counter.
Retransmit	Counter	Millisecond	The socket retransmit packet counter.
Drop	Counter	Millisecond	The socket drop packet counter.

Demo

In this section, we demonstrate how to perform network profiling in the service mesh. To follow along, you will need a running Kubernetes environment.

NOTE: All commands and scripts are available in this GitHub repository.

Install Istio

Istio is the most widely deployed service mesh, and comes with a complete demo application that we can use for testing. To install Istio and the demo application, follow these steps:

Install Istio using the demo configuration profile.
Label the default namespace, so Istio automatically injects Envoy sidecar proxies when we deploy the application.
Deploy the bookinfo application to the cluster.
Deploy the traffic generator to generate some traffic to the application.

export ISTIO_VERSION=1.13.1

# install istio
istioctl install -y --set profile=demo
kubectl label namespace default istio-injection=enabled

# deploy the bookinfo applications
kubectl apply -f https://raw.githubusercontent.com/istio/istio/$ISTIO_VERSION/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/$ISTIO_VERSION/samples/bookinfo/networking/bookinfo-gateway.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/$ISTIO_VERSION/samples/bookinfo/networking/destination-rule-all.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/$ISTIO_VERSION/samples/bookinfo/networking/virtual-service-all-v1.yaml

# generate traffic
kubectl apply -f https://raw.githubusercontent.com/mrproliu/skywalking-network-profiling-demo/main/resources/traffic-generator.yaml

Install SkyWalking

The following will install the storage, backend, and UI needed for SkyWalking:

git clone https://github.com/apache/skywalking-kubernetes.git
cd skywalking-kubernetes
cd chart
helm dep up skywalking
helm -n istio-system install skywalking skywalking \
 --set fullnameOverride=skywalking \
 --set elasticsearch.minimumMasterNodes=1 \
 --set elasticsearch.imageTag=7.5.1 \
 --set oap.replicas=1 \
 --set ui.image.repository=apache/skywalking-ui \
 --set ui.image.tag=9.2.0 \
 --set oap.image.tag=9.2.0 \
 --set oap.envoy.als.enabled=true \
 --set oap.image.repository=apache/skywalking-oap-server \
 --set oap.storageType=elasticsearch \
 --set oap.env.SW_METER_ANALYZER_ACTIVE_FILES='network-profiling'

Install SkyWalking Rover

SkyWalking Rover is deployed on every node in Kubernetes, and it automatically detects the services in the Kubernetes cluster. The network profiling feature was released in version 0.3.0 of SkyWalking Rover. When a network monitoring task is created, SkyWalking Rover sends the data to the SkyWalking backend.

kubectl apply -f https://raw.githubusercontent.com/mrproliu/skywalking-network-profiling-demo/main/resources/skywalking-rover.yaml

Start the Network Profiling Task

Once all deployments are completed, we must create a network profiling task for a specific instance of the service in the SkyWalking UI.

To open SkyWalking UI, run:

kubectl port-forward svc/skywalking-ui 8080:80 --namespace istio-system

Currently, we can select the specific instances that we wish to monitor by clicking the Data Plane item in the Service Mesh panel and the Service item in the Kubernetes panel.

In the figure below, we have selected an instance with a list of tasks in the network profiling tab. When we click the start button, the SkyWalking Rover starts monitoring this instance’s network.

Figure 5

Done!

After a few seconds, you will see the process topology appear on the right side of the page.

Figure 6

When you click on the line between processes, you can see the TCP metrics between the two processes.

Figure 7

Conclusion

In this article, we detailed a problem that makes troubleshooting service mesh architectures difficult: the lack of context between layers in the network stack. These are the cases where eBPF really begins to help with debugging and productivity when the existing service mesh and Envoy tooling cannot. Then, we researched how eBPF could be applied to common communication, such as TLS. Finally, we demonstrated the implementation of this process with SkyWalking Rover.

For now, we have completed the performance analysis for OSI layer 4 (mostly TCP). In the future, we will also introduce the analysis for OSI layer 7 protocols like HTTP.

Get Started with Istio

To get started with service mesh today, Tetrate Istio Distro is the easiest way to install, manage, and upgrade Istio. It provides a vetted upstream distribution of Istio that’s tested and optimized for specific platforms by Tetrate plus a CLI that facilitates acquiring, installing, and configuring multiple Istio versions. Tetrate Istio Distro also offers FIPS certified Istio builds for FedRAMP environments.

For enterprises that need a unified and consistent way to secure and manage services and traditional workloads across complex, heterogeneous deployment environments, we offer Tetrate Service Bridge, our flagship edge-to-workload application connectivity platform built on Istio and Envoy.

Blog: Pinpoint Service Mesh Critical Performance Impact by using eBPF

Tue, 05 Jul 2022 00:00:00 +0000

Background

Apache SkyWalking observes metrics, logs, traces, and events for services deployed into the service mesh. When troubleshooting, SkyWalking error analysis can be an invaluable tool helping to pinpoint where an error occurred. However, performance problems are more difficult: It’s often impossible to locate the root cause of performance problems with pre-existing observation data. To move beyond the status quo, dynamic debugging and troubleshooting are essential service performance tools. In this article, we’ll discuss how to use eBPF technology to improve the profiling feature in SkyWalking and analyze the performance impact in the service mesh.

Trace Profiling in SkyWalking

Since SkyWalking 7.0.0, Trace Profiling has helped developers find performance problems by periodically sampling the thread stack to let developers know which lines of code take more time. However, Trace Profiling is not suitable for the following scenarios:

Thread Model: Trace Profiling is most useful for profiling code that executes in a single thread. It is less useful for middleware that relies heavily on async execution models, for example Goroutines in Go or Kotlin Coroutines.
Language: Currently, Trace Profiling is only supported in Java and Python, since it’s not easy to obtain the thread stack in the runtimes of some languages such as Go and Node.js.
Agent Binding: Trace Profiling requires Agent installation, which can be tricky depending on the language (e.g., PHP has to rely on its C kernel; Rust and C/C++ require manual instrumentation to install).
Trace Correlation: Since Trace Profiling is only associated with a single request, it can be hard to determine which request is causing the problem.
Short Lifecycle Services: Trace Profiling doesn’t support short-lived services for (at least) two reasons:
1. It’s hard to differentiate system performance from class code manipulation in the booting stage.
2. Trace profiling is linked to an endpoint to identify performance impact, but there is no endpoint to match these short-lived services.

Fortunately, there are techniques that can go further than Trace Profiling in these situations.

Introduce eBPF

We have found that eBPF — a technology that can run sandboxed programs in an operating system kernel and thus safely and efficiently extend the capabilities of the kernel without requiring kernel modifications or loading kernel modules — can help us fill gaps left by Trace Profiling. eBPF is a trending technology because it breaks the traditional barrier between user and kernel space. Programs can now inject bytecode that runs in the kernel, instead of having to recompile the kernel to customize it. This is naturally a good fit for observability.

In the figure below, we can see that when the system executes the execve syscall, the eBPF program is triggered, and the current process runtime information is obtained by using function calls.

Using eBPF technology, we can expand the scope of SkyWalking’s profiling capabilities:

Global Performance Analysis: Before eBPF, data collection was limited to what agents can observe. Since eBPF programs run in the kernel, they can observe all threads. This is especially useful when you are not sure whether a performance problem is caused by a particular request.
Data Content: eBPF can dump both user and kernel space thread stacks, so if a performance issue happens in kernel space, it’s easier to find.
Agent Binding: All modern Linux kernels support eBPF, so there is no need to install anything. This makes it an orchestration-free model rather than an agent-based one, which reduces the friction caused by built-in software that may not have the correct agents installed, such as Envoy in a Service Mesh.
Sampling Type: Unlike Trace Profiling, eBPF is event-driven and, therefore, not constrained by interval polling. For example, eBPF can trigger events and collect more data depending on a transfer size threshold. This can allow the system to triage and prioritize data collection under extreme load.

eBPF Limitations

While eBPF offers significant advantages for hunting performance bottlenecks, no technology is perfect. eBPF has a number of limitations described below. Fortunately, since SkyWalking does not require eBPF, the impact is limited.

Linux Version Requirement: eBPF programs require a Linux kernel version above 4.4, with later kernel versions offering more data to be collected. The BCC project has documented the features supported by different Linux kernel versions; the differences between versions are usually in what data can be collected with eBPF.
Privileges Required: All processes that intend to load eBPF programs into the Linux kernel must be running in privileged mode. As such, bugs or other issues in such code may have a big impact.
Weak Support for Dynamic Language: eBPF has weak support for JIT-based dynamic languages, such as Java. It also depends on what data you want to collect. For Profiling, eBPF does not support parsing the symbols of the program, which is why most eBPF-based profiling technologies only support static languages like C, C++, Go, and Rust. However, symbol mapping can sometimes be solved through tools provided by the language. For example, in Java, perf-map-agent can be used to generate the symbol mapping. However, dynamic languages don’t support the attach (uprobe) functionality that would allow us to trace execution events through symbols.

Introducing SkyWalking Rover

SkyWalking Rover introduces the eBPF profiling feature into the SkyWalking ecosystem. The figure below shows the overall architecture of SkyWalking Rover. SkyWalking Rover is currently supported in Kubernetes environments and must be deployed inside a Kubernetes cluster. After establishing a connection with the SkyWalking backend server, it saves information about the processes on the current machine to SkyWalking. When the user creates an eBPF profiling task via the user interface, SkyWalking Rover receives the task and executes it in the relevant C, C++, Golang, and Rust language-based programs.

Other than an eBPF-capable kernel, there are no additional prerequisites for deploying SkyWalking Rover.

CPU Profiling with Rover

CPU profiling is the most intuitive way to show service performance. Inspired by Brendan Gregg’s blog post, we’ve divided CPU profiling into two types that we have implemented in Rover:

On-CPU Profiling: Where threads are spending time running on-CPU.
Off-CPU Profiling: Where time is spent waiting while blocked on I/O, locks, timers, paging/swapping, etc.

Profiling Envoy with eBPF

Envoy is a popular proxy, used as the data plane by the Istio service mesh. In a Kubernetes cluster, Istio injects Envoy into each service’s pod as a sidecar where it transparently intercepts and processes incoming and outgoing traffic. As the data plane, any performance issues in Envoy can affect all service traffic in the mesh. In this scenario, it’s more powerful to use eBPF profiling to analyze issues in production caused by service mesh configuration.

Demo Environment

If you want to see this scenario in action, we’ve built a demo environment where we deploy an Nginx service for stress testing. Traffic is intercepted by Envoy and forwarded to Nginx. The commands to install the whole environment can be accessed through GitHub.

On-CPU Profiling

On-CPU profiling is suitable for analyzing thread stacks when service CPU usage is high. The more often a stack appears in the dumped samples, the more CPU resources that thread stack occupies.
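The relationship between dump counts and CPU share can be sketched in a few lines of Python. The stack frames and the function name below are made up for illustration; the real aggregation happens inside Rover and the OAP.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggregate_stacks(samples: Iterable[Tuple[str, ...]]) -> List[Tuple[str, int, float]]:
    """Count identical sampled stacks and report each one's share of all samples."""
    # Fold each stack into a single "root;...;leaf" key, as flame graph tools do.
    counts = Counter(";".join(stack) for stack in samples)
    total = sum(counts.values())
    # Sort by how often each stack was dumped, most frequent first.
    return [(stack, n, 100.0 * n / total) for stack, n in counts.most_common()]


# Hypothetical samples: three dumps hit the encoder, one hit the logger.
samples = [("main", "handle", "encode")] * 3 + [("main", "handle", "log")]
for stack, n, percent in aggregate_stacks(samples):
    print(f"{percent:5.1f}%  {n}  {stack}")
```

The widths of the frames in a flame graph correspond to these percentages.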

When installing Istio using the demo configuration profile, we found there are two places where we can optimize performance:

Zipkin Tracing: Different Zipkin sampling percentages have a direct impact on QPS.
Access Log Format: Reducing the fields of the Envoy access log can improve QPS.

Zipkin Tracing

Zipkin with 100% sampling

In the default demo configuration profile, Envoy is using 100% sampling as default tracing policy. How does that impact the performance?

As shown in the figure below, using on-CPU profiling, we found that Zipkin tracing accounts for about 16% of the CPU overhead. At a fixed consumption of 2 CPUs, the QPS can reach 5.7K.

Disable Zipkin tracing

At this point, we found that if Zipkin is not necessary, the sampling percentage can be reduced or we can even disable tracing. Based on the Istio documentation, we can disable tracing when installing the service mesh using the following command:

istioctl install -y --set profile=demo \
   --set 'meshConfig.enableTracing=false' \
   --set 'meshConfig.defaultConfig.tracing.sampling=0.0'

After disabling tracing, we performed on-CPU profiling again. According to the figure below, we found that Zipkin has disappeared from the flame graph. With the same 2 CPU consumption as in the previous example, the QPS reached 9K, which is an almost 60% increase.

Tracing with Throughput

With the same CPU usage, we’ve discovered that Envoy performance greatly improves when the tracing feature is disabled. Of course, this requires us to make trade-offs between the number of samples Zipkin collects and the desired performance of Envoy (QPS).

The table below illustrates how different Zipkin sampling percentages under the same CPU usage affect QPS.

Zipkin sampling %	QPS	CPUs	Note
100% (default)	5.7K	2	16% used by Zipkin
1%	8.1K	2	0.3% used by Zipkin
disabled	9.2K	2	0% used by Zipkin
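As a quick sanity check on the table, the relative QPS gain over the 100% sampling baseline can be computed directly. This is plain arithmetic on the numbers above; the function name is ours.

```python
BASELINE_QPS = 5700.0  # QPS with 100% Zipkin sampling, from the table above


def qps_gain_percent(qps: float, baseline: float = BASELINE_QPS) -> float:
    """Relative QPS improvement over the baseline, in percent."""
    return (qps - baseline) / baseline * 100.0


for label, qps in (("1% sampling", 8100.0), ("tracing disabled", 9200.0)):
    print(f"{label}: +{qps_gain_percent(qps):.1f}% QPS")
```

With the table values, the gains are roughly 42% for 1% sampling and 61% with tracing disabled.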

Access Log Format

Default Log Format

In the default demo configuration profile, the default Access Log format contains a lot of data. The flame graph below shows various functions involved in parsing the data such as request headers, response headers, and streaming the body.

Simplifying Access Log Format

Typically, we don’t need all the information in the access log, so we can often simplify it to get what we need. The following command simplifies the access log format to only display basic information:

istioctl install -y --set profile=demo \
   --set meshConfig.accessLogFormat="[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE%\n"

After simplifying the access log format, we found that the QPS increased from 5.7K to 5.9K. When executing the on-CPU profiling again, the CPU usage of log formatting dropped from 2.4% to 0.7%.

Simplifying the log format helped us to improve the performance.

Off-CPU Profiling

Off-CPU profiling is suitable for performance issues that are not caused by high CPU usage. For example, when there are too many threads in one service, using off-CPU profiling could reveal which threads spend more time context switching.

We provide data aggregation in two dimensions:

Switch count: The number of times a thread switches context. When the thread returns to the CPU, it completes one context switch. A thread stack with a higher switch count spends more time context switching.
Switch duration: The time it takes a thread to switch the context. A thread stack with a higher switch duration spends more time off-CPU.
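A minimal sketch of this two-dimensional aggregation, assuming each off-CPU event carries the stack plus the timestamps at which the thread left and returned to the CPU. The event layout and names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_off_cpu(events: Iterable[Tuple[str, int, int]]) -> Dict[str, Dict[str, int]]:
    """Aggregate (stack, switched_out_ns, switched_in_ns) events per stack."""
    stats = defaultdict(lambda: {"switch_count": 0, "switch_duration_ns": 0})
    for stack, out_ns, in_ns in events:
        entry = stats[stack]
        # One completed context switch: the thread is back on the CPU.
        entry["switch_count"] += 1
        # Time spent off-CPU between being switched out and switched back in.
        entry["switch_duration_ns"] += in_ns - out_ns
    return dict(stats)
```

Sorting the result by either field gives the two views described above.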

Write Access Log

Enable Write

Using the same environment and settings as before in the on-CPU test, we performed off-CPU profiling. As shown below, we found that access log writes accounted for about 28% of the total context switches. The “__write” frame shown below also indicates that this is a Linux kernel method.

Disable Write

SkyWalking implements Envoy’s Access Log Service (ALS) feature, which allows us to send access logs to the SkyWalking Observability Analysis Platform (OAP) using the gRPC protocol. Even with access logging to file disabled, we can still use ALS to capture and aggregate the logs. We’ve disabled writing to the access log using the following command:

istioctl install -y --set profile=demo --set meshConfig.accessLogFile=""

After disabling the Access Log feature, we performed the off-CPU profiling. File writing entries have disappeared as shown in the figure below. Envoy throughput also increased from 5.7K to 5.9K.

Conclusion

In this article, we’ve examined the insights Apache SkyWalking’s Trace Profiling can give us and how much more can be achieved with eBPF profiling. All of these features are implemented in skywalking-rover. In addition to on- and off-CPU profiling, you will also find the following features:

Continuous profiling: Helps you automatically profile without manual intervention. For example, when Rover detects that the CPU exceeds a configurable threshold, it automatically executes the on-CPU profiling task.
More profiling types: Enrich usage scenarios, such as network and memory profiling.

Blog: Scaling with Apache SkyWalking

Mon, 24 Jan 2022 00:00:00 +0000

Background

In the Apache SkyWalking ecosystem, the OAP obtains metrics, traces, logs, and event data through SkyWalking Agent, Envoy, or other data sources. Under the gRPC protocol, each data source transmits data by communicating with a single server node. Only when the connection is broken is the reconnect policy, based on DNS round-robin mode, applied. When new services are added at runtime, or the OAP load stays high because of increased traffic from observed services, the OAP cluster needs to scale out. The load on a new OAP node would be lower, because all existing agents have already connected to the previous nodes. Even without scaling, the load on OAP nodes would be unbalanced, because each agent keeps the connection it picked at random at the booting stage. In these cases, it becomes a challenge to keep all nodes healthy and to scale out when needed.

In this article, we mainly discuss how to solve this challenge in SkyWalking.

How to Load Balance

SkyWalking mainly uses the gRPC protocol for data transmission, so this article mainly introduces load balancing in the gRPC protocol.

Proxy Or Client-side

Based on the gRPC official Load Balancing blog, there are two approaches to load balancing:

Client-side: The client perceives multiple back-end services and uses a load-balancing algorithm to select a back-end service for each RPC.
Proxy: The client sends the message to the proxy server, and the proxy server load balances the message to the back-end service.

From the perspective of observability system architecture:

	Pros	Cons
Client-side	High performance because of the elimination of extra hop	Complex client (cluster awareness, load balancing, health check, etc.); every data source to be connected must provide these complex client capabilities
Proxy	Simple Client	Higher latency

We chose Proxy mode for the following reasons:

Observability data is not very time-sensitive, so the small latency added by one extra hop is acceptable, and there is no impact on the client side.
As an observability platform, we cannot/should not ask clients to change. They make their own tech decisions and may have their own commercial considerations.

Transmission Policy

In the proxy mode, we should determine the transmission path between downstream and upstream.

Different data protocols require different processing policies. There are two transmission policies:

Synchronous: Suitable for protocols that require data exchange in the client, such as SkyWalking Dynamic Configuration Service. This type of protocol provides real-time results.
Asynchronous batch: Used when the client doesn’t care about the upstream processing results, but only the transmitted data (e.g., trace report, log report, etc.)

The synchronous policy requires that the proxy send the message to the upstream server when receiving the client message, and synchronously return the response data to the downstream client. Usually, only a few protocols need to use the synchronous policy.

As shown below, after the client sends the request to the proxy, the proxy sends the message to the server synchronously. When the proxy receives the result, it returns it to the client.

The asynchronous batch policy means that the data is sent to the upstream server in batches asynchronously. This policy is more common because most protocols in SkyWalking are primarily based on data reporting. We think using the queue as a buffer could have a good effect. The asynchronous batch policy is executed according to the following steps:

The proxy receives the data and wraps it as an Event object.
An event is added into the queue.
When the cycle time is reached or the number of queued elements reaches a fixed threshold, the elements in the queue are consumed in parallel and sent to the OAP.
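The steps above can be sketched as a small buffer that flushes either on size or on elapsed time. This is a simplified, single-threaded Python illustration with hypothetical names; it leaves out the parallel consumers and the actual sending, and it is not the SkyWalking Satellite implementation.

```python
import time
from typing import Any, Callable, List, Optional


class BatchQueue:
    """Buffers events and hands back a batch when it is time to send."""

    def __init__(self, max_size: int, max_wait_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size        # flush when this many events are queued
        self.max_wait_s = max_wait_s    # flush when this much time has passed
        self.clock = clock              # injectable so the timing can be tested
        self.events: List[Any] = []
        self.last_flush = clock()

    def push(self, event: Any) -> Optional[List[Any]]:
        """Queue one event; return a batch if the push triggered a flush."""
        self.events.append(event)
        return self.poll()

    def poll(self) -> Optional[List[Any]]:
        """Return the queued events if the size or time limit is reached."""
        now = self.clock()
        due = now - self.last_flush >= self.max_wait_s
        if len(self.events) >= self.max_size or due:
            batch, self.events = self.events, []
            self.last_flush = now
            return batch or None  # nothing to send if the queue was empty
        return None
```

A sender loop would call `poll()` periodically and forward every non-empty batch to the upstream OAP.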

The advantages of using a queue are:

Data receiving and sending are separated, which reduces their mutual influence.
The interval quantization mechanism can be used to combine events, which helps to speed up sending events to the OAP.
Consuming queue events with multiple threads makes fuller use of network IO.

As shown below, after the proxy receives the message, the proxy would wrap the message as an event and push it to the queue. The message sender would take batch events from the queue and send them to the upstream OAP.

Routing

Routing algorithms are used to route messages to a single upstream server node.

The Round-Robin algorithm selects nodes in order from the list of upstream service nodes. The advantage of this algorithm is that each node is selected an equal number of times. When data items are of similar size, each upstream node handles about the same quantity of data.

With the Weight Round-Robin algorithm, each upstream server node has a corresponding routing weight. The difference from Round-Robin is that each upstream node is selected in proportion to its weight. This algorithm is more suitable when the upstream server nodes do not have the same machine configuration.
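Both algorithms fit in a few lines of Python. The node names and weights below are invented, and the weighted variant uses the simplest possible approach (repeating each node according to its weight), which is an assumption for illustration rather than the scheme any particular proxy uses.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin(nodes: List[str]) -> Iterator[str]:
    """Yield nodes in order, wrapping around forever."""
    return cycle(nodes)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Yield each node as many times per cycle as its integer weight."""
    expanded = [node for node, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


rr = round_robin(["oap-0", "oap-1", "oap-2"])
print([next(rr) for _ in range(4)])

# A node with twice the weight receives twice as many messages per cycle.
wrr = weighted_round_robin({"big-node": 2, "small-node": 1})
print([next(wrr) for _ in range(6)])
```

Production implementations usually interleave the weighted picks more smoothly, but the per-cycle proportions are the same.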

The Fixed algorithm is a hybrid algorithm. It ensures that the same data is routed to the same upstream server node, and when the upstream servers scale out, it still routes to the same node; it reroutes only when that upstream node no longer exists. This algorithm is mainly used in the SkyWalking Meter protocol because this protocol needs to ensure that the metrics of the same service instance are sent to the same OAP node. The routing steps are as follows:

Generate a unique identification string based on the data content, as short as possible, so that the amount of data is controllable.
Look up the upstream node for the identification in the LRU cache, and use it if it exists.
Otherwise, generate the hash value of the identification and use it to find the upstream server node in the upstream list.
Save the mapping between the identification and the upstream server node to the LRU cache.
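The routing steps can be sketched as follows. This is a minimal Python illustration under several assumptions: the class name, the SHA-256 hash, the cache size, and the identity format are all hypothetical, and the upstream list is assumed to be non-empty.

```python
import hashlib
from collections import OrderedDict
from typing import List


class FixedRouter:
    """Routes the same identity to the same upstream node while it exists."""

    def __init__(self, upstreams: List[str], cache_size: int = 1024) -> None:
        self.upstreams = list(upstreams)
        self.cache_size = cache_size
        self.cache: "OrderedDict[str, str]" = OrderedDict()  # identity -> node

    def route(self, identity: str) -> str:
        # Step 2: reuse the cached node if it still exists upstream.
        node = self.cache.get(identity)
        if node is not None and node in self.upstreams:
            self.cache.move_to_end(identity)  # mark as most recently used
            return node
        # Step 3: hash the identity and take the remainder to pick a node.
        digest = hashlib.sha256(identity.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.upstreams)
        node = self.upstreams[index]
        # Step 4: remember the mapping and evict the least recently used entry.
        self.cache[identity] = node
        self.cache.move_to_end(identity)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)
        return node
```

Because the cache is consulted first, adding a node to `upstreams` does not move identities that are already mapped; only an identity whose node has disappeared is hashed again.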

The advantage of this algorithm is to bind the data with the upstream server node as much as possible, so the upstream server can better process continuous data. The disadvantage is that it takes up a certain amount of memory space to save the corresponding relationship.

As shown below, the image is divided into two parts:

The left side shows that the same data content is always routed to the same server node.
The right side represents the data routing algorithm. Get the number from the data, and use the remainder algorithm to obtain the position.

We chose to use a combination of the Round-Robin and Fixed algorithms for routing:

The Fixed routing algorithm is suitable for specific protocols, and is mainly used when passing metrics data over the SkyWalking Meter protocol.
The Round-Robin algorithm is used by default. When the SkyWalking OAP cluster is deployed, the nodes should be configured as identically as possible, so there is no need to use the Weight Round-Robin algorithm.

How to balance the load balancer itself?

The proxy still needs to deal with load balancing from the clients to itself, especially when a proxy cluster is deployed in a production environment.

There are three ways to solve this problem:

Connection management: Use the max_connection config on the client-side to specify the maximum connection duration of each connection. For more information, please read the proposal.
Cluster awareness: The proxy has cluster awareness, and actively disconnects the connection when the load is unbalanced to allow the client to re-pick up the proxy.
Resource limit+HPA: Limit the connection resources of each proxy, and stop accepting new connections when the limit is reached. Then use the HPA mechanism of Kubernetes to dynamically scale out the number of proxies.

	Connection management	Cluster awareness	Resource Limit+HPA
Pros	Simple to use	Ensure that the number of connections in each proxy is relatively balanced	Simple to use
Cons	Each client needs to ensure that data is not lost; the client is required to accept GOAWAY responses	May cause a sudden increase in traffic on some nodes; each client needs to ensure that data is not lost	Traffic will not be particularly balanced in each instance

We chose Resource Limit+HPA for these reasons:

The proxy is easy to configure and use, and its behavior is easy to understand from basic metrics.
No data loss due to broken connection. There is no need for the client to implement any other protocols to prevent data loss, especially when the client is a commercial product.
The connection of each node in the proxy cluster does not need to be particularly balanced, as long as the proxy node itself is high-performance.

SkyWalking-Satellite

We have implemented this Proxy in the SkyWalking-Satellite project. It’s used between Client and SkyWalking OAP, effectively solving the load balancing problem.

After the system is deployed, the Satellite would accept the traffic from the Client, and the Satellite will perceive all the nodes of the OAP through Kubernetes Label Selector or manual configuration, and load balance the traffic to the upstream OAP node.

As shown below, a single client still maintains a connection with a single Satellite, Satellite would establish the connection with each OAP, and load balance message to the OAP node.

When scaling Satellite, we need to deploy the SWCK adapter and configure the HPA in Kubernetes. SWCK is a platform for SkyWalking users that provisions, upgrades, and maintains SkyWalking-related components, and makes them work natively on Kubernetes.

After deployment is finished, the following steps would be performed:

Read metrics from OAP: HPA requests the SWCK metrics adapter to dynamically read the metrics in the OAP.
Scaling the Satellite: When Kubernetes HPA senses that the metric values meet the configured threshold, the Satellite is scaled automatically.

As shown below, the dotted line divides the two parts. HPA uses the SWCK Adapter to read the metrics in the OAP. When the threshold is met, HPA scales the Satellite deployment.

Example

In this section, we will demonstrate two cases:

SkyWalking Scaling: After SkyWalking OAP is scaled, the traffic is automatically load balanced through Satellite.
Satellite Scaling: Satellite’s own traffic load balancing.

NOTE: All commands could be accessed through GitHub.

SkyWalking Scaling

We will use the bookinfo application to demonstrate how to integrate Apache SkyWalking 8.9.1 with Apache SkyWalking-Satellite 0.5.0, and observe the service mesh through the Envoy ALS protocol.

Before starting, please make sure that you already have a Kubernetes environment.

Install Istio

Istio provides a very convenient way to configure the Envoy proxy and enable the access log service. The following steps are performed:

Install istioctl locally to help manage the Istio mesh.
Install Istio into the Kubernetes environment with the demo configuration profile, enable the Envoy ALS, and transmit the ALS messages to the Satellite, which we will deploy later.
Add the label into the default namespace so Istio could automatically inject Envoy sidecar proxies when you deploy your application later.

# install istioctl
export ISTIO_VERSION=1.12.0
curl -L https://istio.io/downloadIstio | sh - 
sudo mv $PWD/istio-$ISTIO_VERSION/bin/istioctl /usr/local/bin/

# install istio
istioctl install -y --set profile=demo \
	--set meshConfig.enableEnvoyAccessLogService=true \
	--set meshConfig.defaultConfig.envoyAccessLogService.address=skywalking-system-satellite.skywalking-system:11800

# enable envoy proxy in default namespace
kubectl label namespace default istio-injection=enabled

Install SWCK

SWCK provides convenience for users to deploy and upgrade SkyWalking related components based on Kubernetes. The automatic scale function of Satellite also mainly relies on SWCK. For more information, you could refer to the official documentation.

# Install cert-manager
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.3.1/cert-manager.yaml

# Deploy SWCK
mkdir -p skywalking-swck && cd skywalking-swck
wget https://dlcdn.apache.org/skywalking/swck/0.6.1/skywalking-swck-0.6.1-bin.tgz
tar -zxvf skywalking-swck-0.6.1-bin.tgz
cd config
kubectl apply -f operator-bundle.yaml

Deploy Apache SkyWalking And Apache SkyWalking-Satellite

We have provided a simple script to deploy the SkyWalking OAP, UI, and Satellite.

# Create the skywalking components namespace
kubectl create namespace skywalking-system
kubectl label namespace skywalking-system swck-injection=enabled
# Deploy components
kubectl apply -f https://raw.githubusercontent.com/mrproliu/sw-satellite-demo-scripts/5821a909b647f7c8f99c70378e197630836f45f7/resources/sw-components.yaml

Deploy Bookinfo Application

export ISTIO_VERSION=1.12.0
kubectl apply -f https://raw.githubusercontent.com/istio/istio/$ISTIO_VERSION/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl wait --for=condition=Ready pods --all --timeout=1200s
kubectl port-forward service/productpage 9080

Next, please open your browser and visit http://localhost:9080. You should be able to see the Bookinfo application. Refresh the webpage several times to generate enough access logs.

Then, you can see the topology and metrics of the Bookinfo application on SkyWalking WebUI. At this time, you can see that the Satellite is working!

Deploy Monitor

We need to install OpenTelemetry Collector to collect metrics in OAPs and analyze them.

# Add OTEL collector
kubectl apply -f https://raw.githubusercontent.com/mrproliu/sw-satellite-demo-scripts/5821a909b647f7c8f99c70378e197630836f45f7/resources/otel-collector-oap.yaml

kubectl port-forward -n skywalking-system  service/skywalking-system-ui 8080:80

Next, please open your browser and visit http://localhost:8080/ and create a new item on the dashboard. The SkyWalking Web UI pictured below shows how the data content is applied.

Scaling OAP

Scale the number of OAPs through the deployment.

kubectl scale --replicas=3 -n skywalking-system deployment/skywalking-system-oap

Done!

After a period of time, you will see that the number of OAPs becomes 3, and the ALS traffic is balanced to each OAP.

Satellite Scaling

Now that the SkyWalking Scaling demo is complete, we will carry out the Satellite Scaling demo.

Deploy SWCK HPA

SWCK provides an adapter that implements the Kubernetes external metrics API, so that the HPA can read the metrics in SkyWalking OAP. We expose the metrics service in Satellite to OAP and configure an HPA resource to auto-scale the Satellite.

Install the SWCK adapter into the Kubernetes environment:

kubectl apply -f skywalking-swck/config/adapter-bundle.yaml

Create the HPA resource, and limit each Satellite to handle a maximum of 10 connections:

kubectl apply -f https://raw.githubusercontent.com/mrproliu/sw-satellite-demo-scripts/5821a909b647f7c8f99c70378e197630836f45f7/resources/satellite-hpa.yaml

Then, you can see that we have 9 connections in one satellite. One Envoy proxy may establish multiple connections to the satellite.

$ kubectl get HorizontalPodAutoscaler -n skywalking-system
NAME       REFERENCE                                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo   Deployment/skywalking-system-satellite   9/10      1         3         1          5m18s

Scaling Application

Scaling the application establishes more connections to the satellite, which lets us verify whether the HPA is in effect.

kubectl scale --replicas=3 deployment/productpage-v1 deployment/details-v1

Done!

By default, Satellite is deployed as a single instance, and a single instance will only accept 11 connections. The HPA resource limits one Satellite to 10 connections and uses a stabilization window to make Satellite scale up stably. In this case, the Bookinfo application runs 10+ instances after scaling, which means that 10+ connections will be established to the Satellite.

So after the HPA resource is running, the Satellite is automatically scaled up to 2 instances. You can learn about the calculation algorithm of replicas through the official documentation. Run the following command to view the running status:

$ kubectl get HorizontalPodAutoscaler -n skywalking-system --watch
NAME       REFERENCE                                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo   Deployment/skywalking-system-satellite   11/10     1         3         1          3m31s
hpa-demo   Deployment/skywalking-system-satellite   11/10     1         3         1          4m20s
hpa-demo   Deployment/skywalking-system-satellite   11/10     1         3         2          4m38s
hpa-demo   Deployment/skywalking-system-satellite   11/10     1         3         2          5m8s
hpa-demo   Deployment/skywalking-system-satellite   6/10      1         3         2          5m23s
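The replica counts in this output are consistent with the basic formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below is only that formula plus clamping to the min/max pod counts; the real controller also applies a tolerance and the stabilization window mentioned above, which are deliberately left out. The function name `desired_replicas` is illustrative.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas):
    """Basic HPA formula, clamped to [min_replicas, max_replicas].

    Simplification: tolerance and stabilization windows are ignored.
    """
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 11/10 with 1 replica, as in the TARGETS column above.
    print(desired_replicas(1, 11, 10, min_replicas=1, max_replicas=3))
    # 6/10 with 2 replicas: the deployment stays at 2.
    print(desired_replicas(2, 6, 10, min_replicas=1, max_replicas=3))
```

With an average of 11 connections against a target of 10 on a single replica, the formula yields 2 replicas, which matches the transition shown in the watch output.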

By observing the “number of connections” metric, we can see that when the number of gRPC connections per Satellite exceeds 10, the Satellite automatically scales out through the HPA rule. As a result, the connection count returns to a normal level (in this example, less than 10).

swctl metrics linear --name satellite_service_grpc_connect_count --service-name satellite::satellite-service

Blog: SkyWalking Python Agent Supports Profiling Now

Sun, 12 Sep 2021 00:00:00 +0000

The Java Agent of Apache SkyWalking has supported profiling since v7.0.0, and it enables users to troubleshoot the root cause of performance issues, and now we bring it into Python Agent. In this blog, we will show you how to use it, and we will introduce the mechanism of profiling.

How to use profiling in Python Agent

This feature is released in Python Agent at v0.7.0. It is turned on by default, so you don’t need any extra configuration to use it. You can find the environment variables about it here.

Here is the demo code of an intentionally slow application.

import time

def method1():
    time.sleep(0.02)
    return '1'

def method2():
    time.sleep(0.02)
    return method1()

def method3():
    time.sleep(0.02)
    return method2()

if __name__ == '__main__':
    import socketserver
    from http.server import BaseHTTPRequestHandler

    class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

        def do_POST(self):
            method3()
            time.sleep(0.5)
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write('{"song": "Despacito", "artist": "Luis Fonsi"}'.encode('ascii'))

    PORT = 19090
    Handler = SimpleHTTPRequestHandler

    with socketserver.TCPServer(("", PORT), Handler) as httpd:
        httpd.serve_forever()

We can now start it with the SkyWalking Python Agent CLI without changing any application code, which is also a new feature of v0.7.0. We just need to add sw-python run before our start command (i.e. sw-python run python3 main.py) to start the application with the Python agent attached. More information about sw-python can be found there.

Then, we should add a new profile task for the / endpoint from the SkyWalking UI, as shown below.

We can access it with curl -X POST http://localhost:19090/. After that, we can view the result of this profile task on the SkyWalking UI.

The mechanism of profiling

When a request lands on an application with the profile function enabled, the agent begins profiling automatically if the request’s URI matches the one required by the profiling task. A new thread is spawned to fetch the thread dump periodically until the end of the request.

The agent sends these thread dumps, called ThreadSnapshot, to SkyWalking OAPServer, and the OAPServer analyzes those ThreadSnapshot(s) and gets the final result. It will take a method invocation with the same stack depth and code signature as the same operation, and estimate the execution time of each method from this.
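To make the sampling idea concrete, here is a minimal sketch of a sampler thread that periodically captures another thread's call stack. It is not the agent's actual implementation: it relies on CPython's `sys._current_frames()`, and the names `sample_thread`, `profile`, and `slow_handler` are hypothetical.

```python
import sys
import threading
import time
import traceback


def sample_thread(thread_id, interval, stop_event, snapshots):
    """Capture the target thread's call stack every `interval` seconds."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            # Keep function names only, outermost call first.
            snapshots.append([f.name for f in traceback.extract_stack(frame)])
        time.sleep(interval)


def profile(func, interval=0.01):
    """Run `func` in the current thread while a sampler thread dumps its stack."""
    snapshots = []
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), interval, stop_event, snapshots),
        daemon=True,
    )
    sampler.start()
    try:
        func()
    finally:
        # Stop sampling at the end of the "request".
        stop_event.set()
        sampler.join()
    return snapshots


def slow_handler():
    time.sleep(0.1)


if __name__ == "__main__":
    dumps = profile(slow_handler)
    hits = sum("slow_handler" in stack for stack in dumps)
    print(f"{len(dumps)} snapshots, {hits} of them inside slow_handler")
```

The exact number of snapshots depends on scheduling, so the sketch prints counts rather than asserting a fixed value.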

Let’s demonstrate how this analysis works through the following example. Suppose we have the program below and we profile it at 10ms intervals.

import time

def main():
    methodA()

def methodA():
    methodB()

def methodB():
    methodC()
    methodD()

def methodC():
    time.sleep(0.04)

def methodD():
    time.sleep(0.06)

The agent collects a total of 10 ThreadSnapshot(s) over the entire time period (Diagram A). The first 4 snapshots represent the thread dumps during the execution of function C, and the last 6 snapshots represent the thread dumps during the execution of function D. After the analysis by the OAPServer, we can see the result of this profile task on the SkyWalking Rocketbot UI, as shown on the right of the diagram. With this result, we can clearly see the function call relationships and the time consumption of this program.

Diagram A
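The analysis step for this example can be approximated in a few lines. This is a simplified sketch, not the OAPServer's code: it treats frames with the same stack depth and name as the same operation and estimates each duration as the number of snapshots containing it multiplied by the sampling interval. The function name `estimate_durations` is hypothetical.

```python
from collections import Counter


def estimate_durations(snapshots, interval_ms):
    """Estimate per-operation time as (snapshot count) * (sampling interval).

    An operation is identified by (stack depth, function name).
    """
    counts = Counter()
    for stack in snapshots:
        for depth, name in enumerate(stack):
            counts[(depth, name)] += 1
    return {key: n * interval_ms for key, n in counts.items()}


if __name__ == "__main__":
    # 10 snapshots at 10 ms: 4 taken inside methodC, then 6 inside methodD.
    in_c = ["main", "methodA", "methodB", "methodC"]
    in_d = ["main", "methodA", "methodB", "methodD"]
    snapshots = [in_c] * 4 + [in_d] * 6
    for (depth, name), ms in sorted(estimate_durations(snapshots, 10).items()):
        print("  " * depth + f"{name}: ~{ms} ms")
```

For the example program this gives roughly 100 ms for main, methodA, and methodB, 40 ms for methodC, and 60 ms for methodD.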

You can read more details of profiling theory from this blog.

We hope you enjoy the profile in the Python Agent, and if so, you can give us a star on Python Agent and SkyWalking on GitHub.

Blog: End-User Tracing in a SkyWalking-Observed Browser

Thu, 25 Mar 2021 00:00:00 +0000

Origin: End-User Tracing in a SkyWalking-Observed Browser - The New Stack

Apache SkyWalking: an APM (application performance monitor) system, especially designed for microservices, cloud native, and container-based (Docker, Kubernetes, Mesos) architectures.

skywalking-client-js: a lightweight client-side JavaScript exception, performance, and tracing library. It provides metrics and error collection to the SkyWalking backend. It also makes the browser the starting point for distributed tracing.

Background

Web application performance affects the retention rate of users. If a page load time is too long, the user will give up. So we need to monitor the web application to understand performance and ensure that servers are stable, available and healthy. SkyWalking is an APM tool and the skywalking-client-js extends its monitoring to include the browser, providing performance metrics and error collection to the SkyWalking backend.

Performance Metrics

The skywalking-client-js uses window.performance (https://developer.mozilla.org/en-US/docs/Web/API/Window/performance) for performance data collection. From the MDN doc, the performance interface provides access to performance-related information for the current page. It’s part of the High Resolution Time API, but is enhanced by the Performance Timeline API, the Navigation Timing API, the User Timing API, and the Resource Timing API. In skywalking-client-js, all performance metrics are calculated according to the Navigation Timing API defined in the W3C specification. We can get a PerformanceTiming object describing our page using the window.performance.timing property. The PerformanceTiming interface contains properties that offer performance timing information for various events that occur during the loading and use of the current page.

We can better understand these attributes when we see them together in the figure below from W3C:

The following table contains performance metrics in skywalking-client-js.

Metrics Name	Description	Calculation Formula	Note
redirectTime	Page redirection time	redirectEnd - redirectStart	If the current document and the document that is redirected to are not from the same origin, set redirectStart, redirectEnd to 0
ttfbTime	Time to First Byte	responseStart - requestStart	According to Google Development
dnsTime	Time to DNS query	domainLookupEnd - domainLookupStart
tcpTime	Time to TCP link	connectEnd - connectStart
transTime	Time to content transfer	responseEnd - responseStart
sslTime	Time to SSL secure connection	connectEnd - secureConnectionStart	Only supports HTTPS
resTime	Time to resource loading	loadEventStart - domContentLoadedEventEnd	Represents a synchronized load resource in pages
fmpTime	Time to First Meaningful Paint	-	Listen for changes in page elements. Traverse each new element, and calculate the total score of these elements. If the element is visible, the score is 1 * weight; if the element is not visible, the score is 0
domAnalysisTime	Time to DOM analysis	domInteractive - responseEnd
fptTime	First Paint Time	responseEnd - fetchStart
domReadyTime	Time to DOM ready	domContentLoadedEventEnd - fetchStart
loadPageTime	Page full load time	loadEventStart - fetchStart
ttlTime	Time to interact	domInteractive - fetchStart
firstPackTime	Time to first package	responseStart - domainLookupStart
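As a worked example of the formulas in the table, the sketch below applies them to one set of timing values. It is written in Python purely for illustration (the real library runs in the browser), the sample timestamps are made up, and `FORMULAS` and `compute_metrics` are hypothetical names. sslTime and fmpTime are omitted because they need HTTPS-only data and DOM observation respectively.

```python
# Each metric is (end attribute) - (start attribute), per the table above.
FORMULAS = {
    "redirectTime": ("redirectEnd", "redirectStart"),
    "ttfbTime": ("responseStart", "requestStart"),
    "dnsTime": ("domainLookupEnd", "domainLookupStart"),
    "tcpTime": ("connectEnd", "connectStart"),
    "transTime": ("responseEnd", "responseStart"),
    "resTime": ("loadEventStart", "domContentLoadedEventEnd"),
    "domAnalysisTime": ("domInteractive", "responseEnd"),
    "fptTime": ("responseEnd", "fetchStart"),
    "domReadyTime": ("domContentLoadedEventEnd", "fetchStart"),
    "loadPageTime": ("loadEventStart", "fetchStart"),
    "ttlTime": ("domInteractive", "fetchStart"),
    "firstPackTime": ("responseStart", "domainLookupStart"),
}


def compute_metrics(timing):
    """Apply every formula to a dict of PerformanceTiming-style values (ms)."""
    return {
        name: timing[end] - timing[start]
        for name, (end, start) in FORMULAS.items()
    }


if __name__ == "__main__":
    # Made-up timestamps in milliseconds, relative to fetchStart.
    sample = {
        "fetchStart": 0, "redirectStart": 0, "redirectEnd": 0,
        "domainLookupStart": 5, "domainLookupEnd": 25,
        "connectStart": 25, "connectEnd": 60,
        "requestStart": 62, "responseStart": 150, "responseEnd": 180,
        "domInteractive": 400, "domContentLoadedEventEnd": 450,
        "loadEventStart": 700,
    }
    for name, value in compute_metrics(sample).items():
        print(f"{name}: {value} ms")
```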

skywalking-client-js collects those performance metrics and sends them to the OAP (Observability Analysis Platform) server, which aggregates the data on the back-end side so that it can be shown in visualizations on the UI side. Users can optimize the page according to these data.

Exception Metrics

There are five kinds of errors that can be caught in skywalking-client-js:

The resource loading error is captured by window.addEventListener('error', callback, true)
window.onerror catches JS execution errors
window.addEventListener('unhandledrejection', callback) is used to catch the promise errors
the Vue errors are captured by Vue.config.errorHandler
the Ajax errors are captured by addEventListener('error', callback); addEventListener('abort', callback); addEventListener('timeout', callback); in send callback.

skywalking-client-js reports error data to the OAP server, and the data is finally visualized on the UI side. For an error overview of the App, there are several metrics for basic statistics and trends of errors, including the following metrics.

App Error Count, the total number of errors in the selected time period.
App JS Error Rate, the proportion of PV with JS errors in a selected time period to total PV.
All of Apps Error Count, Top N Apps error count ranking.
All of Apps JS Error Rate, Top N Apps JS error rate ranking.
Error Count of Versions in the Selected App, Top N Error Count of Versions in the Selected App ranking.
Error Rate of Versions in the Selected App, Top N JS Error Rate of Versions in the Selected App ranking.
Error Count of the Selected App, Top N Error Count of the Selected App ranking.
Error Rate of the Selected App, Top N JS Error Rate of the Selected App ranking.

For pages, we use several metrics for basic statistics and trends of errors, including the following metrics:

Top Unstable Pages / Error Rate, Top N Error Rate pages of the Selected version ranking.
Top Unstable Pages / Error Count, Top N Error Count pages of the Selected version ranking.
Page Error Count Layout, data display of different errors in a period of time.

User Metrics

SkyWalking browser monitoring also provides metrics about how the visitors use the monitored websites, such as PV(page views), UV(unique visitors), top N PV(page views), etc.

In SPAs (single page applications), the page will be refreshed only once. The traditional method only reports PV once after the page loading, but cannot count the PV of each sub-page, and can’t make other types of logs aggregate by sub-page.

SkyWalking browser monitoring provides two processing methods for SPA pages:

Enable SPA automatic parsing. This method is suitable for most single page application scenarios that use the URL hash as the route. In the initialized configuration item, set enableSPA to true, which will turn on the page’s hashchange event listener (triggering PV re-reporting), and use the URL hash as the page field in other data reporting.
Manual reporting. This method can be used in all single page application scenarios, and is the fallback when the first method is not usable. The following example manually updates the page name when data is reported. When this method is called, the page PV will be re-reported by default:

app.on('routeChange', function (to) {
  ClientMonitor.setPerformance({
    collector: 'http://127.0.0.1:8080',
    service: 'browser-app',
    serviceVersion: '1.0.0',
    pagePath: to.path,
    autoTracePerf: true,
    enableSPA: true,
  });
});

Let’s take a look at the result found in the following image. It shows the most popular applications and versions, and the changes of PV over a period of time.

Make the browser the starting point for distributed tracing

SkyWalking browser monitoring intercepts HTTP requests to trace segments and spans. It supports tracking the following modes of HTTP requests: XMLHttpRequest and fetch. It also supports tracking libraries and tools based on XMLHttpRequest and fetch, such as Axios, SuperAgent, OpenApi, and so on.

Let’s see how the SkyWalking browser monitoring intercepts HTTP requests:

After this, use window.addEventListener('xhrReadyStateChange', callback) and set the readyState value to sw8 = xxxx in the request header. At the same time, the request information is reported to the back-end side. Finally, we can view trace data on the trace page. The following graphic is from the trace page:

To see how we listen for fetch requests, let’s see the source code of fetch

As you can see, it creates a promise and a new XMLHttpRequest object. Because the code of fetch is built into the browser, it is already in place before the monitoring code executes. Therefore, when we add listening events, we can’t monitor the code in the fetch. So, just after the monitoring code executes, let’s rewrite the fetch:

import { fetch } from 'whatwg-fetch'; window.fetch = fetch;

In this way, we can intercept the fetch request through the above method.

Additional Resources

End-User Tracing in a SkyWalking-Observed Browser.

Blog: Apache SkyWalking: Use Profiling to Fix the Blind Spot of Distributed Tracing

Mon, 13 Apr 2020 00:00:00 +0000

This post originally appears on The New Stack

This post introduces a way to automatically profile code in production with Apache SkyWalking. We believe the profile method helps reduce maintenance and overhead while increasing the precision in root cause analysis.

Limitations of the Distributed Tracing

In the early days, metrics and logging systems were the key solutions in monitoring platforms. With the adoption of microservice and distributed system-based architecture, distributed tracing has become more important. Distributed tracing provides relevant service context, such as system topology map and RPC parent-child relationships.

Some claim that distributed tracing is the best way to discover the cause of performance issues in a distributed system. It’s good at finding issues at the RPC abstraction, or in the scope of components instrumented with spans. However, it isn’t that perfect.

Have you been surprised to find a span duration longer than expected, but no insight into why? What should you do next? Some may think that the next step is to add more instrumentation, more spans into the trace, thinking that you would eventually find the root cause, with more data points. We’ll argue this is not a good option within a production environment. Here’s why:

There is a risk of application overhead and system overload. Ad-hoc spans measure the performance of specific scopes or methods, but picking the right place can be difficult. To identify the precise cause, you can “instrument” (add spans to) many suspicious places. The additional instrumentation costs more CPU and memory in the production environment. Next, ad-hoc instrumentation that didn’t help is often forgotten, not deleted. This creates a valueless overhead load. In the worst case, excess instrumentation can cause performance problems in the production app or overload the tracing system.
The process of ad-hoc (manual) instrumentation usually implies at least a restart. Trace instrumentation libraries, like Zipkin Brave, are integrated into many framework libraries. To instrument a method’s performance typically implies changing code, even if only an annotation. This implies a re-deploy. Even if you have the way to do auto instrumentation, like Apache SkyWalking, you still need to change the configuration and reboot the app. Otherwise, you take the risk of GC caused by hot dynamic instrumentation.
Injecting instrumentation into an uninstrumented third party library is hard and complex. It takes more time and many won’t know how to do this.
Usually, we don’t have code line numbers in the distributed tracing. Particularly when lambdas are in use, it can be difficult to identify the line of code associated with a span.

Regardless of the above choices, to dive deeper requires collaboration with your Ops or SRE team, and a shared deep level of knowledge in distributed tracing.

Profiling in Production

Introduction

To reuse distributed tracing to achieve method scope precision requires an understanding of the above limitations and a different approach. We called it PROFILE.

Most high-level languages build and run on a thread concept. The profile approach takes continuous thread dumps. We merge the thread dumps to estimate the execution time of every method shown in the thread dumps. The key for distributed tracing is the tracing context, identifiers active (or current) for the profiled method. Using this trace context, we can weave data harvested from profiling into existing traces. This allows the system to automate otherwise ad-hoc instrumentation. Let’s dig deeper into how profiling works:

We consider method invocations with the same stack depth and signature (method, line number, etc.) to be the same operation. We derive span timestamps from the thread dumps in which the same operation appears. Let’s put this visually:

The figure above represents 10 successive thread dumps. If this method is in dumps 4-8, we assume it started before dump 4 and finished after dump 8. We can’t tell exactly when the method started and stopped, but the timestamps of thread dumps are close enough.
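The uncertainty described here can be quantified. The sketch below assumes evenly spaced dumps and a 10 ms interval (the interval is an assumption for this example); `duration_bounds` is a hypothetical helper, not part of SkyWalking. A method seen in dumps 4 through 8 must have started after dump 3 and ended before dump 9.

```python
def duration_bounds(first_dump, last_dump, interval_ms):
    """Bounds on the duration of a method seen in dumps first..last (inclusive).

    Lower bound: time between the first and last dump that contain it.
    Upper bound: it started just after the previous dump and ended just
    before the next one, adding up to one interval on each side.
    """
    lower = (last_dump - first_dump) * interval_ms
    upper = (last_dump - first_dump + 2) * interval_ms
    return lower, upper


if __name__ == "__main__":
    lower, upper = duration_bounds(4, 8, interval_ms=10)
    print(f"duration is between {lower} ms and {upper} ms")
```

For dumps 4-8 at 10 ms this gives a range of 40 ms to 60 ms, so the estimate is never off by more than two sampling intervals.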

To reduce overhead caused by thread dumps, we only profile methods enclosed by a specific entry point, such as a URI or MVC Controller method. We identify these entry points through the trace context and the APM system.

The profile does thread dump analysis and gives us:

The root cause, precise to the line number in the code.
Reduced maintenance as ad-hoc instrumentation is obviated.
Reduced overload risk caused by ad-hoc instrumentation.
Dynamic activation: only when necessary and with a very clear profile target.

Implementing Precise Profiling with Apache SkyWalking 7

Distributed profiling is built into Apache SkyWalking application performance monitoring (APM). Let’s demonstrate how the profiling approach locates the root cause of the performance issue.

final CountDownLatch countDownLatch = new CountDownLatch(2);

threadPool.submit(new Task1(countDownLatch));
threadPool.submit(new Task2(countDownLatch));

try {
   // Wait up to 500ms for both tasks to count down.
   countDownLatch.await(500, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
   // Intentionally ignored in this demo.
}

Task1 and Task2 have a race condition and unstable execution time: they will impact the performance of each other and anything calling them. While this code looks suspicious, it is representative of real life. People in the OPS/SRE team are not usually aware of all code changes and who did them. They only know something in the new code is causing a problem.

To make matters interesting, the above code is not always slow: it only happens when the condition is locked. In SkyWalking APM, we have metrics of endpoint p99/p95 latency, so it is easy to find out that the p99 of this endpoint is far from the avg response time. However, this is not the same as understanding the cause of the latency. To locate the root cause, add a profile condition to this endpoint: duration greater than 500ms. This means faster executions will not add profiling load.

This is a typical profiled trace segment (part of the whole distributed trace) shown on the SkyWalking UI. We now notice the “service/processWithThreadPool” span is slow as we expected, but why? This method is the one we added the faulty code to. As the UI shows that method, we know the profiler is working. Now, let’s see what the profile analysis result says.

This is the profile analysis stack view. We see the stack element names, duration (include/exclude the children) and slowest methods have been highlighted. It shows clearly, “sun.misc.Unsafe.park” costs the most time. If we look for the caller, it is the code we added: CountDownLatch.await.

The Limitations of the Profile Method

No diagnostic tool can fit all cases, not even the profile method.

The first consideration is mistaking a repeatedly called method for a slow method. Thread dumps are periodic. If there is a loop of calling one method, the profile analysis result would say the target method is slow because it is captured every time in the dump process. There could be another reason. A method called many times can also end up captured in each thread dump. Even so, the profile did what it is designed for. It still helps the OPS/SRE team to locate the code having the issue.

The second consideration is overhead: the impact of repeated thread dumps is real and can’t be ignored. In SkyWalking, we set the profile dump period to at least 10ms. This means we can’t locate method performance issues if they complete in less than 10ms. SkyWalking has a threshold to control the maximum parallel degree as well.

Understanding the above keeps distributed tracing and APM systems useful for your OPS/SRE team.

How to Try This

Everything we discussed, including the Apache SkyWalking Java Agent, profile analysis code, and UI, can be found in our GitHub repository. We hope you enjoyed this new profile method, and love Apache SkyWalking. If so, give us a star on GitHub to encourage us.

SkyWalking 7 has just been released. You can contact the project team through the following channels:

Follow SkyWalking twitter.
Subscribe to the mailing list: dev@skywalking.apache.org. Send a mail to dev-subscribe@skywalking.apache.org to subscribe to the mailing list.

Co-author Sheng Wu is a Tetrate founding engineer and the founder and VP of Apache SkyWalking. He is solving the problem of observability for large-scale service meshes in hybrid and multi-cloud environments.

Adrian Cole works in the Spring Cloud team at VMware, mostly on Zipkin

Han Liu is a tech expert at Lagou. He is an Apache SkyWalking committer

Blog: SkyWalking performance in Service Mesh scenario

Fri, 25 Jan 2019 00:00:00 +0000

Author: Hongtao Gao, Apache SkyWalking & ShardingSphere PMC
GitHub, Twitter, Linkedin

The service mesh receiver was first introduced in Apache SkyWalking 6.0.0-beta. It is designed to provide a common entrance for receiving telemetry data from service mesh frameworks, for instance, Istio, Linkerd, Envoy, etc. What is a service mesh? According to Istio’s explanation:

The term service mesh is used to describe the network of microservices that make up such applications and the interactions between them.

As a PMC member of Apache SkyWalking, I have tested the trace receiver and understand the performance of collectors in the trace scenario well. I would also like to figure out the performance of the service mesh receiver.

Difference between trace and service mesh

The following chart presents a typical trace map:

You can find a variety of elements in it, such as web services, local methods, databases, caches, MQ and so on. But the service mesh only collects service network telemetry data, which contains the entrance and exit data of a service for now (more elements, such as databases, will be imported soon). A smaller quantity of data is sent to the service mesh receiver than to the trace receiver.

But using a sidecar is a little different. When a client requests “A”, a segment is sent to the service mesh receiver from “A”’s sidecar. If “A” depends on “B”, another segment will be sent from “A”’s sidecar. But for a trace system, only one segment is received by the collector. The sidecar model splits one segment into smaller segments, which increases the network overhead of the service mesh receiver.

Deployment Architecture

In this test, I will pick two different backend deployments. One is called the mini unit, consisting of one collector and one Elasticsearch instance. The other is a standard production cluster, containing three collectors and three Elasticsearch instances.

The mini unit is a suitable architecture for a dev or test environment. It saves your time and VM resources, and speeds up the deployment process.

The standard cluster provides good performance and HA for a production scenario. Though you will pay more money and need to take care of the cluster carefully, the reliability of the cluster will be a good reward to you.

I picked VMs with 8 CPUs and 16GB of memory to set up the test environment. This test targets the performance of normal usage scenarios, so that choice is reasonable. The cluster is built on Google Kubernetes Engine (GKE), and the nodes are linked to each other through a VPC network. Because running the collector is a CPU-intensive task, the resource request of the collector deployment should be 8 CPUs, which means every collector instance occupies a VM node.

Testing Process

The number of mesh fragments received per second (MPS) depends on the following variables.

Ingress queries per second (QPS)
The topology of a microservice cluster
Service mesh mode(proxy or sidecar)

In this test, I use Bookinfo app as a demo cluster.

So every request will touch at most 4 nodes. With the sidecar mode (every request will send two telemetry data), the MPS will be QPS * 4 * 2.

There are also some important metrics that should be explained:

Client Query Latency: GraphQL API query response time heatmap.
Client Mesh Sender: Mesh segments sent per second. The total line represents the total amount sent and the error line is the total number of failed sends.
Mesh telemetry latency: service mesh receiver handling data heatmap.
Mesh telemetry received: received mesh telemetry data per second.

Mini Unit

You can see that the collector can process up to 25k telemetry data per second. The CPU usage is about 4 cores. Most of the query latency is less than 50ms. After logging in to the VM on which the collector instance is running, I found that the system load was reaching the limit (the max is 8).

According to the previous formula, a single collector instance can process about 3k QPS of Bookinfo traffic.

Standard Cluster

Compared to the mini unit, the cluster’s throughput increases linearly. Three instances provide a total processing power of 80k telemetry data per second. Query latency increases slightly, but it is still small (less than 500ms). I also checked the system load of every collector instance, and all of them had reached the limit. The cluster can process 10k QPS of Bookinfo telemetry data.
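The QPS figures for both deployments come from inverting the MPS formula. A minimal sketch, assuming the same Bookinfo multipliers (4 nodes per request, 2 telemetry data per node); the function name is hypothetical:

```python
def supported_qps(measured_mps, nodes_per_request=4, telemetry_per_node=2):
    """Invert MPS = QPS * nodes * telemetry to get the ingress QPS a backend can absorb."""
    return measured_mps / (nodes_per_request * telemetry_per_node)


print(supported_qps(25_000))  # mini unit: 3125.0, i.e. about 3k QPS
print(supported_qps(80_000))  # standard cluster: 10000.0, i.e. 10k QPS
```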

Conclusion

Let’s wrap up. There are some important things you can take away from this test.

QPS varies with the three variables above. The absolute numbers in this blog matter less than the method; users should pick the values of these variables according to their own systems.
The collector cluster’s processing power can scale out.
The collector is a CPU-intensive application, so you should provide sufficient CPU resources to it.

This blog gives people a common method to evaluate the throughput of the Service Mesh Receiver. Users can use it to design their Apache SkyWalking backend deployment architecture.
