
the avatar of Klaas Freitag

Kraft 2.0 Announcement

With the start of the new year, I am very happy to announce the release of Kraft version 2.0.0.

Kraft provides effective invoicing and document management for small businesses on Linux. Check the feature list.

This new version is a big step ahead for the project. It not only delivers the long-awaited ports to Qt6 and KDE Frameworks 6, along with tons of modernizations and cleanups, but for the first time it also makes significant changes to the underlying architecture and drops outdated technology.

Kraft no longer stores documents in a relational database, but as XML documents in the filesystem. While separate files are more natural for documents anyway, this paves the way for Kraft to integrate with private cloud infrastructures like OpenCloud or Nextcloud via sync. That is not only useful for backups and web apps, but synced data also enables running Kraft as a distributed system, for example when office staff work from different home offices. Expect this and related use cases to be supported in the near future of Kraft.

But there are more features: for example, the document lifecycle was changed to be more compliant. Documents now remain in a draft status until they get finalized, at which point they receive their final document number. From that point on, they can no longer be altered.

There is too much on the long changes list to mention it all here.

However, what is important is that after more than 20 years of developing and maintaining this app, I continue to be motivated to work on it. It is not a big project, but I think it is important that we have this kind of “productivity” application available for Linux, to make it attractive for people to switch to Linux.

Around Kraft, a small but beautiful community has built up. I would like to thank everybody who contributed in any way to Kraft over the years. It is great fun to work with you all!

If you are interested, please get in touch.


pgtwin as OCF Agent

When I was looking for a solution that could provide high availability across two datacenters, the only option that remained viable and comprehensible for me was Corosync/Pacemaker. The reason I actually need this is that mainframe environments typically use two datacenters, since z/OS can operate nicely with that. The application I had to set up is Kubernetes on Linux on Z, and since Kubernetes itself normally runs with 3 or more nodes, I had to find a different solution. I found that I could run Kubernetes with an external database using https://github.com/k3s-io/kine, and being no DBA, I selected PostgreSQL as a first try.

For Pacemaker, there already exists an OCF agent called pgsql (https://linux.die.net/man/7/ocf_heartbeat_pgsql) that is included with the ClusterLabs OCF agents. In addition, there is another OCF agent called PAF (https://clusterlabs.github.io/PAF/) that sounded promising. However, I first had to build it on my own, and later I found that while it is really nicely promoted, it was missing some features I needed.

Then a colleague asked if I wanted to try using his AI, and countless improvements and bug fixes later, the pgtwin agent (https://github.com/azouhr/pgtwin) really seems quite stable. Now, on to some of the main design concepts.

Make use of the promotable clone resource

PostgreSQL’s primary/standby model maps perfectly to promoted/unpromoted. This is also how you would configure pgsql with a current Pacemaker release, and all documentation relies on this configuration schema.
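
To make this concrete, here is a minimal sketch of such a promotable clone in crmsh syntax. The resource class (ocf:heartbeat:pgtwin), the pgdata parameter and the operation timeouts are assumptions for illustration, not pgtwin’s complete resource definition:

# Minimal sketch of a promotable clone; resource class and parameters
# are assumptions, not the full pgtwin setup.
primitive postgres-db ocf:heartbeat:pgtwin \
    params pgdata="/var/lib/pgsql/data" \
    op monitor interval="15s" role="Promoted" timeout="30s" \
    op monitor interval="16s" role="Unpromoted" timeout="30s"
clone postgres-clone postgres-db \
    meta promotable="true" promoted-max="1" promoted-node-max="1" \
         clone-max="2" clone-node-max="1" notify="true"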

Use Physical Replication with Slots

  • Prevent WAL files from being recycled while standby is offline
  • Enable standby to catch up after brief disconnections
  • Automatically created/managed by pgtwin
  • Automatically cleaned up when excessive (prevents disk fill)

Why physical, and not logical replication?

  • Byte-identical replica (all databases, all tables, all objects)
  • Lower overhead than logical replication
  • Supports pg_rewind for timeline divergence recovery
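
As an illustration of the slot mechanics that pgtwin automates, this is roughly what the corresponding manual SQL looks like (the slot name is just an example):

# Create a physical replication slot for the standby (slot name is an example)
sudo -u postgres psql -c "SELECT pg_create_physical_replication_slot('pgtwin_standby');"

# Inspect existing slots, e.g. to spot an inactive slot that retains WAL
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Drop a slot that is no longer needed (prevents unbounded WAL retention)
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('pgtwin_standby');"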

Automatic Standby Initialization

Traditionally, the database admin would have to set up the replication, and the OCF agent would then take over its management. However, since we already had basebackup functionality ready for the case that the WAL had been cleaned up, it was just a small step to provide full initialization.

The only steps on the secondary for the admin after configuring the primary are:

  • Create the PostgreSQL Data Directory with correct ownership/permissions
  • Set up the password file .pgpass

The remaining task of setting up synchronous streaming replication is performed by pgtwin during startup of the node.
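
Conceptually, the automatic initialization boils down to a base backup taken from the current primary. A hedged sketch of the equivalent manual command (host name and slot name are examples, not fixed pgtwin values):

# Roughly what the automatic initialization does under the hood:
#   -X stream  streams WAL while the backup runs
#   -R         writes standby.signal and primary_conninfo for the standby
#   -S         reuses the physical replication slot
sudo -u postgres pg_basebackup -h pgtwin1 -U replicator \
    -D /var/lib/pgsql/data -X stream -R -S pgtwin_standby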

Timeline Divergence and pg_rewind

After a failover, the old primary may have diverged from the new primary, and thus the synchronous replication will fail. pgtwin handles this as follows:

  1. Detects divergence (timeline check in pgsql_demote)
  2. Runs pg_rewind to sync from the new primary
  3. Replays the necessary WAL to reconcile
  4. Starts as standby

This is much faster than trying to do a full basebackup, at least with big databases. Typical failover times are merely seconds.
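
For reference, the manual equivalent of step 2 looks roughly like this (connection details are examples; pgtwin runs this automatically):

# Manual equivalent of the pg_rewind step (connection details are examples);
# the diverged instance must be stopped cleanly first, and pg_rewind needs
# wal_log_hints = on or data checksums enabled.
sudo -u postgres pg_rewind \
    --target-pgdata=/var/lib/pgsql/data \
    --source-server="host=pgtwin2 user=replicator dbname=postgres" \
    --progress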

Replication Health Monitoring

On every monitor cycle, pgtwin does not only check whether PostgreSQL is running, but also the replication health. This includes the replication state (streaming, catchup, etc.) as well as the replication lag and the synchronous state.
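
The checks are based on the usual PostgreSQL statistics views; here is a sketch of the kind of queries involved (pgtwin’s actual checks may differ in detail):

# On the primary: replication state, sync state and lag per standby
sudo -u postgres psql -x -c "SELECT application_name, state, sync_state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"

# On the standby: state of the WAL receiver
sudo -u postgres psql -c "SELECT status, sender_host FROM pg_stat_wal_receiver;"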

If the replication check fails for 5 consecutive monitor cycles (configurable), pgtwin automatically triggers recovery: first with pg_rewind, and if that fails, with pg_basebackup.

Configuration Validation

At startup, pgtwin validates the PostgreSQL configuration for a number of settings that it considers critical. There are hard checks, like “restart_after_crash = off”, which must be set to prevent PostgreSQL from trying to restart itself instead of letting Pacemaker handle the situation, as well as a number of other parameters.
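
To give an idea of what such checks cover, here are a few settings that typically matter for a Pacemaker-managed streaming-replication setup. Only restart_after_crash is named above as a hard check; the rest are assumptions about what a validation list could include:

# Illustrative examples; only restart_after_crash is confirmed above as a
# hard check, the other settings may differ from pgtwin's actual list.
restart_after_crash = off   # Pacemaker, not PostgreSQL, decides about restarts
wal_level = replica         # required for physical streaming replication
hot_standby = on            # allow read-only queries on the standby
wal_log_hints = on          # needed for pg_rewind unless data checksums are on
max_wal_senders = 10        # enough WAL senders for replication and basebackups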

To check the startup validation, have a look at the pacemaker system logs:

journalctl -u pacemaker -f

State Machine and Lifecycle

pgtwin has a clear idea of the PostgreSQL lifecycle states:

┌─────────────────────────────────────────────────────────────┐
│                      STOPPED STATE                          │
│  PostgreSQL not running                                     │
└──────────────────────┬──────────────────────────────────────┘
                       │ start operation
                       ↓
              ┌────────────────┐
              │ PGDATA valid?  │
              └────┬───────┬───┘
                   │       │
             NO ←──┘       └──→ YES
              │                 │
              ↓                 ↓
    ┌──────────────────┐  ┌─────────────────┐
    │ Auto-initialize  │  │ Start PostgreSQL│
    │ (pg_basebackup)  │  │ as standby      │
    └────────┬─────────┘  └────────┬────────┘
             │                     │
             └──────────┬──────────┘
                        ↓
┌─────────────────────────────────────────────────────────────┐
│                   UNPROMOTED STATE                          │
│  PostgreSQL running as standby                              │
│  - Replaying WAL from primary                               │
│  - Read-only queries allowed                                │
│  - Monitor checks replication health                        │
└──────────────────────┬──────────────────────────────────────┘
                       │ promote operation
                       ↓
              ┌────────────────────┐
              │ pg_ctl promote     │
              │ (remove standby    │
              │  signal)           │
              └────────┬───────────┘
                       ↓
┌─────────────────────────────────────────────────────────────┐
│                    PROMOTED STATE                           │
│  PostgreSQL running as primary                              │
│  - Accepts write operations                                 │
│  - Streams WAL to standby                                   │
│  - Manages replication slot                                 │
│  - Monitor checks replication health                        │
└──────────────────────┬──────────────────────────────────────┘
                       │ demote operation
                       ↓
              ┌────────────────────┐
              │ Stop PostgreSQL    │
              │ Check timeline     │
              │ pg_rewind if needed│
              │ Create standby     │
              │ signal             │
              └────────┬───────────┘
                       ↓
       (returns to UNPROMOTED STATE)

Failure Handling

The following failures are handled completely automatically and are designed to provide seamless operation without data loss:

  1. Primary Failure and Recovery
  2. Standby Failure and Recovery
  3. Replication Failure
  4. Split-Brain Prevention

For split-brain prevention, additional Pacemaker configuration will be needed, such as a second Corosync ring with a direct network connection as well as a third ring with IPMI.

Container Mode

pgtwin is prepared to also support containers instead of a locally installed PostgreSQL database. However, the current implementation is too sluggish and has too much overhead during management of the database.

For future releases, I plan to change the implementation by switching from “podman run” to the use of “nsexec”. We will see if this makes the implementation usable. What is currently implemented:

  • A version check that prevents using a container PostgreSQL version that does not match the current PGDATA
  • An additional PostgreSQL user that allows the PGDATA user ID to be used within the container
  • A wrapper that runs all PostgreSQL commands, so that seamless integration between bare-metal and container operations is guaranteed (see the sketch below)
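
The wrapper idea can be sketched like this. This is a simplified illustration; the variable name, container name and function are made up and are not pgtwin’s actual interface:

# Simplified sketch of the command wrapper idea; PGTWIN_CONTAINER and the
# container name "pgtwin-postgres" are made-up placeholders.
pg_exec() {
    if [ "${PGTWIN_CONTAINER:-no}" = "yes" ]; then
        podman exec -u postgres pgtwin-postgres "$@"
    else
        sudo -u postgres "$@"
    fi
}

# The agent can then call PostgreSQL tools the same way in both modes:
pg_exec pg_ctl -D /var/lib/pgsql/data status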

Single-Node Startup

The original authors of pgsql were very careful about the data, even in the case of a double crash of the cluster. The scenario they had in mind was like this:

  • Primary crashes
  • Secondary takes over and handles applications
  • Secondary crashes
  • Primary comes up with outdated data and continues as primary

Now, with pgtwin, a number of considerations go into the startup:

  1. If both nodes come up, pgtwin will check the timelines to decide which node should become promoted
  2. If the cluster was down, and only one node comes up:
    • If the node was primary and had sync mode enabled: the node likely crashed and should not be promoted.
    • If the node was primary and had async mode enabled: the node likely crashed while the other node was already missing. This node should become primary.
    • If the node was secondary: the cluster probably crashed, or was restarted after the secondary crashed; the node should not be promoted.

The key insight here is that, if just one node is restarted, it should only be promoted standalone if it was primary before and, in addition, had async streaming replication activated even though the cluster was configured for sync streaming replication.

Otherwise, the cluster will refuse to start with a single node. If startup is really needed, the admins will have to override this manually.

pgtwin-migrate

In a future blog entry, I will cover the features of the currently experimental pgtwin-migrate OCF agent. This agent allows failing over between two PostgreSQL clusters, for example between two versions or between different vendors.


What does it mean to write in 2026?

I've been writing for something like 50 years now. I started by scribbling letters on paper as a child because I was fascinated that these expressed meaning. I wrote a lot for school, for university, for work, and privately. I wrote letters, emails, posts on social media, articles, papers, documentation, diaries, opinion pieces, and presentations. I've been writing my blog for more than 20 years.

Writing always has been a way for me to connect to the people, to the community, around me, communicating with my tribe. It also has always been a way to express, refine and archive my thoughts, a bit like building a memory of insights. It also has been a way to record some of my personal history and the history of the projects I'm involved with.

My writing has changed over the last couple of years. I'm writing less publicly and more focused on specific projects. It feels like it has become less personal and more utilitarian.

Part of this is that the Internet has lost a good part of its strength as a neutral platform to reach the world. For a long time I knew where to reach the people I wanted to address and had control over my content and how it was distributed. Nowadays social media platforms act as distributors, but we are prey to their algorithms. So while publishing content is still simple, it's much harder to get it to your audience without giving in to the mechanisms which make the algorithms tick.

Another part is the disrupting advance of AI writing capabilities. While I have relied on humans to give me feedback in the past, to get into a conversation on the topics of my posts to refine the thoughts in them, now there is this all-powerful-seeming assistant in my editor who is eager to take over those roles. And it would even write for me in my own style. So what's the value of writing in 2026? Is it even worth bothering with trying to express your thoughts in writing, when a machine can produce content which looks the same, much faster and in much larger quantity? What does this do to readers, do they still care about what I would write?

My feeling is that it's still worth putting in the effort to create genuine, trustworthy, truthful writing. The format, the tools, the channels might change, but the values don't. The challenge will be to figure out how to create a signal which carries these values.

I have always liked the format and style of a blog, as a stream of thoughts, coming from a personal perspective, but focused on topics of relevance to others. I enjoy reading this from others and I enjoy writing in this style. And I don't have to rely on a platform I don't control, but can use my own.

So it looks like this blog won't go away, but will channel my thoughts in 2026 as well.


Tumbleweed – Review of the week 2026/1

Dear Tumbleweed users and hackers,

Happy New Year to you all! While people all around the world are celebrating the new year, Tumbleweed has been tirelessly rolling ahead and has published six snapshots (20251227 – 20251231, 20260101). Naturally, there are no groundbreaking changes, as many developers and maintainers are out celebrating, and any greater coordinated effort is taking a bit more time.

Nevertheless, the six snapshots brought you these changes:

  • Python 3.13.11 (some CVE fixes)
  • libgit2 1.9.2
  • Neon 0.36.0
  • Harfbuzz 12.3.0
  • NetworkManager 1.54.3
  • GStreamer 1.26.10
  • VLC 3.0.22 & 3.0.23: finally linking ffmpeg-8
  • GPG 2.5.16
  • upower 1.91.0

The next snapshot is already in the process of syncing out, and the next few changes are queuing up in the staging projects. You can expect more changes shortly.

Let’s get rolling for the Year 2026! I’m looking forward to a great year!


Path Aware High Availability (PAHA)

During my work on Kubernetes on Linux on Z and the creation of https://github.com/azouhr/pgtwin, I came across the same issue that most admins have to solve in two-node clusters: how can I get quorum, and which node is to be the primary?

While additional techniques help, like providing a second Corosync ring for HA, and even a third ring for an IPMI device, the elegance of a three-node quorum could not easily be achieved in my desired environment.

When trying to solve the correct placement of the primary PostgreSQL database in the two-node cluster, it occurred to me that there is an external dependency that could be used as an arbitrator. It does not really help an application if a resource is available but cannot be reached.

The main insight here was:

**Availability without accessibility is useless**

This pattern shifts HA from “server-centric” (is it running?) to “use-case-centric” (can it be used for its intended purpose?). I did some research, however I could not find anyone describing this key principle as a method to determine placement of resources.

We defined a new term to make this concept handy:

Definition of “Critical Path”:

A critical path is any dependency required for the service to fulfill its designed use case.

Definition of “Path-Aware High Availability (PAHA)”

Path-Aware High Availability is a general clustering pattern where resource promotion decisions explicitly validate critical paths required for service delivery before allowing promotion. Unlike traditional HA which only checks if a service *process* is running, PAHA ensures the service is running on a node where clients can actually use it.

This turned out to be a really interesting thought. Besides network paths, this can also be applied to other paths, totally unrelated to the original use case:

| Use Case | Service | Critical Path | Validation Method |
| --- | --- | --- | --- |
| Database clustering | PostgreSQL | Gateway reachability | Ping gateway from node |
| Storage HA | iSCSI target | Multipath to storage | multipath -ll shows paths |
| FibreChannel SAN | SAN LUN | FC fabric connectivity | fcinfo shows active paths |
| RoCE storage | NVMe-oF target | DCB lossless Ethernet | dcbtool shows PFC enabled |
| API gateway | Kong/Nginx | Upstream service reachable | Health check endpoint |
| Load balancer | HAProxy | Backend pool reachable | TCP connect to backends |
| DNS server | BIND | Root server reachability | Query root servers |
| NFS server | NFS daemon | Export filesystem mounted | mount shows filesystem |
| Container orchestrator | Kubernetes | CNI network functional | Pod-to-pod connectivity |

This can even be used to mitigate sick-but-not-dead conditions. For example, in a multipath environment you might want to disable a path that sometimes shows CRC errors. Even from the storage side, you would know whether there are sufficient paths available and could disable the sick path.

Now to the fun part. It says a lot about Pacemaker that such functionality can be implemented by simple configuration means, at least for network paths. For pgtwin, the question was what happens if ring0 (carrying the PostgreSQL resource) is partially broken. The other ring would keep the cluster running, but the primary with read-write capability would have to be placed on the node with service access.

What we had to do was merely create a ping resource, set up a clone with it, and create a location rule that tells Pacemaker where to place the primary resource. In the case of pgtwin, we additionally prevent the unpromoted resource from running on a node without ping connectivity, because it will likely not be able to sync with the primary. The configuration looks like this:

primitive ping-gateway ocf:pacemaker:ping \
    params \
        host_list="192.168.1.1" \ 
        multiplier="100" \
        attempts="3" \
        timeout="2" \
    op monitor interval="10s" timeout="20s"
clone ping-clone ping-gateway \
    meta clone-max="2" clone-node-max="1"
location prefer-connected-promoted postgres-clone role=Promoted \
    rule 200: pingd gt 0
location require-connectivity-unpromoted postgres-clone role=Unpromoted \
    rule -inf: pingd eq 0

Now, in the assumed case of a dual-datacenter setup, this is what happens if the gateway vanishes on one side:

  1. The cluster makes sure that the primary is on the side with the ping availability.
  2. The secondary is located on the other side.
  3. The secondary may not run there without the ping resource and is stopped.
  4. The primary is notified about the secondary being gone, and switches to async replication mode.

This means that we lose high availability of the PostgreSQL database, but it still serves the applications as usual. When the gateway comes back, the following happens:

  1. The cluster starts pgtwin on the secondary
  2. pgtwin initiates a rollback of the database to get the timelines in sync
  3. If the rollback is unsuccessful, pgtwin initiates a basebackup from the primary
  4. After the nodes are consistent, the database is started as secondary, and the replication is switched to sync again.
  5. The primary node is not moved back, because we set a resource stickiness by default.

All of this happens without admin intervention. This procedure greatly improves availability of the PostgreSQL database for the intended use.

the avatar of Nathan Wolf

Seamless Windows Apps on openSUSE with WinBoat

The author details their successful integration of openSUSE with Microsoft Office 365 using WinBoat, enabling Windows applications in a Linux environment without dual-booting. Despite minor setup challenges, they achieved significant functionality and security with Windows apps like Milestone XProtect and Rufus, appreciating the performance and seamless integration during their workflow.
the avatar of Greg Kroah-Hartman

Linux kernel security work

Lots of the CVE world seems to focus on “security bugs” but I’ve found that it is not all that well known exactly how the Linux kernel security process works. I gave a talk about this back in 2023 and at other conferences since then, attempting to explain how it works, but I also thought it would be good to explain this all in writing as it is required to know this when trying to understand how the Linux kernel CNA issues CVEs.


pgtwin — HA PostgreSQL: Configuration

In the previous blog post, pgtwin — HA PostgreSQL: VM Preparation, we set up two VMs with KVM to prepare for a HA PostgreSQL setup. Now we will configure the Corosync cluster engine, prepare PostgreSQL for synchronous streaming replication, and finally configure Pacemaker to provide high availability.

Configure Corosync

Corosync has its main configuration file located at ‘/etc/corosync/corosync.conf’. Edit this file with the following content, changing the IP addresses according to your setup:

totem {
    version: 2
    cluster_name: pgtwin-devel
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    token: 5000
    join: 60
    max_messages: 20
    token_retransmits_before_loss_const: 10

    # Dual ring configuration
    interface {
        ringnumber: 0
        mcastport: 5405
    }

    interface {
        ringnumber: 1
        mcastport: 5407
    }
}

nodelist {
    node {
        ring0_addr: 192.168.60.13
        ring1_addr: 192.168.61.233
        name: pgtwin1
        nodeid: 1
    }

    node {
        ring0_addr: 192.168.60.83
        ring1_addr: 192.168.61.253
        name: pgtwin2
        nodeid: 2
    }

}

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}

The next step is to create an authentication key for the cluster. Create the key on the first node, and then copy it to the other node:

corosync-keygen -l
scp /etc/corosync/authkey pgtwin2:/etc/corosync/authkey

Note that by default you will not be allowed to access the remote node as root via ssh. This is a good standard for production sites. If you find it inconvenient, you can change the setting by adding a file to /etc/ssh/sshd_config.d. Don’t do this for production environments or externally reachable VMs, though:

# cat /etc/ssh/sshd_config.d/10-permit-root.conf
PermitRootLogin=yes

On both nodes, make sure that the ownership and access rights are correct:

chmod 400 /etc/corosync/authkey
chown root:root /etc/corosync/authkey

Enable and start Corosync and Pacemaker:

systemctl enable corosync
systemctl start corosync

# Wait 10 seconds for Corosync to stabilize
sleep 10

# Check Corosync status
sudo corosync-cfgtool -s

# Enable and start Pacemaker
sudo systemctl enable pacemaker
sudo systemctl start pacemaker                                                                                           

Verify that the cluster is working with ‘crm status’.

Configure PostgreSQL

PostgreSQL will only be configured on the first node. On the second node, only the data directory and the password file ‘.pgpass’ need to be prepared; the pgtwin OCF agent will then perform the initial mirroring and the final replication configuration of the database. Find the mentioned postgresql.custom.conf file at https://github.com/azouhr/pgtwin/blob/main/postgresql.custom.conf. This file holds the default configuration for use with pgtwin. You will want to tweak the parameters according to your usage. Also make sure to use a password that is suitable for your environment.

# Initialize database
sudo -u postgres initdb -D /var/lib/pgsql/data

# Copy the provided PostgreSQL HA configuration
sudo cp /path/to/pgtwin/github/postgresql.custom.conf /var/lib/pgsql/data/postgresql.custom.conf
sudo chown postgres:postgres /var/lib/pgsql/data/postgresql.custom.conf

# Include custom config in main postgresql.conf
sudo -u postgres bash -c "echo \"include = 'postgresql.custom.conf'\" >> /var/lib/pgsql/data/postgresql.conf"

# Configure pg_hba.conf for replication
sudo -u postgres tee -a /var/lib/pgsql/data/pg_hba.conf <<EOF

# Replication connections
host    replication     replicator      192.168.60.0/24       scram-sha-256
host    postgres        replicator      192.168.60.0/24       scram-sha-256
EOF

# Start PostgreSQL manually (temporary)
sudo -u postgres pg_ctl -D /var/lib/pgsql/data start

# Create replication user
sudo -u postgres psql <<EOF
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'SecurePassword123';
GRANT pg_read_all_data TO replicator;
GRANT EXECUTE ON FUNCTION pg_ls_dir(text, boolean, boolean) TO replicator;
GRANT EXECUTE ON FUNCTION pg_stat_file(text, boolean) TO replicator;
GRANT EXECUTE ON FUNCTION pg_read_binary_file(text) TO replicator;
GRANT EXECUTE ON FUNCTION pg_read_binary_file(text, bigint, bigint, boolean) TO replicator;
EOF

# Stop PostgreSQL (cluster will manage it)
sudo -u postgres pg_ctl -D /var/lib/pgsql/data stop

Also add the connection definitions for your applications to the pg_hba.conf file.

The only remaining PostgreSQL configuration is the password file. It needs to be added on both nodes:

# cat /var/lib/pgsql/.pgpass
# Replication database entries (for streaming replication)
pgtwin1:5432:replication:replicator:SecurePassword123
pgtwin2:5432:replication:replicator:SecurePassword123
192.168.60.13:5432:replication:replicator:SecurePassword123
192.168.60.83:5432:replication:replicator:SecurePassword123

# Postgres database entries (required for pg_rewind and admin operations)
pgtwin1:5432:postgres:replicator:SecurePassword123
pgtwin2:5432:postgres:replicator:SecurePassword123
192.168.60.13:5432:postgres:replicator:SecurePassword123
192.168.60.83:5432:postgres:replicator:SecurePassword123

Also set correct permissions for this file, otherwise PostgreSQL will ignore it:

chmod 600 /var/lib/pgsql/.pgpass
chown postgres:postgres /var/lib/pgsql/.pgpass

After adding .pgpass to the second node, the only other thing to prepare there is an empty data directory:

mkdir -p /var/lib/pgsql/data
chown postgres:postgres /var/lib/pgsql/data
chmod 700 /var/lib/pgsql/data

Configure Pacemaker

The final step before starting the HA PostgreSQL setup for the first time is to configure Pacemaker. For first-time users of Pacemaker, this is a daunting configuration, and it needs a lot of consideration. For now, retrieve the already prepared file https://github.com/azouhr/pgtwin/blob/main/pgsql-resource-config.crm and adapt it to your environment.

The values that you have to edit are:

  • VIP address (the virtual IP that is migrated between the cluster nodes and serves as the access address for all applications)
  • Ping gateway address, which allows the cluster to prefer a node with access to the network
  • Node names in several resources; the defaults psql1 and psql2 will become pgtwin1 and pgtwin2, respectively

After editing the file, load it into the cluster with the ‘crm’ command. The configuration can be done on any node, and will be available immediately from any node:

crm configure < pgsql-resource-config.crm

That’s it. The cluster will now try to bring up the PostgreSQL database on both nodes in a HA configuration. You can monitor the process with the command ‘crm_mon’. Note that in the beginning, the secondary node will show failed resources. This is because pgtwin has to perform an initial basebackup on that node. After a while, the output should look similar to this:

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: pgtwin1 (version 3.0.1+20250807.16e74fc4da-1.2-3.0.1+20250807.16e74fc4da) - partition WITHOUT quorum
  * Last updated: Tue Dec 30 12:55:12 2025 on pgtwin1
  * Last change:  Tue Dec 30 12:55:07 2025 by hacluster via hacluster on pgtwin2
  * 2 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ pgtwin1 pgtwin2 ]

Active Resources:
  * postgres-vip        (ocf:heartbeat:IPaddr2):         Started pgtwin1
  * Clone Set: postgres-clone [postgres-db] (promotable):
    * Promoted: [ pgtwin1 ]
    * Unpromoted: [ pgtwin2 ]
  * Clone Set: ping-clone [ping-gateway]:
    * Started: [ pgtwin1 pgtwin2 ]

After the cluster has stabilized, you can perform a number of tests to check the state:

On pgtwin1 (primary):

# Check replication status
sudo -u postgres psql -x -c "SELECT * FROM pg_stat_replication;"

# Expected: One row showing pgtwin2 connected

On pgtwin2 (standby):

# Check if in recovery mode
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"

# Expected: t (true)

Congratulations, you have an HA PostgreSQL database running. To access the database on the primary, just use the command:

sudo -u postgres psql

Since this uses direct socket access, you will have full access to the database without a password. For further tests and more information, have a look at https://github.com/azouhr/pgtwin/blob/main/QUICKSTART_DUAL_RING_HA.md.


pgtwin — HA PostgreSQL: VM Preparation

In my last post, Kubernetes on Linux on Z, I explained why I need a highly available PostgreSQL database to operate K3s. Of course, a HA PostgreSQL setup that works with just two datacenters has many more use cases. Let me explain how to perform an initial setup like the one that I use for development.

Preparation of two VMs

The openSUSE project releases readily prepared Tumbleweed images almost every day. Have a look at https://download.opensuse.org/tumbleweed/appliances/; I typically get an image from there named something like ‘openSUSE-Tumbleweed-Minimal-VM.x86_64-1.0.0-kvm-and-xen-Snapshot20251222.qcow2’. The current image will have a different name, but let’s go with this one for now.

My typical KVM VMs use:

  • 2 CPUs
  • 2 GB memory
  • Raw disk image format
  • Two libvirt networks (ring0 and ring1)
  • Both graphical (VNC) and serial console support

First, convert the image to a raw image. The reason I like to use raw images is that it is much easier to loop-mount such an image in the host operating system, and to grow it with standard commands like kpartx, losetup and dd. You can go with qcow2 if you prefer that format.

qemu-img convert openSUSE-Tumbleweed-Minimal-VM.x86_64-1.0.0-kvm-and-xen-Snapshot20251222.qcow2 pgtwin01.raw

We will need two images of that kind:

cp -a pgtwin01.raw pgtwin02.raw

Since I like to use two network rings for the HA setup (I will go into the details of why this is a good thing in a concepts blog post soon), let’s create two libvirt networks. Attaching real Linux bridges would also be possible. Create two files, ring0.xml and ring1.xml:

# cat ring0.xml
<network>
  <name>ring0</name>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr10' stp='on' delay='0'/>
  <ip address='192.168.60.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.60.2' end='192.168.60.254'/>
    </dhcp>
  </ip>
</network>
# cat ring1.xml
<network>
  <name>ring1</name>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr11' stp='on' delay='0'/>
  <ip address='192.168.61.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.61.2' end='192.168.61.254'/>
    </dhcp>
  </ip>
</network>

After that, define the libvirt networks, and enable autostart:

virsh net-define ring0.xml
virsh net-define ring1.xml
virsh net-autostart ring0
virsh net-autostart ring1

Now, let’s set up the two VMs. The following command will bring up a VM and ask a number of initial questions. This is just the basic setup of a VM, nothing really special there:

virt-install \
  --name pgtwin01 \
  --memory 2048 \
  --vcpus 2 \
  --disk path=/home/claude/images/pgtwin01.raw,format=raw \
  --import \
  --network network=ring0 \
  --network network=ring1 \
  --os-variant opensusetumbleweed \
  --graphics vnc,listen=0.0.0.0 \
  --console pty,target_type=serial

and the same with pgtwin02:

virt-install \
  --name pgtwin02 \
  --memory 2048 \
  --vcpus 2 \
  --disk path=/home/claude/images/pgtwin02.raw,format=raw \
  --import \
  --network network=ring0 \
  --network network=ring1 \
  --os-variant opensusetumbleweed \
  --graphics vnc,listen=0.0.0.0 \
  --console pty,target_type=serial

In case you want to connect to Linux bridges, use “bridge=” instead of “network=”. Typically, I configure ssh access to the two VMs; this normally has been done during the virt-install process. The minimal image from openSUSE configures both network devices with DHCP by default. This is an issue, because the VM ends up with two default gateways defined. Let me explain how to fix this:

# nmcli c s
NAME                UUID                                  TYPE      DEVICE 
Wired connection 1  29df9468-975d-3944-91ca-355ed0c82a3c  ethernet  enp1s0 
Wired connection 2  1f45b334-b429-3823-80eb-a3aafeb33195  ethernet  enp2s0 
lo                  611124a1-fa8e-48d6-84ba-f75733093ca6  loopback  lo

There are two external interfaces configured here. If you check the routing, you will find two default gateway definitions:

ip r s

In this setup, only ring0 is used to connect to the world, and thus the default gateway of ring1 (connected over enp2s0) can be deleted:

nmcli connection modify 1f45b334-b429-3823-80eb-a3aafeb33195 \
    ipv4.gateway "" \
    ipv4.never-default yes

Adapt the UUID and settings to your setup.

For the Pacemaker and PostgreSQL configuration later on, also set up your hostnames and name resolution for the other node. The procedure to set the hostname seems to have changed recently; it now uses hostnamectl instead of just writing the name to /etc/HOSTNAME.

On pgtwin01:
hostnamectl set-hostname pgtwin01
On pgtwin02:
hostnamectl set-hostname pgtwin02

Name resolution works either via your standard DNS system or via /etc/hosts. Find the used IP addresses with ‘ip a s’:

echo "192.168.60.13   pgtwin01" >> /etc/hosts
echo "192.168.60.83   pgtwin02" >> /etc/hosts

Configure the firewall to allow communication between the two VMs:

# Corosync communication
firewall-cmd --permanent --add-port=5405/udp  # Corosync multicast
firewall-cmd --permanent --add-port=5404/udp  # Corosync multicast (alternative)

# Pacemaker communication
firewall-cmd --permanent --add-port=2224/tcp  # pcsd
firewall-cmd --permanent --add-port=3121/tcp  # Pacemaker

# PostgreSQL
firewall-cmd --permanent --add-port=5432/tcp

# Reload firewall
firewall-cmd --reload

The last step for preparing the VMs is installing the cluster software as well as the PostgreSQL database software.

zypper install -y \
    pacemaker \
    corosync \
    crmsh \
    sudo \
    resource-agents \
    fence-agents \
    postgresql18 \
    postgresql18-server \
    postgresql18-contrib

After that, you have two VMs readily installed with two network connections. The next steps will be the setup of Corosync, the initial configuration of the PostgreSQL Database, and finally the cluster resource definitions.


Kubernetes on Linux on Z

This year, I had the task of setting up a Kubernetes environment on a Linux Partition on a s390x system. At first sight, this sounds easy, there are offerings out there that you can purchase. The second look however can make you wonder. There is a structural mismatch between typical Linux on Z environments and Kubernetes:

While Linux on Z typically uses two datacenters as two high availability zones, Kubernetes requires you to have at least three.

This is a fundamental issue that cannot be overcome by just talking about what you did; you really have to dig into the problem and find a solution. I might not know everything, but there is a solution that the Rancher people developed, and it is called kine. This is an etcd shim that allows replacing the etcd database, which requires the three sites for its quorum mechanism, with an external SQL database.
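
For illustration, this is roughly how k3s is pointed at an external PostgreSQL database through kine (host, credentials and database name are placeholders):

# k3s with an external PostgreSQL datastore via kine
# (host, credentials and database name are placeholders):
k3s server \
    --datastore-endpoint="postgres://kine:ChangeMe123@192.168.60.100:5432/kine"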

I am a little adventurous, and thus I told people, we can do that. The plan looked like this:

As you can see, Kubernetes talks to PostgreSQL over kine, and the HA functionality would be provided by PostgreSQL. This first thought was kind of naive, and needed a number of fixes.

  1. PostgreSQL can do streaming replication, however the standard version cannot run multi-master.
  2. The only open source cluster solution I am aware of that works for two nodes is Corosync with Pacemaker. However, the OCF agents there are able to fail over, but after that, a DBA has to restore high availability. Patroni, as the standard solution for PostgreSQL in cloud environments today, does not solve the two-datacenter constraint for me.
  3. The Kubernetes of choice was k3s, however the Rancher people stopped releasing it for s390x.
  4. All the needed containers are open source, but many do not release s390x builds, and some even prevent building for that architecture in their build scripts.

Together with a colleague, I started to work on this project. Fortunately, we had already worked on zCX (Container Extensions on z/OS) and had provided many container images that were missing previously. To make development easier, we utilized OBS and worked on the images in a project that I created for that purpose. It can be found at “home:azouhr:d3v” for those interested. Thus, the fourth issue was just work, but not much of a real challenge.

The main challenges have been a Highly Available PostgreSQL that works on two nodes, as well as building k3s in a reasonable way for s390x. Let me get into some more details of using Corosync and Pacemaker with two node clusters, and what is needed to make that work with PostgreSQL.

Corosync and Pacemaker are the solution used in the HA product of the SUSE Enterprise Server, and it actually supports two nodes if you have an SBD (Storage-Based Death) device at hand. This is typically not an issue for mainframe environments, because those machines normally do not have local disks anyway and always operate a SAN.

Pacemaker uses OCF agents that operate certain programs; in my case, this would be PostgreSQL. I wrote such agents long ago, however it is a daunting prospect to write an agent from scratch, especially when you have to learn the tasks of a PostgreSQL DBA along the way. After pushing the task off for some time, my colleague suggested trying an AI to get started, and what can I say, I was positively surprised with the result. I chose to give it a try. Since I did not want the AI to have too many rights on my home laptop, the setup I am using looks like this:

I know that many developers don’t like what they get from an AI, and I have to admit, I did not even try to run the first three or four versions that the AI produced. However, after a while the solution stabilized, and I could concentrate on smaller aspects of the OCF agent that “we” created. A recent state of what we produced can be found at https://github.com/azouhr/pgtwin. Note that I did not publish all the different design documents by far; that would be more than 250 documents, covering the different aspects of how the OCF agent should operate.

Some experiences with the AI:

  • AIs like to proceed, even if a thought is not ready. Working on the design is important; just don’t let an AI produce code when you are not yet confident that you are both at the same level of understanding.
  • AIs can easily skim through massive amounts of log files, and also find and fix issues on their own. I personally like to challenge solutions to issues when I feel that the solution is not perfect. This may lead to several iterations of newly proposed solutions.
  • AIs sometimes solve issue A and break B, only to solve B and break A. They are happy to go on like this forever. Whenever you find a problem reoccurring, you have to dig deeper into the issue. Let the AI explain what happens, create assumptions, and let it explore different paths.
  • AIs sometimes stumble into the same issues that have been discussed earlier. I found that starting the discussion over again is tedious. Instead, ask the AI why it cannot use the solution from the previous location.
  • AIs sometimes try to figure things out without having enough data. Instead of adding debug information or tracing like any programmer would do, they just start experimenting. It often helps a great deal just to tell them to switch on tracing, or to use tools like strace to get more information.
  • Finally, you always have to manually review the result. My personal procedure is to add comments into the code that can easily be found with grep, and later tell the AI to fix the comments.
  • AIs have a date weakness. They like to confuse years and other numbers in dates. That’s why the release dates of pgtwin look confusing.
  • A little warning about documentation and promotion: obviously, AIs have been trained a lot on marketing prose. They typically claim something is enterprise-ready as soon as it has run once. After it is declared “enterprise ready”, I typically find quite a number of issues just by looking at the code.
  • Still, it is impressive how easily the code can be read, and how well it is documented. For someone who has read quite some code over the years, it is really nice to look at. Also, that amount of code would not be possible in such a short timeframe for a normal developer.

In my next post, I will go over the design of https://github.com/azouhr/pgtwin and explore the main features and concepts that I have been working on over the last weeks. I hope to create another bugfix release soon; however, the agent already works quite well as it is.