Skip to main content

the avatar of Open Build Service

Build Results Summary Chart Links to Build Results Overview

the avatar of Federico Mena-Quintero

Rustifying libipuz: character sets

It has been, what, like four years since librsvg got fully rustified, and now it is time to move another piece of critical infrastructure to a memory-safe language.

I'm talking about libipuz, the GObject-based C library that GNOME Crosswords uses underneath. This is a library that parses the ipuz file format and is able to represent various kinds of puzzles.

The words "GNOME CROSSWORDS" set inside a crossword
puzzle

Libipuz is an interesting beast. The ipuz format is JSON with a lot of hair: it needs to represent the actual grid of characters and their solutions, the grid's cells' numbers, the puzzle's clues, and all the styling information that crossword puzzles can have (it's more than you think!).

{
    "version": "http://ipuz.org/v2",
    "kind": [ "http://ipuz.org/crossword#1", "https://libipuz.org/barred#1" ],
    "title": "Mephisto No 3228",
    "styles": {
        "L": {"barred": "L" },
        "T": {"barred": "T" },
        "TL": {"barred": "TL" }
    },
    "puzzle":   [ [  1,  2,  0,  3,  4,  {"cell": 5, "style": "L"},  6,  0,  7,  8,  0,  9 ],
                  [  0,  {"cell": 0, "style": "L"}, {"cell": 10, "style": "TL"},  0,  0,  0,  0,  {"cell": 0, "style": "T"},  0,  0,  {"cell": 0, "style": "T"},  0 ]
                # the rest is omitted
    ],
    "clues": {
        "Across": [ {"number":1, "clue":"Having kittens means losing heart for home day", "enumeration":"5", "cells":[[0,0],[1,0],[2,0],[3,0],[4,0]] },
                    {"number":5, "clue":"Mostly allegorical poet on writing companion poem, say", "enumeration":"7", "cells":[[5,0],[6,0],[7,0],[8,0],[9,0],[10,0],[11,0]] },
                ]
        # the rest is omitted
    }
}

Libipuz uses json-glib, which works fine to ingest the JSON into memory, but then it is a complete slog to distill the JSON nodes into C data structures. You need iterate through each node in the JSON tree and try to fit its data into yours.

Get me the next node. Is the node an array? Yes? How many elements? Allocate my own array. Iterate the node's array. What's in this element? Is it a number? Copy the number to my array. Or is it a string? Do I support that, or do I throw an error? Oh, don't forget the code to meticulously free the partially-constructed thing I was building.

This is not pleasant code to write and test.

Ipuz also has a few mini-languages within the format, which live inside string properties. Parsing these in C unpleasant at best.

Differences from librsvg

While librsvg has a very small GObject-based API, and a medium-sized library underneath, libipuz has a large API composed of GObjects, boxed types, and opaque and public structures. Using libipuz involves doing a lot of calls to its functions, from loading a crossword to accessing each of its properties via different functions.

I want to use this rustification as an exercise in porting a moderately large C API to Rust. Fortunately, libipuz does have a good test suite that is useful from the beginning of the port.

Also, I want to see what sorts of idioms appear when exposing things from Rust that are not GObjects. Mutable, opaque structs can just be passed as a pointer to a heap allocation, i.e. a Box<T>. I want to take the opportunity to make more things in libipuz immutable; currently it has a bunch of reference-counted, mutable objects, which are fine in single-threaded C, but decidedly not what Rust would prefer. For librsvg it was very beneficial to be able to notice parts of objects that remain immutable after construction, and to distinguish those parts from the mutable ones that change when the object goes through its lifetime.

Let's begin!

In the ipuz format, crosswords have a character set or charset: it is the set of letters that appear in the puzzle's solution. Internally, GNOME Crosswords uses the charset as a histogram of letter counts for a particular puzzle. This is useful information for crossword authors.

Crosswords uses the histogram of letter counts in various important algorithms, for example, the one that builds a database of words usable in the crosswords editor. That database has a clever format which allows answering questions like the following quickly: What words in the database match ?OR??WORDS and CORES will match.

IPuzCharset is one of the first pieces of code I worked on in Crosswords, and it later got moved to libipuz. Originally it didn't even keep a histogram of character counts; it was just an ordered set of characters that could answer the question, "what is the index of the character ch within the ordered set?".

I implemented that ordered set with a GTree, a balanced binary tree. The keys in the key/value tree were the characters, and the values were just unused.

Later, the ordered set was turned into an actual histogram with character counts: keys are still characters, but each value is now a count of the coresponding character.

Over time, Crosswords started using IPuzCharset for different purposes. It is still used while building and accessing the database of words; but now it is also used to present statistics in the crosswords editor, and as part of the engine in an acrostics generator.

In particular, the acrostics generator has been running into some performance problems with IPuzCharset. I wanted to take the port to Rust as an opportunity to change the algorithm and make it faster.

Refactoring into mutable/immutable stages

IPuzCharset started out with these basic operations:

/* Construction; memory management */
IPuzCharset          *ipuz_charset_new              (void);
IPuzCharset          *ipuz_charset_ref              (IPuzCharset       *charet);
void                  ipuz_charset_unref            (IPuzCharset       *charset);

/* Mutation */
void                  ipuz_charset_add_text         (IPuzCharset       *charset,
                                                     const char        *text);
gboolean              ipuz_charset_remove_text      (IPuzCharset       *charset,
                                                     const char        *text);

/* Querying */
gint                  ipuz_charset_get_char_index   (const IPuzCharset *charset,
                                                     gunichar           c);
guint                 ipuz_charset_get_char_count   (const IPuzCharset *charset,
                                                     gunichar           c);
gsize                 ipuz_charset_get_n_chars      (const IPuzCharset *charset);
gsize                 ipuz_charset_get_size         (const IPuzCharset *charset);

All of those are implemented in terms of the key/value binary tree that stores a character in each node's key, and a count in the node's value.

I read the code in Crosswords that uses the ipuz_charset_*() functions and noticed that in every case, the code first constructs and populates the charset using ipuz_charset_add_text(), and then doesn't modify it anymore — it only does queries afterwards. The only place that uses ipuz_charset_remove_text() is the acrostics generator, but that one doesn't do any queries later: it uses the remove_text() operation as part of another algorithm, but only that.

So, I thought of doing this:

  • Split things into a mutable IPuzCharsetBuilder that has the add_text / remove_text operations, and also has a build() operation that consumes the builder and produces an immutable IPuzCharset.

  • IPuzCharset is immutable; it can only be queried.

  • IPuzCharsetBuilder can work with a hash table, which turns the "add a character" operation from O(log n) to O(1) amortized.

  • build() is O(n) on the number of unique characters and is only done once per charset.

  • Make IPuzCharset work with a different hash table that also allows for O(1) operations.

Basics of IPuzCharsetBuilder

IPuzCharsetBuilder is mutable, and it can live on the Rust side as a Box<T> so it can present an opaque pointer to C.

#[derive(Default)]
pub struct CharsetBuilder {
    histogram: HashMap<char, u32>,
}

// IPuzCharsetBuilder *ipuz_charset_builder_new (void); */
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_new() -> Box<CharsetBuilder> {
    Box::new(CharsetBuilder::default())
}

For extern "C", Box<T> marshals as a pointer. It's nominally what one would get from malloc().

Then, simple functions to create the character counts:

impl CharsetBuilder {
    /// Adds `text`'s character counts to the histogram.
    fn add_text(&mut self, text: &str) {
        for ch in text.chars() {
            self.add_character(ch);
        }
    }

    /// Adds a single character to the histogram.
    fn add_character(&mut self, ch: char) {
        self.histogram
            .entry(ch)
            .and_modify(|e| *e += 1)
            .or_insert(1);
    }
}

The C API wrappers:

use std::ffi::CStr;

// void ipuz_charset_builder_add_text (IPuzCharsetBuilder *builder, const char *text);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_add_text(
    builder: &mut CharsetBuilder,
    text: *const c_char,
) {
    let text = CStr::from_ptr(text).to_str().unwrap();
    builder.add_text(text);
}

CStr is our old friend that takes a char * and can wrap it as a Rust &str after validating it for UTF-8 and finding its length. Here, the unwrap() will panic if the passed string is not UTF-8, but that's what we want; it's the equivalent of an assertion that what was passed in is indeed UTF-8.

// void ipuz_charset_builder_add_character (IPuzCharsetBuilder *builder, gunichar ch);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_add_character(builder: &mut CharsetBuilder, ch: u32) {
    let ch = char::from_u32(ch).unwrap();
    builder.add_character(ch);
}

Somehow, the glib-sys crate doesn't have gunichar, which is just a guint32 for a Unicode code point. So, we take in a u32, and check that it is in the appropriate range for Unicode code points with char::from_u32(). Again, a panic in the unwrap() means that the passed number is out of range; equivalent to an assertion.

Converting to an immutable IPuzCharset

pub struct Charset {
    /// Histogram of characters and their counts plus derived values.
    histogram: HashMap<char, CharsetEntry>,

    /// All the characters in the histogram, but in order.
    ordered: String,

    /// Sum of all the counts of all the characters.
    sum_of_counts: usize,
}

/// Data about a character in a `Charset`.  The "value" in a key/value pair where the "key" is a character.
#[derive(PartialEq)]
struct CharsetEntry {
    /// Index of the character within the `Charset`'s ordered version.
    index: u32,

    /// How many of this character in the histogram.
    count: u32,
}

impl CharsetBuilder {
    fn build(self) -> Charset {
        // omitted for brevity; consumes `self` and produces a `Charset` by adding
        // the counts for the `sum_of_counts` field, and figuring out the sort
        // order into the `ordered` field.
    }
}

Now, on the C side, IPuzCharset is meant to also be immutable and reference-counted. We'll use Arc<T> for such structures. One cannot return an Arc<T> to C code; it must first be converted to a pointer with Arc::into_raw():

// IPuzCharset *ipuz_charset_builder_build (IPuzCharsetBuilder *builder);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_build(
    builder: *mut CharsetBuilder,
) -> *const Charset {
    let builder = Box::from_raw(builder); // get back the Box from a pointer
    let charset = builder.build();        // consume the builder and free it
    Arc::into_raw(Arc::new(charset))      // Wrap the charset in Arc and get a pointer
}

Then, implement ref() and unref():

// IPuzCharset *ipuz_charset_ref (IPuzCharset *charet);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_ref(charset: *const Charset) -> *const Charset {
    Arc::increment_strong_count(charset);
    charset
}

// void ipuz_charset_unref (IPuzCharset *charset);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_unref(charset: *const Charset) {
    Arc::decrement_strong_count(charset);
}

The query functions need to take a pointer to what really is the Arc<Charset> on the Rust side. They reconstruct the Arc with Arc::from_raw() and wrap it in ManuallyDrop so that the Arc doesn't lose a reference count when the function exits:

// gsize ipuz_charset_get_n_chars (const IPuzCharset *charset);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_get_n_chars(charset: *const Charset) -> usize {
    let charset = ManuallyDrop::new(Arc::from_raw(charset));
    charset.get_n_chars()
}

Tests

The C tests remain intact; these let us test all the #[no_mangle] wrappers.

The Rust tests can just be for the internals, simliar to this:

    #[test]
    fn supports_histogram() {
        let mut builder = CharsetBuilder::default();

        let the_string = "ABBCCCDDDDEEEEEFFFFFFGGGGGGG";
        builder.add_text(the_string);
        let charset = builder.build();

        assert_eq!(charset.get_size(), the_string.len());

        assert_eq!(charset.get_char_count('A').unwrap(), 1);
        assert_eq!(charset.get_char_count('B').unwrap(), 2);
        assert_eq!(charset.get_char_count('C').unwrap(), 3);
        assert_eq!(charset.get_char_count('D').unwrap(), 4);
        assert_eq!(charset.get_char_count('E').unwrap(), 5);
        assert_eq!(charset.get_char_count('F').unwrap(), 6);
        assert_eq!(charset.get_char_count('G').unwrap(), 7);

        assert!(charset.get_char_count('H').is_none());
    }

Integration with the build system

Libipuz uses meson, which is not particularly fond of cargo. Still, cargo can be used from meson with a wrapper script and a bit of easy hacks. See the merge request for details.

Further work

I've left the original C header file ipuz-charset.h intact, but ideally I'd like to automatically generate the headers from Rust with cbindgen. Doing it that way lets me check that my assumptions of the extern "C" ABI are correct ("does foo: &mut Foo appear as Foo *foo on the C side?"), and it's one fewer C-ism to write by hand. I need to see what to do about inline documentation; gi-docgen can consume C header files just fine, but I'm not yet sure about how to make it work with generated headers from cbindgen.

I still need to modify the CI's code coverage scripts to work with the mixed C/Rust codebase. Fortunately I can copy those incantations from librsvg.

Is it faster?

Maybe! I haven't benchmarked the acrostics generator yet. Stay tuned!

the avatar of Hollow Man's Blog

My Igalia Coding Experience 2023 I & II at Wolvic

Wolvic is a fast and secure browser for standalone virtual-reality and augmented-reality headsets. ex. Mozilla Firefox Reality.

Project summaries

List of things I have done

PRs opened/handled

Issues opened/helped with

the avatar of Nathan Wolf

du | Directory Size in the Terminal

This is a letter to future me for the next time I need to look up the disk usage in the terminal. If you find this useful, great, if you think this is lacking and unhelpful, that’s fine too. I don’t always remember how I used various commands in the terminal when there are weeks […]
the avatar of Nathan Wolf

a silhouette of a person's head and shoulders, used as a default avatar

openSUSE Tumbleweed – Review of the weeks 2024/07

Dear Tumbleweed users and hackers,

This week we have released 5 snapshots (0209, 0211, 0212, 0213, and 0214). With 5 snapshots, this is quite a normal week.

The most relevant changes in those snapshots were:

  • SDL 2.30.0
  • c-ares 1.26.0 (after a lengthy staging phase)
  • fwupd 1.9.13
  • PostgreSQL 16.2
  • Pulseaudio 17.0
  • GTK 4.12.5
  • Python 3.11.8
  • KDE Frameworks 5.115.0
  • RPM 4.19.1.1
  • Node.JS 21.6.1

The list of things currently being tested remained largely unchanged:

  • Meson 1.3.2
  • Shadow 4.14.5
  • pkgconf 2.1.1
  • RPM: enable reproducible builds by default (bsc#1148824). For upstream versions see: https://github.com/rpm-software-management/rpm/pull/2880
  • A bunch of cleanup work to eliminate more of python2 (boo#1219306)
  • dbus-broker: a big step forward; upgrades seem to be an issue that needs to be addressed
  • libxml 2.12.x: slow progress
  • GCC 14: our usual 2-phase approach to introduce it. Currently working on phase 1, meaning GCC14 will be providing the base libraries (libgcc_s1, libstdc++…). The compiler itself will stay at version 13 for now. Only one issue left: qemu fails to build
the avatar of openSUSE News

Exploring Agama's 2024 Roadmap

A recent post on the YaST blog about Agama’s roadmap looks at the new installer as functional enough to embark on tasks ranging from localization and network configuration to storage setup and initial software selection.

For those who don’t follow the YaST blog, here is what lies ahead for Agama in 2024.

The team has outlined a strategy for this year and, despite the fluidity of its development, the team is committed to a steady release schedule for Agama with two significant milestones. The first is set for mid-April and the other toward mid-July.

The milestone in April is set to revolutionize Agama’s architecture. It will be moving away from its reliance on Cockpit toward a more autonomous framework that is coupled with a refined user interface that aims to streamline storage configurations.

The aim of the second milestone is to improve Agama’s flexibility and capabilities for unattended installations, which seeks to position Agama as a formidable alternative to AutoYaST.

The scaffolding provided by the Cockpit Project makes the vision for Agama’s future clear in evolving a direction of a new path. The coming months will be dedicated to redefining this approach to ensure Agama’s growth is unhindered by external dependencies.

While architectural modifications lay the groundwork for future advancements, an equal focus must be made to enhance the user experience. The revamped storage configuration interface will be both user-friendly for newcomers and more versatile for the experience. This aims to provide a balance of simplicity and customization.

The openSUSE Conference 2024 is nestled between the milestones and the team will make use of this event to serve as a platform for discussing Agama’s potential to redefine the installation experience within the openSUSE ecosystem. insights and contributions are vital to Agama’s success so stakeholders are encouraged to engage with the team, share ideas and participate in the ongoing development of Agama.

Read more information about Agama on the YaST blog.

a silhouette of a person's head and shoulders, used as a default avatar

The syslog-ng Insider 2024-02: OpenObserve; configuration check; build services;

The February syslog-ng newsletter is now on-line:

  • Version 4.5.0 of syslog-ng is now available with OpenObserve JSON API support
  • Syslog-ng PE can now send logs to Google BigQuery
  • Syslog-ng can now do a full configuration check
  • How build services make life easier for upstream developers

It is available at https://www.syslog-ng.com/community/b/blog/posts/the-syslog-ng-insider-2024-02-openobserve-configuration-check-build-services

syslog-ng logo

the avatar of openSUSE News

Contribution Sessions to Begin Tomorrow

The openSUSE community is pleased to announce that it will have short sessions aimed at encouraging people on how to contribute to the project.

A group of volunteers will present short 15-minute sessions that are streamed and/or recorded on openSUSE’s YouTube channel that are aimed at teaching people about packaging, using the Open Build Service, creating tests for openQA and other development areas.

The first session about “Basic use of OBS/osc using a version bump as an example” is set to begin tomorrow, on Feb. 15 at 21:00 UTC.

Update: The “Packaging Guidelines (Patch Policies) and Submission of New Packages” session scheduled for Feb. 27 at 16:00 UTC has been postponed.

More sessions are expected to be scheduled for future dates.

The sessions are listed on the openSUSE Calendar; look for the Contribution Workshop sessions marked in orange.

Those who are interested in presenting should fill in the blank area for future sessions listed in the email about the events.

Giving a session is a great way to give back to the community and provides opportunities to teach others skills and knowledge about open-source development.

the avatar of Greg Kroah-Hartman

Linux is a CNA

As was recently announced, the Linux kernel project has been accepted as a CNA as a CVE Numbering Authority (CNA) for vulnerabilities found in Linux.

This is a trend, of more open source projects taking over the haphazard assignments of CVEs against their project by becoming a CNA so that no other group can assign CVEs without their involvment. Here’s the curl project doing much the same thing for the same reasons. I’d like to point out the great work that the Python project has done in supporting this effort, and the OpenSSF project also encouraging it and providing documentation and help for open source projects to accomplish this. I’d also like to thank the cve.org group and board as they all made the application process very smooth for us and provided loads of help in making this all possible.