Data Modelling in Rust

Intro

Recently I've been experimenting on how to efficiently and ergonomically model data types containing mutual and cyclic references in Rust. In a language with built in garbage collector like Java, C# etc., this is relatively straightforward as the GC handles cycles and object deallocation automatically for you. In Rust there's no included general purpose GC, instead there are many helper types for facilitating memory handling included in the standard library. There are many ways of solving the problem, each with different trade offs and characteristics. In this post I'll present some of the alternatives available using the Rust standard library.

First I should point out that I'm not an expert Rust developer, so this experimentation has been part of my Rust learning process, but I hope it can provide some ideas and insights for new Rust developers coming from a GC language. Please point any stupid errors I might have made, or just things I've totally missed.

Example Game

As an example I will use a very simple game which consists of two data types: a game and a player. The game type contains references to all the players in the game. A player has a name and health, contains a reference to the game he/she belongs to and also a collection of friends which are references to other players in the game. The data model is not thread safe and meant to be used from a single thread.

Scala Code

Here is the game example written in Scala code:

class Player(val game: Game, var name: String, var health: Int) {
  val friends = new ArrayBuffer[Player]

  def make_friends(p: Player): Unit = {
    friends += p
    p.friends += this
  }
}

class Game {
  val players = new ArrayBuffer[Player]

  def create_player(name: String, health: Int): Player = {
    val p = new Player(this, name, health)
    players += p
    p
  }
}

I think the code is pretty self explanatory. Similar code can be written in any common GC language like Java, Kotlin, C# etc. As you can see in the code there are reference cycles and mutable fields. A requirement is that you should be able follow any reference and mutate the data it points to. As we will see, these properties presents some interesting challenges when porting the code to Rust.

Here's a small usage example in Scala that creates a game, adds some players and prints the first players friends:

    val game = new Game

    val p1 = game.create_player("Eric", 10)
    val p2 = game.create_player("Tom", 15)
    val p3 = game.create_player("Carl", 17)

    p1.make_friends(p2)
    p1.make_friends(p3)

    p2.health = 20

    for (p <- p1.friends)
      println(p.name + ": " + p.health)

Nothing complicated here, just natural looking code. I want to get as close as possible to this code in safe Rust code (although some collection types mentioned later could benefit from using unsafe Rust).

Btw, all the Rust source code is available at GitHub if you wanna check it out. 

Solution 1: Reference Counting

Probably the most straightforward and naive way to implement the same example in Rust is to use reference counting, after all this is most similar to a general purpose GC. The Rust standard library contains a type called Rc for this purpose (for single threaded code, the thread safe alternative is called Arc). However, since we have cycles in our code we must also use the Weak type for reverse references, otherwise the object cycles would never be de-allocated.

Here is the implementation:

struct Player {
    game: GameWRef,
    name: String,
    health: i32,
    friends: Vec<PlayerWRef>,
}

type PlayerRef = Rc<RefCell<Player>>;
type PlayerWRef = Weak<RefCell<Player>>;

fn make_friends(player1: &PlayerRef, player2: &PlayerRef) {
    player1.borrow_mut().friends.push(Rc::downgrade(player2));
    player2.borrow_mut().friends.push(Rc::downgrade(player1));
}

#[derive(Default)]
struct Game {
    players: Vec<PlayerRef>,
}

type GameRef = Rc<RefCell<Game>>;
type GameWRef = Weak<RefCell<Game>>;

fn create_player(game: &GameRef, name: &str, health: i32) -> PlayerRef {
    let p: Rc<RefCell<Player>> = Rc::new(RefCell::new(Player {
        game: Rc::downgrade(game),
        name: name.into(),
        health,
        friends: Default::default(),
    }));

    game.borrow_mut().players.push(p.clone());
    p
}

Note that our data types must be wrapped in RefCell's as the borrow checker will not allow us to have multiple references to the same object if one of them is mutable. RefCell moves this check to runtime and if we were to break that rule the application will panic and terminate.

Here is the same usage example ported from the Scala code:

pub fn run_game() {
    let game: GameRef = Default::default();

    let p1 = create_player(&game, "Eric", 10);
    let p2 = create_player(&game, "Tom", 15);
    let p3 = create_player(&game, "Carl", 17);

    make_friends(&p1, &p2);
    make_friends(&p1, &p3);

    p2.borrow_mut().health = 20;

    for x in p1.borrow().friends.iter() {
        println!(
            "{}: {}",
            x.upgrade().unwrap().borrow().name,
            x.upgrade().unwrap().borrow().health
        )
    }
}

There are some downsides to this solution:

  • Reference counts must be updated every time a reference is cloned or dropped. LLVM can probably optimize many cases, but the overhead is still noticeable.
  • A RefCell performs borrow checks at runtime every time it's dereferenced. It's a simple check, but it also comes with some performance penalty.
  • If a runtime borrow check fails, the application will panic and terminate. It would be preferable to detect the errors at compile time.
  • The reference types gets quite complex. I've created some type aliases for the reference types to help both writing and reading the code.
  • There's a lot of boilerplate added when using the references: borrow(), borrow_mut(), upgrade(), downgrade() etc. That's a real downside compared to the clean Scala code.
  • You can't add member functions to the data types as you need the Rc and RefCell objects in the functions. This makes a bit more cumbersome to use compared to the dot style method calling.

Solution 2: Plain RefCell's and Indices

Let's try to address some of the problems with solution 1. First we can note that since the game object contains all the players, we don't really need to reference count them, we can just let the game object own them and give out references to them (we just have to be careful to remove all references when we actually delete a player object). However, we still need to use RefCell's though as we can have multiple references to the same object and want to be able to mutate it.

But that leads to a new problem: because of Rust's borrowing rules we can only get temporary references to something inside a RefCell and we can't really store them anywhere. Hmm, sounds like we're stuck...but wait, there a work around: instead of using normal references we can create our own reference type using indices into the player array. These indices can be freely passed around and stored in objects. And from a reference to the game object and a player index we can extract a temporary player reference that we can use in a function.

Here's how the reference type can look like:

type GameRef<'t> = &'t RefCell<Game<'t>>;

#[derive(Copy, Clone)]
struct PlayerRef<'t> {
    game: GameRef<'t>,
    index: usize,
}

impl<'t> PlayerRef<'t> {
    fn borrow(&self) -> Ref<'_, Player<'t>> {
        Ref::map(self.game.borrow(), |r| &r.players[self.index])
    }

    fn borrow_mut(&self) -> RefMut<'_, Player<'t>> {
        RefMut::map(self.game.borrow_mut(), |r| &mut r.players[self.index])
    }

    fn make_friends(self, player2: PlayerRef<'t>) {
        self.borrow_mut().friends.push(player2);
        player2.borrow_mut().friends.push(self);
    }
}

Basically we have re-implemented the borrow() and borrow_mut() methods from RefCell and mapped them to the players field in the Game type. The player reference is sort of a "fat pointer" containing both a reference to the Game RefCell and the player index. The reference object can be freely copied and stored. Note that since we have our own reference type we can place object methods there (like the make_friends() method) to make it a bit more convenient to use.

Also note since we use a normal const reference to the game object we need add lifetime parameter to our Game and Player types, so they now look like this:

struct Player<'t> {
    game: GameRef<'t>,
    name: String,
    health: i32,
    friends: Vec<PlayerRef<'t>>,
}

struct Game<'t> {
    players: Vec<Player<'t>>,
}

Ok, I think we made some improvements to solution 1:

  • No more reference counting runtime overhead
  • A bit less boilerplate when using references since we skip the Rc wrapping

But we also have some new downsides:

  • We have to be careful to manually remove all references to a player object before we delete it from the game, for example we must remove it from all friend lists
  • We had to add lifetime parameters to our data types which makes it a bit cumbersome to use them
  • We've added a level of indirection by using Vec indices instead of direct references

  • We have to implement quite a lot of boilerplate for each type of object that we store in the game object and want to have references to

So, this solution is not perfect, and we can still make some improvements.

Solution 3: Interior Mutability and Pooling

Let's try to get rid of the RefCell's, the runtime overhead and panics are not so nice side effects. Also, by getting rid of the RefCell's we might be able to get rid of Vec indices which would be another benefit. We still want the game object to own the player objects, but now hand out plain const references to player objects instead of indices. But that means we can't mutate anything via those references, right? Wrong, this is where interior mutability comes in handy. Since we are using a single execution thread, we can use the Cell type from the Rust standard library. It will work on any object field which type is Copy, i.e. most small and primitive types like numbers, booleans etc., for example our Player.health field can easily be converted to type Cell<i32> and then we can modify it through a const player reference.

But now we face another issue, we still want to be able to add friends and players through our const references, and that excludes using normal Vec collections as they can't be mutated through const references. We have to switch to other collection types which use interior mutability. Maybe there are crates with such collection types (please let me know), but it's not too hard to create our own for this simple example.

For the player vector we use this type (here's the full source code):

pub struct FixedVec<'t, T: Clear> {
    items: &'t [T],
    is_occupied: Vec<Cell<bool>>,
    size: Cell<usize>,
}

Basically it contains a fixed size array (remember we can't move objects that we have references too) that is used as an object pool (or arena). The pool is initialized with default objects and then there's an additional Vec of the same fixed size to keep track of which objects are actually used. This vector uses Cell for interior mutability so we can "allocate" and "free" objects through a const reference. The collection type also tracks how many objects are actually allocated in the size field.

This object pool vector is probably most efficiently implemented using MaybeUninit and unsafe code, but I wanted to keep the code simple and just use safe Rust. To solve it in safe Rust I added a Clear trait which is called on free. It's similar to Drop, but doesn't move the object (as we want to be able to reuse it).

Anyway, now our game type becomes:

struct Game<'t> {
    players: FixedVec<'t, Player<'t>>,
}

type GameRef<'t> = &'t Game<'t>;

And we can hand out normal Player references:

type PlayerRef<'t> = &'t Player<'t>;

Awesome! No runtime overhead (except for Cell which I think has minimal overhead) and very simple to use!

However, we still have to replace the Player friends list which contains references to other players. This data type is simpler to implement:

pub struct RefSet<'t, T> {
    items: Vec<Cell<Option<&'t T>>>,
}

Basically it's just a fixed size Vec containing Cell elements for interior mutability. Here we can use a plain Option for tracking which elements are occupied so no default initialization or Drop/Clear complications to worry about. You just have to set a limit on how many friends a player can have at the most.

This is how straightforward the usage is now:

pub fn run_game() {
    let players = vec![Player::default(); 100].into_boxed_slice();
    let game = Game::new(&players);

    let p1 = game.create_player("Eric", 10);
    let p2 = game.create_player("Tom", 15);
    let p3 = game.create_player("Carl", 17);

    p1.make_friends(p2);
    p1.make_friends(p3);

    p2.health.set(20);

    for x in p1.friends.iter() {
        println!("{}: {}", x.name.borrow(), x.health.get())
    }
}

To my eyes this is very close to the original Scala code! The only wart is we still need to use RefCell for the Player.name field, but that can be solved by implementing a fixed size String type with interior mutability (maybe there is a crate for this as well?). You probably don't want your players to have 200 character names anyway 😉

A side effect of using fixed size pools is that we get rid of basically all heap allocation after application initialization. Pretty much everything is pre-allocated, so if you don't have enough RAM you will notice it immediately on application startup. Of course, it depends on the application if this is good or bad, or even acceptable at all. As a compromise it should be possible to implement growing object pools which allocates new vectors from the heap on demand, but still only uses interior mutability.

Solution 3+: Static Data

But if we pre-allocate everything we might as well use static data and get rid of the lifetime parameters on our data types!

To achieve this in Rust, we need to leak the allocations which will allow us to get references with static lifetimes:

    let players = Box::leak(vec![Player::default(); 100].into_boxed_slice());
    let game = Box::leak(Box::new(Game::new(players)));

Then we can remove the lifetime parameters on Game and Player, and just use static lifetimes everywhere:

struct Player {
    game: Cell<Option<GameRef>>,
    name: RefCell<String>,
    health: Cell<i32>,
    friends: RefSet<'static, Player>,
}

type PlayerRef = &'static Player;

struct Game {
    players: FixedVec<'static, Player>,
}

type GameRef = &'static Game;

I don't think it gets much cleaner than that in Rust.

Conclusion

As we have seen there is no one correct way to implement this example, or any similar data model, in Rust. Which solution to choose depends on the application requirements. For example, if I were to implement a game server with high performance requirements and pre-allocation is ok, I would probably go with solution 3 or 3+. They should give the best performance and are most convenient to use at the cost of higher average memory usage (but not higher maximum usage). Let me know if I missed some other solutions that could be of interest.

An interesting exploration for solution 3 and 3+ would be to have object pools that grow and shrink on demand. It could be implemented using some form of linked arrays or array trie. Maybe there's a crate for that as well! 😊

Comments

  1. Are you aware of generational arenas?

    ReplyDelete
    Replies
    1. Not really, but I can imagine how they might work. Is there a crate that I can look at?

      Delete
    2. https://github.com/fitzgen/generational-arena

      Delete

Post a Comment

Popular posts from this blog

Rust for the Seasoned Scala Developer

Data Modelling in Rust Continued