Sunday, July 9, 2023

Mistakes with Rust smart pointers: when Deref goes wrong

Prologue

The following is an attempt at highlighting the sources of high astonishment that arise from implementing the std::ops::Deref trait on custom types.

I write this, in part, because I recently used Deref inappropriately while at $day_job, owing to an incomplete understanding of how Deref interacts with the rest of the language. It should go without saying that this was a confusing experience.

Naturally, the aforementioned mishap will serve as a motivating example throughout this dive into Rust semantics.

Newtypes

I cannot share the full context surrounding my misuse of Deref (for both pedagogic and legal reasons), so please understand that my motivating example won't be well motivated itself.

With that disclaimer out of the way, suppose we're building a library that provides an ASCII string type called AsciiString. A fitting representation of these so-called byte strings would simply be a plain ol' Vec<u8>.

However, ASCII characters are limited to the range 0..128, so not every value of type Vec<u8> is a valid ASCII string. In other words, it's not a good idea to alias AsciiString to Vec<u8>; they are not interchangeable.

In Rust, it is considered good practice to enforce invariants at the type level. Thus, the idiomatic course of action would be to use a newtype to wrap a Vec<u8> and then statically guarantee that all wrapped byte vectors represent valid ASCII strings.

use std::ops::Range;

/// Error returned when a byte vector
/// contains bytes outside the ASCII range.
#[derive(Debug)]
struct AsciiError;

/// An ASCII-encoded, growable string.
struct AsciiString(Vec<u8>);

impl AsciiString {
    /// Range of valid ASCII bytes.
    const ASCII_RANGE: Range<u8> = 0..128;

    /// Converts a byte vector to an `AsciiString`.
    ///
    /// # Errors
    /// Returns `Err` if any byte in `vec`
    /// falls outside the range [`Self::ASCII_RANGE`].
    fn from_ascii(vec: Vec<u8>) -> Result<Self, AsciiError> {
        if vec
            .iter()
            .all(|b| Self::ASCII_RANGE.contains(b))
        {
            Ok(Self(vec))
        } else {
            Err(AsciiError)
        }
    }
}
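A quick sanity check of the constructor (a sketch, assuming the unit-struct AsciiError defined above):

assert!(AsciiString::from_ascii(b"hello".to_vec()).is_ok());
assert!(AsciiString::from_ascii(vec![0xFF]).is_err());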

Explicit conversion

UTF-8 is backward compatible with ASCII. This means that we can freely reinterpret any well-constructed AsciiString value as a str value. What we stand to gain from this is compatibility with code written with UTF-8 strings in mind.

use std::str;

impl AsciiString {
    fn as_str(&self) -> &str {
        debug_assert!(str::from_utf8(&self.0).is_ok());

        // SAFETY: self.0 is ASCII-encoded,
        // and ASCII is a subset of UTF-8
        unsafe { str::from_utf8_unchecked(&self.0) }
    }
}

Two statements in, and the unspeakable keyword has already been uttered; I'm feeling pretty bold today, and the temptation is frankly irresistible. Obviously, any bug surrounding the construction of AsciiString values implies memory unsafety outside debug mode.

Hold on a second. Both std::str::from_utf8 and std::str::from_utf8_unchecked take in a &[u8] as an argument, while &self.0 is of type &Vec<u8>; there is definitely some kind of implicit coercion going on here. Is Rust not so strongly typed after all?

Enter Deref

pub trait Deref {
    type Target: ?Sized;

    fn deref(&self) -> &Self::Target;
}

We seem to have inadvertently run into Deref before I even got the chance to introduce it properly. The simple explanation of why &Vec<u8> was "coerced" to &[u8] is that Vec<T> implements Deref<Target = [T]>. In short, the compiler conceptually transformed the expression str::from_utf8(&self.0) into something akin to:

str::from_utf8(<Vec<u8> as Deref>::deref(&self.0))

The mechanism at play here is called Deref Coercion, and it arises from the interaction between two distinct mechanisms described in The Rust Reference: Method Resolution and Type Coercion. Having already dropped so much jargon, I'll refrain from uncovering any gory details for now.
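To see the coercion in isolation, consider this minimal sketch (first_byte is a hypothetical stand-in for str::from_utf8):

use std::ops::Deref;

fn first_byte(bytes: &[u8]) -> Option<&u8> {
    bytes.first()
}

fn main() {
    let vec: Vec<u8> = vec![1, 2, 3];

    // Deref coercion: &Vec<u8> is silently coerced to &[u8].
    assert_eq!(first_byte(&vec), Some(&1));

    // Which is shorthand for the explicit call:
    assert_eq!(first_byte(<Vec<u8> as Deref>::deref(&vec)), Some(&1));
}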

Clean Code™

For all practical purposes, AsciiString::as_str works perfectly well as a sure and simple way to convert any &AsciiString reference to a &str. However, it means one has to manually append .as_str() every time a conversion is needed.

For dubious reasons involving "clean code" we shall implement Deref<Target = str> for AsciiString: the compiler will from now on call .as_str() on our behalf.

use std::ops::Deref;

impl Deref for AsciiString {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        self.as_str()
    }
}
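With this in place, a &AsciiString coerces to a &str wherever one is expected, and str's methods become directly callable (a sketch; takes_str is a hypothetical function):

fn takes_str(s: &str) -> usize {
    s.len()
}

let ascii = AsciiString::from_ascii(b"ABC".to_vec()).unwrap();

// Deref coercion: &AsciiString -> &str.
assert_eq!(takes_str(&ascii), 3);

// Method resolution also looks through Deref,
// so str's methods work without .as_str():
assert!(ascii.is_ascii());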

Idiomatic Code™

For those who care not for aesthetics when it comes to computer programs — a vulgar but arguably rational position — I shall appeal to Rust idioms. Chapter 5 of the Rust API Guidelines contains a recommendation titled C-DEREF which states:

The Deref traits are used implicitly by the compiler in many circumstances, and interact with method resolution. The relevant rules are designed specifically to accommodate smart pointers, and so the traits should be used only for that purpose.

Is AsciiString a smart pointer? What is a smart pointer anyway? That is a debate I'm uninterested in exploring here. However, the authors of the Rust API Guidelines go on to give examples of smart pointers in the Standard Library:

  • Box<T>
  • String as a smart pointer to str
  • Rc<T>
  • Arc<T>
  • Cow<'a, T>

On these grounds, I claim that AsciiString ought to be considered a smart pointer to str. Note that the idea that String is a smart pointer is itself often contested, since String is not generic over the data it points to the way Box<T> and friends are.

To my astonishment

Note that AsciiString inherits the implementation of Index<Range<usize>> from str. We get this as a side effect of implementing Deref and it allows us to borrow string slices from an AsciiString.

let data = vec![b'A', b'B', b'C'];
let ascii = AsciiString::from_ascii(data).unwrap();
assert_eq!(&ascii[1..3], "BC");

Up until now, we have been relatively well-behaved rustaceans. But where is the fun in that? One cannot be writing production-ready code all the time, for that is a sure path to madness.

Let's allow AsciiString to be indexed using floating-point numbers, precisely because we can.

use std::ops::{Index, Range};

impl Index<Range<f64>> for AsciiString {
    type Output = str;

    /// Returns an ASCII string slice by rounding 
    /// the given floating-point indices in `index`.
    fn index(&self, index: Range<f64>) -> &Self::Output {
        /// Returns the integer nearest 
        /// to the absolute value of `float`.
        fn nearest_index(float: f64) -> usize {
            float.abs().round() as usize
        }

        let start = nearest_index(index.start);
        let end = nearest_index(index.end);
        let string = self.as_str();

        &string[start..end]
    }
}

You may be wondering what happens when an index is exactly halfway between two integers. The answer is that f64::round rounds half-way cases away from 0.0, thus 0.5 rounds to 1. At this point, we might as well have had Index toss a coin to resolve the ambiguity.
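A few assertions illustrate the rounding behavior in question:

assert_eq!(0.5_f64.round(), 1.0);
assert_eq!(1.5_f64.round(), 2.0);
assert_eq!((-0.5_f64).round(), -1.0); // away from 0.0, even when negative

With that settled, back to our AsciiString: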

let data = vec![b'A', b'B', b'C'];
let ascii = AsciiString::from_ascii(data).unwrap();
assert_eq!(&ascii[0.5_f64..3.0_f64], "BC");

We have finally arrived at the part that surprised me and motivated everything written before and after this point: we no longer inherit Index<Range<usize>> from str. The earlier indexing example no longer compiles:

error[E0308]: mismatched types
  --> src/main.rs:69:23
   |
69 |     assert_eq!(&ascii[1..3], "BC");
   |                       ^^^^ expected `Range<f64>`, 
   |                            found `Range<{integer}>`
   |
   = note: expected struct `std::ops::Range<f64>`
              found struct `std::ops::Range<{integer}>`
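For what it's worth, the old behavior can still be recovered at each call site by reaching the &str explicitly before indexing (a sketch):

// Convert explicitly, then index with usize:
assert_eq!(&ascii.as_str()[1..3], "BC");

// Or force the dereference by hand:
assert_eq!(&(*ascii)[1..3], "BC");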

To me, this was incredibly confusing because Index is generic over its Idx type, which means it's perfectly valid to have multiple implementations of Index<Idx> for different Idx types.

Somehow, the compiler can no longer find the implementation of Index<Range<usize>> on str. If you're like me, it's not at all obvious how that came to be. To better understand all of this, we dig deeper into exactly how the compiler resolves methods in the presence of a Deref implementation.

Nitty-gritty details

Everything I describe below is limited to commit 6bacf5a of the Rust project, i.e. Rust 1.69.0.

Method lookup in rustc is divided into two phases: probe and confirm. As the Rust Compiler Development Guide explains, probe is concerned with finding the callee method while confirm solves for unknown type variables.

During the probe phase, the compiler first performs a "Deref-loop": essentially repeated dereferencing of the receiver type, i.e. AsciiString. In our particular case, this yields two steps: AsciiString and str.

Then, the compiler assembles a list of candidates from each step. These candidate methods are subsequently searched for one that fits the receiver type. Up until this point, it seems that rustc should have been able to find the index method from the Index<Range<usize>> implementation on str, because the search always includes Deref targets.

The issue, it turns out, is that method search does not check method parameter types against argument types. Rather, it only checks method names and where-clauses such as T: Trait. Types are only resolved in the confirm phase, by which point rustc has already picked its method.
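This name-only search is easy to reproduce in miniature, independent of Index (a sketch with hypothetical types Outer and Inner):

use std::ops::Deref;

struct Inner;

impl Inner {
    fn answer(&self, n: u32) -> u32 {
        n
    }
}

struct Outer(Inner);

impl Outer {
    fn answer(&self, s: &str) -> usize {
        s.len()
    }
}

impl Deref for Outer {
    type Target = Inner;

    fn deref(&self) -> &Inner {
        &self.0
    }
}

fn main() {
    let outer = Outer(Inner);

    // Probe picks Outer::answer by name alone and stops there;
    // confirm then rejects the u32 argument, so the line below
    // fails to compile even though Inner::answer would accept it.
    // let _ = outer.answer(42_u32);

    let _ = outer.answer("this fits");
}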

But why is method lookup in rustc split into two phases? The answer lies in the cacheability of the picks produced by the probe phase. In other words, rustc developers wished to reuse probe results across identical methods, which is why probe cannot depend on inferred types local to one particular call site. I would wager this was done to address performance concerns.

If you can spot any inaccuracies in the above description, please don't shy away from correcting me via email. I would appreciate it, as I am by no means a rustc developer.

Epilogue

Moral of the story: don't implement Deref, kids!

On a more serious note: if one does implement Deref<Target = U> on a type T, and it proves necessary to implement a generic trait Trait<S> on T, then one should ensure that the implementations of Trait<S> for different parameters S are not scattered across T and U.

Put simply, either T or U should host all existing Trait<S> implementations. Note that the preceding "or" is inclusive, meaning that both T and U can host some Trait<S> implementations, as long as they both host all of them.
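In our running example, this means giving AsciiString its own Index<Range<usize>> implementation that delegates to str, so that both Index impls are hosted on the same type (a sketch):

use std::ops::{Index, Range};

impl Index<Range<usize>> for AsciiString {
    type Output = str;

    // Delegate to str so that usize indexing and f64 indexing
    // both live on AsciiString itself.
    fn index(&self, index: Range<usize>) -> &Self::Output {
        &self.as_str()[index]
    }
}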



