Strings in Rust
During the last 20 years I have used a number of garbage collected and reference counted programming languages. All of them have a single type for representing strings. Rust has two types of strings that can be stored in three different ways.
I want to briefly illustrate how Rust's strings interact with the heap, with the stack, and with the data segment of your binary, and briefly explain what those things are.
Java, Swift, TypeScript, and Go all have string types that make it irrelevant if they're stored on the stack or on the heap. A Java string is always heap allocated, while a Go string may be stack or heap allocated. The point is that you don't need to know: the way you use the type doesn't change.
Rust doesn't work quite that way. In general, the programmer needs to choose between types that store data either on the stack or on the heap. The choice is generally speaking a trade-off between speed and versatility.
Please note that this post does not touch upon the topics of UTF-8, UTF-8 validity, and so on. Neither does it talk about lifetimes, except for 'static, which is not explained further. This post basically glosses over everything that is not needed to understand String or str.
We need to take a small detour.
Stack and heap
This is a conceptual view of the stack and heap parts of program memory. It is probably virtual memory if you're on a laptop, or physical memory if you're on small embedded hardware without an operating system.
A view of program memory in a made up 16 bit computer
low memory addresses  ======================================= addr[0x00F0]
                      |                HEAP                 |
                      | typically grows towards higher addr |
                      |                  |                  |
                      |                  ˇ                  |
                      |/////////////////////////////////////|
                      |                  ^                  |
                      |                  |                  |
                      | typically grows towards lower addr  |
                      |                STACK                |
high memory addresses ======================================= addr[0xFFFF]
The stack and the heap are laid out so that they grow towards each other. In my examples I use a made up 16 bit memory architecture. That means that pointers are 16 bits and registers are 16 bits as well.
PROGRAM MEMORY
use std::str; =====================================
fn main() { | HEAP |
// A stack allocated i16 | |
+-- let x: i16 = 5; | |
| | addr[0x00F0] <-----+ |
| // A heap allocated i16 | type: i16 | |
| let y: Box<i16> = Box::new(5); | value: 5 | |
| | | | |
| | |////////////////////|//////////////|
| | | addr[0xFFE0] | |
| | | type: Box<i16> | |
| +-------------------------------> value: 0x00F0 -----+ |
| | |
| | addr[0xFFF0] |
| | type: i16 |
+---------------------------------------> value: 5 |
| STACK |
} =====================================
The program above illustrates the difference between putting something on the stack and putting it on the heap. A stack value is immediately available to the function, since in some sense the current stack frame is the function. When something is put on the heap, we say that it's boxed. In this case the Box is essentially just a stack allocated struct that internally holds a pointer to the heap, where the actual data is stored. This is illustrated by the -> arrows above.
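Here is a minimal sketch of the same program that you can actually run, assuming a regular 32 or 64 bit machine instead of my made up 16 bit one. The helper name sum is made up for this illustration, and the printed addresses will differ from run to run.

```rust
use std::mem::size_of;

// Hypothetical helper: adds a plain stack value to a boxed (heap) value.
fn sum(x: i16, y: Box<i16>) -> i16 {
    // `*y` follows the Box's pointer to the heap and reads the i16 stored there.
    x + *y
}

fn main() {
    let x: i16 = 5; // lives in main's stack frame
    let y: Box<i16> = Box::new(5); // the i16 lives on the heap, `y` holds the pointer

    // The value is 2 bytes, while the Box handle is the size of a pointer.
    println!("size of i16:      {} bytes", size_of::<i16>());
    println!("size of Box<i16>: {} bytes", size_of::<Box<i16>>());

    // Two quite different addresses: one in the stack, one in the heap.
    println!("address of x:      {:p}", &x);
    println!("address held by y: {:p}", y);

    println!("sum: {}", sum(x, y));
}
```

On a real machine the Box is 4 or 8 bytes rather than the 2 bytes in the figure, but the shape is the same: a small handle on the stack that points at data on the heap.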
Three types of string storage
A string in Rust can be stored in one of three ways:
- On the heap
- On the stack
- In the data segment of the application binary
We need one more detour.
Application binary segments
The segments in a binary are often illustrated like the figure below. Please note that this is after loading the binary file from disk into main memory: the stack and heap segments aren't stored in the binary on disk.
The text segment contains the compiled machine code, and the data segment contains application data found by the compiler while compiling. (Strictly speaking, string literals typically end up in a read-only part of it, often called .rodata, but this post simply says data.)
                                       SEGMENTS
low mem      =============================================================
             |                           .text                           |
             |                   contains machine code                   |
             =============================================================
             |                           .data                           |
             |   contains data known to the app binary at compile time   |
             =============================================================
addr[0x00F0] |                           HEAP                            |
             |  contains dynamically allocated data created at run time  |
             | * data that lives "much" longer than one function         |
             | * data that is too big for the stack                      |
             | * data that must live behind a pointer, i.e. unknown size |
             |///////////////////////////////////////////////////////////|
             |                           STACK                           |
             |    contains memory allocated by functions at run time     |
             | * data with a known size                                  |
addr[0xFFFF] | * data that mostly does not outlive the function itself   |
high mem     =============================================================
Take a function containing let x: &str = "abc" as an example. The string "abc" is found by the compiler and put in the data segment. When our function is run, the x variable points directly into the data segment, at the address where "abc" starts. Since the &str also stores a length, the program will only read three bytes when run.
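A small sketch of what that looks like in practice. The function name greeting is made up for this example. The address printed is wherever your platform happened to load the literal.

```rust
// Hypothetical helper: a string literal is baked into the binary, so a
// reference to it stays valid for the whole run of the program ('static).
fn greeting() -> &'static str {
    "abc"
}

fn main() {
    let x: &'static str = greeting();

    // The pointer half of the &str: the address where "abc" starts.
    println!("data starts at {:p}", x.as_ptr());

    // The length half of the &str: how many bytes to read from there.
    println!("length is {} bytes", x.len());
}
```

Note that greeting returns a reference without borrowing from anything: nothing needs to be kept alive, since the bytes are part of the loaded binary.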
Two types of strings
String
The first of the two types is String, which is the heap allocated, growable string type. Growable means that, unlike a Java string, it can be changed in place as long as there is enough room left in its buffer. If there isn't enough room, it will expand its buffer automatically.
A String may be owned, something like let x: String = String::from("abc"), or referenced, as in fn takes_string_ref(x: &mut String).
String is implemented as a wrapper around Vec<u8>.
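To make the growing part a bit more concrete, here is a rough sketch. The helper append_and_report is my own invention, and the exact capacities you see depend on the allocator and the standard library version, so treat the numbers as an illustration only.

```rust
// Hypothetical helper: appends `suffix` and reports whether the buffer
// capacity changed, i.e. whether the String had to grow.
fn append_and_report(s: &mut String, suffix: &str) -> bool {
    let before = s.capacity();
    s.push_str(suffix);
    s.capacity() > before
}

fn main() {
    let mut x = String::from("abc");
    println!("len {}, capacity {}", x.len(), x.capacity());

    // Appending more than the remaining room forces a bigger buffer.
    let grew = append_and_report(&mut x, "defghijklmnopqrstuvwxyz");
    println!("len {}, capacity {}, grew: {}", x.len(), x.capacity(), grew);

    // Underneath it is just bytes: into_bytes hands over the inner Vec<u8>.
    let bytes: Vec<u8> = x.into_bytes();
    println!("first three bytes: {:?}", &bytes[..3]);
}
```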
str
A str is simple, but also very very complicated. You can either accept the standard explanation without further questioning it, or you can read my take below 👇.
A str is a Dynamically Sized Type. Furthermore, it is a primitive type, and unlike String you won't find a struct definition for it in the standard library, since it's a compiler internal type.
str is called a "string slice". Unlike most other types you can NOT get an instance of a raw str. It is most often seen with its buddy Mr. Ampersand, as in: &str. Other possibilities are Rc<str> and Box<str>. The use case for Box<str> is that it doesn't contain the capacity field of String, so it takes up less memory.
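Here is a sketch of the different ways to hold a str, and what the handles cost. The helper name freeze is made up. The sizes are what I would expect on current compilers, but the exact layout of String is not something the language promises.

```rust
use std::mem::size_of;
use std::rc::Rc;

// Hypothetical helper: turns an owned String into a Box<str>,
// giving up the ability to grow and with it the capacity field.
fn freeze(s: String) -> Box<str> {
    s.into_boxed_str()
}

fn main() {
    let boxed: Box<str> = freeze(String::from("abc"));
    let shared: Rc<str> = Rc::from("abc");
    let borrowed: &str = &boxed;
    println!("{} {} {}", boxed, shared, borrowed);

    // Pointer + length for the str handles, plus a capacity for String.
    println!("&str:     {} bytes", size_of::<&str>());
    println!("Box<str>: {} bytes", size_of::<Box<str>>());
    println!("String:   {} bytes", size_of::<String>());
}
```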
Just like a &i64 is a reference to an actual i64, an &str is a reference.
But a reference to what? An i64 can be stack allocated, so the &i64 is a pointer to some other place in the stack. A Box<i64> is heap allocated, and you can get a &i64 reference to that one, too.
But the &i64 isn't just a pointer, it has an implicit size too. Since the type is 64 bits, the compiler knows how much data to read when following the pointer. But what size is an &str?
Since the Rust 2018 edition, trait object types, which have an unknown compile-time size, are written with the dyn keyword, as in &dyn MyTrait (leaving it out became a hard error in the 2021 edition). Calls through such a reference go via a vtable (a table of function pointers) that the compiler generates and that the running program uses to find the functions of the actual underlying struct. dyn MyTrait is a Dynamically Sized Type too, and it also has to be pointed to by a & reference, or be boxed somehow: e.g. Box<dyn MyTrait>.
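A minimal sketch of a trait object in action. The describe method and the Cat and Dog structs are made up for this illustration, only the name MyTrait comes from the text above.

```rust
use std::mem::size_of;

// `describe`, `Cat` and `Dog` are hypothetical names for this example.
trait MyTrait {
    fn describe(&self) -> String;
}

struct Cat;
struct Dog;

impl MyTrait for Cat {
    fn describe(&self) -> String {
        String::from("cat")
    }
}

impl MyTrait for Dog {
    fn describe(&self) -> String {
        String::from("dog")
    }
}

// The concrete type behind `t` is only known at run time,
// so the call is looked up through the vtable.
fn call(t: &dyn MyTrait) -> String {
    t.describe()
}

fn main() {
    let boxed: Box<dyn MyTrait> = Box::new(Dog);
    println!("{} {}", call(&Cat), call(&*boxed));

    // A plain reference is one pointer, a trait object reference is two:
    // one to the data and one to the vtable.
    println!("&Cat:         {} bytes", size_of::<&Cat>());
    println!("&dyn MyTrait: {} bytes", size_of::<&dyn MyTrait>());
}
```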
But what about &str? A pointer to a string isn't enough, the computer must know how many bytes of data to read. Fortunately, it does contain the length too, just as a &[u8] reference knows how many bytes to read behind the pointer.
I think there are nice similarities between how the lack of a known compile time size of a str forces the runtime code to store the runtime length together with the pointer to the actual data, and how references to trait objects need to store a pointer to a vtable to work properly. They're both Dynamically Sized Types too.
So a &str is basically a two-field value, { pointer: *const u8, len: usize }.
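To convince myself of this, here is a sketch that takes a &str apart into those two fields and then rebuilds it. FatPtr, take_apart and roundtrip are names I made up. This is a stand-in for illustration, not the compiler's real definition.

```rust
use std::{slice, str};

// Hypothetical stand-in for what a &str carries around.
struct FatPtr {
    pointer: *const u8,
    len: usize,
}

fn take_apart(s: &str) -> FatPtr {
    FatPtr {
        pointer: s.as_ptr(),
        len: s.len(),
    }
}

// Splits a &str into pointer + length and glues it back together.
fn roundtrip(s: &str) -> &str {
    let fat = take_apart(s);
    // SAFETY: pointer and len come straight from `s`, which is valid UTF-8
    // and outlives the returned reference.
    unsafe {
        let bytes = slice::from_raw_parts(fat.pointer, fat.len);
        str::from_utf8_unchecked(bytes)
    }
}

fn main() {
    println!("{}", roundtrip("abc"));
    // The stand-in has the same size as the real thing.
    println!("FatPtr: {} bytes", std::mem::size_of::<FatPtr>());
    println!("&str:   {} bytes", std::mem::size_of::<&str>());
}
```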
Maybe it would have been less confusing if &str was presented another way? What about &str[u8; ?]. No that's terrible, never write that again.
The way str is presented by the standard documentation leads me to believe that the & in &str is the actual pointer, and that the str part is just a placeholder for len: usize and an implicit data type u8. But that's maybe wrong, probably?
My personal take is that str could have been a standard library type, or a struct instead, and used without it being a reference "&". That way the pointer field could have been seen in code, and all would have been well. But since Rust is Rust, and & means shared reference, all the standard rules around lifetimes and sharing kick in. That results in an overall nicer experience. However, I find the lack of a deeper explanation or what-if explorations unsatisfying.
Three types of string storage, again
Heap
The standard library String is always heap allocated, but it can interact with &str in two ways:
1. Anything that takes a &str can take a reference to a String and it will just work.
2. We can get a &str sub-slice of a String by doing &my_string[1..3], which for the String "abcd" would be "bc".
Neither (1) nor (2) above needs to allocate any extra memory except for the size of the pointer and the length.
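Both interactions can be sketched like this. The helper count_bytes is a made up example of a function that only reads string data.

```rust
// Hypothetical helper that only needs to read string data.
fn count_bytes(s: &str) -> usize {
    s.len()
}

fn main() {
    let my_string = String::from("abcd");

    // (1) A &String is automatically turned into a &str at the call site.
    println!("{} bytes", count_bytes(&my_string));

    // (2) A sub-slice borrows a part of the same heap buffer.
    // The range 1..3 covers the bytes at index 1 and 2.
    let middle: &str = &my_string[1..3];
    println!("{}", middle);

    // No copy was made: the slice starts one byte into the String's buffer.
    println!("{:p} {:p}", my_string.as_ptr(), middle.as_ptr());
}
```

The end of a range is exclusive in Rust, which is why 1..3 gives two bytes and not three.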
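Here we create an empty String, and our handle to it, called x, is on the stack. (Strictly speaking, String::new() doesn't allocate any heap memory until something is pushed into the string, so the heap address in the figure is only illustrative.)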
PROGRAM MEMORY
use std::str; ========================================
fn main() { | HEAP |
// An empty String | addr[0x00F0] <----+ |
let x: String = String::new(); | type: *const u8 | |
| | value: [empty] | |
| | | |
| |///////////////////|//////////////////|
| | | |
| | addr[0xFFF0] | |
| | type: String | |
+------------------------------- > value: 0x00F0 + len: 0 + capacity: 0 |
| STACK |
} ========================================
There's a bit of lying going on above, since String itself doesn't have the pointer: it's a wrapper around Vec<u8>, and the Vec has the actual pointer.
Data segment
As previously mentioned, all string literals, e.g. let x = "hello", will have the type &'static str, and they are stored in the data segment of the application binary.
PROGRAM SEGMENTS
======================================= low mem
fn main() { | .text |
// A str "string" with the value "abc" =======================================
// which is stored in the data segment | .data |
let x: &'static str = "abc"; ----+ | addr[0x0008] <----------+ |
| | | type: *const u8 | |
| +------> value: "abc" | |
| ==========================|============
| | HEAP | |
| | | | addr[0x00F0]
| | | |
| |/////////////////////////|///////////|
| | | |
| | | |
| | | |
| | addr[0xFFF0] | |
| // x is stored on the stack | type: &'static str | | addr[0xFFFF]
+-----------------------------------> value: 0x0008 + len: 3 -+ |
// and contains a ptr and a len | STACK |
} ======================================= high mem
Stack
It's possible to store string data on the stack. One way would be to create an array of u8 and then get a &str slice pointing into that array.
This is stolen from the str documentation:
PROGRAM MEMORY
use std::str; =================================
fn main() { | HEAP (unused in this example) |
|///////////////////////////////|
let sparkle_heart: [u8; 4] = [240, 159, 146, 150]; | addr[0xFFE0] <----------+ |
| | type: *const u8 | |
+---------------------> value: [bytes...] | |
| | |
+-- let sparkle_heart = str::from_utf8(&sparkle_heart) | addr[0xFFF0] | |
| .unwrap(); | type: &str | |
|-------------------------------------------------------> value: 0xFFE0 + len: 4 -+ |
| STACK |
} =================================
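The same idea as a small program you can run. The helper as_text is my own addition, it just wraps str::from_utf8 so that invalid bytes give None instead of a panic.

```rust
use std::str;

// Hypothetical helper: views a byte slice as a &str if it is valid UTF-8.
fn as_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // Four bytes in main's stack frame.
    let sparkle_heart: [u8; 4] = [240, 159, 146, 150];

    // A &str that points straight at those four bytes, no heap involved.
    let text = as_text(&sparkle_heart).expect("the bytes are valid UTF-8");
    println!("{} ({} bytes)", text, text.len());

    // Both addresses are the same: the &str points into the array.
    println!("{:p} {:p}", sparkle_heart.as_ptr(), text.as_ptr());
}
```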
Stack strings and Hybrid strings
For stack allocated strings, or for hybrid stack/heap implementations, there are a number of crates available. Use your favorite search engine to find them.
A helpful tip
The reference variety of String, &String, should be avoided in favor of &str, unless there is a need for a "String out parameter". A "String out parameter", or &mut String, can be used when a currently owned String needs to be updated by a receiving function, without having to move it into, and then out of, that function.
In short:
- Use String for strings you need to change, or where a String is a required parameter.
- Use &str as function parameters when you need to read string data, since all types of strings can be cheaply turned into &str.
- Use &mut String only when you need a "String out parameter".
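The tips above might look like this in code. Both function names, shout and append_greeting, are made up for the example.

```rust
// Reads only, so it takes &str. Callers can pass literals, slices,
// or a reference to a String.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

// A "String out parameter": the caller keeps ownership,
// and this function updates the String in place.
fn append_greeting(out: &mut String, name: &str) {
    out.push_str("hello, ");
    out.push_str(name);
}

fn main() {
    let owned = String::from("abc");
    println!("{}", shout(&owned)); // &String turns into &str
    println!("{}", shout("abc")); // a literal already is a &str

    let mut message = String::new();
    append_greeting(&mut message, "rust");
    println!("{}", message);
}
```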
Final words
We learned:
- String is for growable strings on the heap.
- [u8; N] (byte array) is for strings on the stack.
- String literals like "abc" are strings in the data segment.
- &str is for peeking into string slices allocated on the heap, in the data segment, or on the stack.
- How &str points into the different types of string storage.