STEP File Parsing (In Rust 😉)

10 minute read Published: 2023-02-10

This is a bit different from the stuff I've been posting about so far. By "a bit different" I mean this isn't about Blender, but it is still about 3D data structures involving Rust, so that "bit" really is just a "bit".

At work, we deal with lots of CAD data in STEP format. These STEP files contain all the data needed to define 3D parts. On top of this, they hold definitions for assemblies and, depending on the version of the format you have, can also include information regarding the colour of parts (which is particularly interesting to the company). Painfully though, it still seems like a lot of information that is available in native CAD data, such as part materials, is lost when converted into STEP. Everything that STEP is capable of is laid out in ISO standard 10303, but I'm not about to spend 145 of my hard-earned Swiss Francs on something I can enjoy guessing about, so that's why I'm here.

Chronic Step Pain

When I think of a CAD file, I imagine a super dense binary file with high precision parameters defining the placement of each edge, face, curve and corner. For the most part, this is an accurate assumption. FreeCAD for example stores its data like this, just a big old heap of bits. This is great: humans don't need to and definitely don't want to read CAD data in its raw format. By the same token, humans do want the software they use to interact with their CAD to be able to read the data super quick.

This is where STEP enters the ring. If you open up a STEP file in a text editor, you are met with a nasty surprise. It's all in ASCII. What the hell? Who dreamt this disaster up? Here's an example just in case you didn't believe me or don't have one to hand.

ISO-10303-21;
HEADER;
/* Generated by software containing ST-Developer
 * from STEP Tools, Inc. (www.steptools.com) 
 */

FILE_DESCRIPTION(
/* description */ (''),
/* implementation_level */ '2;1');

FILE_NAME(
/* name */ 'Asm v2.step',
/* time_stamp */ '2023-02-09T22:05:44Z',
/* author */ (''),
/* organization */ (''),
/* preprocessor_version */ 'ST-DEVELOPER v19.2',
/* originating_system */ 'Autodesk Translation Framework v11.17.0.187',

/* authorisation */ '');

FILE_SCHEMA (('AUTOMOTIVE_DESIGN { 1 0 10303 214 3 1 1 }'));
ENDSEC;

DATA;
#10=MECHANICAL_DESIGN_GEOMETRIC_PRESENTATION_REPRESENTATION('',(#21),#3284);
#11=ITEM_DEFINED_TRANSFORMATION($,$,#827,#869);
#12=ITEM_DEFINED_TRANSFORMATION($,$,#827,#870);
#13=(
REPRESENTATION_RELATIONSHIP($,$,#3296,#3295)
REPRESENTATION_RELATIONSHIP_WITH_TRANSFORMATION(#11)
SHAPE_REPRESENTATION_RELATIONSHIP()
);
...

As you can see, there's a bunch of metadata at the top, which is perfectly fine; you've got to put that stuff somewhere, right? But the pain kicks in when you hit the DATA section towards the bottom. This is where all of our shapes, assemblies and colours live, defined in a semicolon-separated list of numbered tags, each with a varying number of parameters. These lines then reference other lines using # symbols, making it quite difficult to find your way around as a human with a text editor. This is where my annoyance hits its peak. If you're going to make a file format that's human readable, at least make it readable to the point where it's useful. The sacrifice of not using binary is completely wasted otherwise.
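To make the cross-referencing a bit more concrete, here's a minimal std-only sketch (separate from the parser further down) that lists the line numbers a single DATA line points at. The name `referenced_ids` is made up for illustration, and it naively treats every # followed by digits as a reference, so a # or = hiding inside a quoted string would trip it up.

```rust
/// Collect every `#<digits>` reference on the right-hand side of a DATA line.
/// Hypothetical helper, for illustration only.
fn referenced_ids(line: &str) -> Vec<u64> {
    // Ignore the defining `#N=` prefix if there is one.
    let body = match line.split_once('=') {
        Some((_, rest)) => rest,
        None => line,
    };

    let mut ids = Vec::new();
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '#' {
            continue;
        }
        // Gather the run of ASCII digits that follows the '#'.
        let mut digits = String::new();
        while let Some(&d) = chars.peek() {
            if d.is_ascii_digit() {
                digits.push(d);
                chars.next();
            } else {
                break;
            }
        }
        // A lone '#' with no digits after it is simply skipped.
        if let Ok(id) = digits.parse::<u64>() {
            ids.push(id);
        }
    }
    ids
}
```

Feeding it line #10 from the snippet above gives back 21 and 3284, which are the two lines you would have to go hunting for by hand.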

In the file I showed a snippet of up there, there's another 3000 lines of "DATA". I don't want to kid you into thinking that STEP is super dense after all. In fact, as you have probably guessed, it's terrible at that too! The STEP file for the model shown below has a file size of 254kb, whereas the FreeCAD file (created using the STEP file) is only 59kb.

As far as I'm aware, the STEP file format is regarded by many as poor; just look at the Wikipedia page for its not-so-gleaming reputation. I'm definitely not a fan.

Doing Bits

I've had my rant now and I feel like a new man, so less complaining from now on. I've written a very basic parser for the STEP file format, which is based on the EXPRESS schema. As I said earlier, I've not purchased the ISO document that defines STEP, which gets me off to a bad start. I also think it's important to mention that this parser is far from complete. I've got it working with a few STEP files, but I'm sure there are all kinds of funky edge cases I've missed.

For parsing, I've used the nom library. It's a bit difficult to get your head around at first, but once you get the hang of it, it's extremely powerful. This is the function that parses a single line of data. It takes in a &str and, if the parsing is a success, returns an Ok IResult containing the rest of the file yet to be parsed as well as a parsed DataLine object.

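// nom imports for the snippets in this post (module paths as in nom 7)
use nom::{
    branch::alt,
    bytes::complete::{tag, take, take_until},
    character::complete::{char, digit1},
    combinator::{map_res, opt},
    multi::{many0, separated_list0},
    sequence::delimited,
    IResult,
};

#[derive(Debug)]
pub struct DataLine<'a> {
    pub number: u32,
    pub chunks: Vec<Chunk<'a>>,
}

pub fn data_line(input: &str) -> IResult<&str, DataLine> {
    // Take only a single line, up to the semicolon
    // (note: this also stops at a ; that sits inside a quoted string)
    let (remaining, line) = take_until(";")(input)?;
    // Skip over the semicolon itself
    let (next_line, _) = take(1usize)(remaining)?;

    // Take the # off the front, ignoring any whitespace (such as the
    // newline left over from the previous line) that comes before it
    let (line, _) = char('#')(line.trim_start())?;

    // Get the digits in between # and =, returning a parse error
    // rather than panicking if they don't fit into a u32
    let (line, number) = map_res(digit1, |digits: &str| digits.parse::<u32>())(line)?;

    // Eat the =
    let (line, _) = char('=')(line.trim_start())?;

    // Turn the remaining text into chunks
    let (_, chunks) = to_chunks(line)?;

    let data_line: DataLine = DataLine { number, chunks };

    Ok((next_line, data_line))
}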

You'll notice that on each line where we use a combinator, we unwrap the result with a ?. This means that if the combinator fails, the error gets bubbled up, and depending on how the original function was called, nom can handle the result differently. For example, the combinator that calls our data_line function is a many1, which keeps running a combinator on the returned remaining data until it returns an error, at which point it returns all the data it managed to parse as a Vec (or an error of its own if it didn't manage to parse anything at all). This is perfect for us as we want to keep parsing lines until we hit something we can't parse as a data line, such as an ENDSEC; which breaks up sections in STEP.
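If the many1 behaviour is hard to picture, this is roughly what it boils down to, written by hand without nom. Everything here (`ParseResult`, `many1_sketch`, `numbered_line`) is invented for this sketch and is not nom's actual implementation; in particular, nom reports an error if the inner parser succeeds without consuming anything, whereas this version just stops.

```rust
/// On success a parser returns what is left of the input plus a value.
type ParseResult<'a, T> = Result<(&'a str, T), String>;

/// Hand-rolled stand-in for nom's `many1`: run `parser` until it fails,
/// and fail overall if it never succeeded.
fn many1_sketch<'a, T>(
    mut input: &'a str,
    parser: impl Fn(&'a str) -> ParseResult<'a, T>,
) -> ParseResult<'a, Vec<T>> {
    let mut items = Vec::new();
    while let Ok((rest, item)) = parser(input) {
        // Guard against looping forever on a parser that consumes nothing.
        if rest.len() == input.len() {
            break;
        }
        items.push(item);
        input = rest;
    }
    if items.is_empty() {
        Err(format!("expected at least one item at: {:?}", input))
    } else {
        Ok((input, items))
    }
}

/// Toy line parser: reads `#<number>;` and returns the number.
fn numbered_line(input: &str) -> ParseResult<'_, u32> {
    let rest = input.trim_start().strip_prefix('#').ok_or("missing #")?;
    let (digits, rest) = rest.split_once(';').ok_or("missing ;")?;
    let number = digits.parse::<u32>().map_err(|e| e.to_string())?;
    Ok((rest, number))
}
```

Running `many1_sketch` with `numbered_line` over a few numbered lines followed by ENDSEC; collects the numbers and hands back the ENDSEC; part untouched, which is the same shape of result we rely on from many1.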

With this so far we have parsed the #13= and the trailing ; out of this:

#13=(
REPRESENTATION_RELATIONSHIP($,$,#3296,#3295)
REPRESENTATION_RELATIONSHIP_WITH_TRANSFORMATION(#11)
SHAPE_REPRESENTATION_RELATIONSHIP()
);

Leaving just this:

(
REPRESENTATION_RELATIONSHIP($,$,#3296,#3295)
REPRESENTATION_RELATIONSHIP_WITH_TRANSFORMATION(#11)
SHAPE_REPRESENTATION_RELATIONSHIP()
)

to get sent to the to_chunks combinator.
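The same split can be sketched with nothing but the standard library, which is handy for checking what each half should look like. `split_data_line` is a hypothetical helper that mirrors the first half of data_line; it assumes it is handed exactly one statement, semicolon included.

```rust
/// Split one `#N=...;` statement into its line number and body.
/// Hypothetical std-only helper, not part of the nom parser.
fn split_data_line(statement: &str) -> Option<(u32, &str)> {
    // Drop the trailing semicolon and the leading '#'.
    let statement = statement.trim().strip_suffix(';')?;
    let statement = statement.strip_prefix('#')?;

    // Everything before the first '=' is the line number.
    let (number, body) = statement.split_once('=')?;
    let number = number.trim().parse::<u32>().ok()?;

    Some((number, body.trim()))
}
```

For the #13 example this returns 13 along with the bracketed body, and it returns None for something like ENDSEC; that isn't a data line at all.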

Because I don't know the official names of anything in here, I decided to call things like REPRESENTATION_RELATIONSHIP($,$,#3296,#3295) "chunks". to_chunks then uses the alt combinator to get the chunks that make up a line (sometimes a single line holds a few, such as the one in the example above).

#[derive(Debug)]
pub struct Chunk<'a> {
    pub tag: &'a str,
    pub elements: Vec<Element<'a>>,
}

pub fn to_chunks(input: &str) -> IResult<&str, Vec<Chunk>> {
    alt((delimited_chunks, single_chunk_as_vec))(input)
}

pub fn single_chunk_as_vec(input: &str) -> IResult<&str, Vec<Chunk>> {
    let (remaining, chunk) = single_chunk(input)?;
    Ok((remaining, vec![chunk]))
}

pub fn single_chunk(input: &str) -> IResult<&str, Chunk> {
    // Find the tag
    let (line, tag_value) = take_until("(")(input.trim())?;

    // Optionally grab the list of elements
    let (line, elements) = opt(delimited_elements)(line)?;

    Ok((
        line,
        Chunk {
            tag: tag_value,
            // If there were no elements, just return vec![]
            elements: elements.unwrap_or_default(),
        },
    ))
}

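pub fn delimited_chunks(input: &str) -> IResult<&str, Vec<Chunk>> {
    let (remaining, _) = char('(')(input)?;
    let (remaining, chunks) = many0(single_chunk)(remaining)?;
    // single_chunk trims before each chunk, but whitespace (such as a newline)
    // can still sit between the last chunk and the closing bracket
    let (remaining, _) = char(')')(remaining.trim_start())?;
    Ok((remaining, chunks))
}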

Everything here ultimately relies on the single_chunk combinator.

At this point we have parsed REPRESENTATION_RELATIONSHIP($,$,#3296,#3295) to give us REPRESENTATION_RELATIONSHIP as the tag and ($,$,#3296,#3295) to pass into our delimited_elements combinator.

The delimited_elements combinator makes use of the delimited and the separated_list0 combinators to call the element function on each comma separated value, in between a pair of brackets.

// Debug is needed here because Chunk derives Debug and holds Elements
#[derive(Debug)]
pub enum Element<'a> {
    Reference(u64),
    Dollar,
    Asterisk,
    Elements(Vec<Element<'a>>),
    String(&'a str),
    DotString(&'a str),
    Float(f64),
    Integer(i64),
    Chunk(Chunk<'a>),
}

pub fn delimited_elements(input: &str) -> IResult<&str, Vec<Element>> {
    delimited(char('('), separated_list0(tag(","), element), char(')'))(input)
}

pub fn element(input: &str) -> IResult<&str, Element> {
    alt((
        reference,
        dollar,
        asterisk,
        nested_elements,
        string,
        dot_string,
        float,
        integer,
        chunk,
    ))(input.trim())
}
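The fiddly part of delimited_elements is that commas only count as separators at the top level, not inside nested brackets or strings. Here is a small std-only sketch of that idea; `split_top_level` is a made-up helper and it hands back the raw text of each element rather than a parsed Element.

```rust
/// Split the inside of `( ... )` on commas that are not nested in further
/// brackets or inside a quoted string. Hypothetical helper, std only.
fn split_top_level(input: &str) -> Option<Vec<&str>> {
    // Require the surrounding pair of brackets, like `delimited` does.
    let inner = input.trim().strip_prefix('(')?.strip_suffix(')')?;

    let mut parts = Vec::new();
    let mut depth = 0usize;
    let mut in_string = false;
    let mut start = 0;

    for (i, c) in inner.char_indices() {
        match c {
            // Entering or leaving a quoted string.
            '\'' => in_string = !in_string,
            '(' if !in_string => depth += 1,
            // A stray closing bracket makes the whole thing invalid.
            ')' if !in_string => depth = depth.checked_sub(1)?,
            ',' if !in_string && depth == 0 => {
                parts.push(inner[start..i].trim());
                start = i + 1;
            }
            _ => {}
        }
    }

    // Unbalanced brackets or an unterminated string.
    if depth != 0 || in_string {
        return None;
    }
    // An empty pair of brackets holds no elements at all.
    if !inner.trim().is_empty() {
        parts.push(inner[start..].trim());
    }
    Some(parts)
}
```

Given ('',(#21),#3284) it produces three pieces, with (#21) kept whole so that it can be split again as a nested list.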

The element function makes use of alt to try each possible element type in order and then jam the first one that matches into an Element enum. I won't go through every element parser I wrote, but here are a few for you to get an idea of what they do. If you want the whole thing, the source code is available here.

pub fn reference(input: &str) -> IResult<&str, Element> {
    let (line, _) = char('#')(input)?;
    let (remaining, number) = digit1(line)?;

    Ok((
        remaining,
        Element::Reference(number.parse::<u64>().unwrap()),
    ))
}

pub fn dollar(input: &str) -> IResult<&str, Element> {
    let (remaining, _) = char('$')(input)?;
    Ok((remaining, Element::Dollar))
}

pub fn string(input: &str) -> IResult<&str, Element> {
    let (remaining, string_value) = delimited(tag("'"), take_until("'"), tag("'"))(input)?;
    Ok((remaining, Element::String(string_value)))
}
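One edge case the string parser above will trip over is an apostrophe inside a string. As far as I can tell, STEP escapes a literal apostrophe by doubling it, so take_until("'") would stop too early. Below is a hedged, std-only sketch of how that could be handled; `quoted_string` is a hypothetical name, and because it has to unescape the text it returns an owned String instead of borrowing a &str like Element::String does.

```rust
/// Parse a quoted string where `''` stands for a literal apostrophe.
/// Returns the remaining input and the unescaped contents.
/// Hypothetical sketch; other escape sequences are not handled.
fn quoted_string(input: &str) -> Option<(&str, String)> {
    let body = input.strip_prefix('\'')?;
    let mut value = String::new();
    let mut chars = body.char_indices().peekable();

    while let Some((i, c)) = chars.next() {
        if c != '\'' {
            value.push(c);
            continue;
        }
        // A second apostrophe straight after means an escaped one.
        if let Some(&(_, '\'')) = chars.peek() {
            value.push('\'');
            chars.next();
        } else {
            // Otherwise this apostrophe closes the string.
            return Some((&body[i + 1..], value));
        }
    }
    // Ran out of input before finding the closing apostrophe.
    None
}
```

The empty string '' still works, because the first apostrophe is stripped as the opening quote before the loop ever looks for a doubled one.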

And just like that, we're done. Well, we're not done because I'm sure there are plenty of examples of STEP files that won't work with it, but it does work for my STEP files, so I'm happy for now. The first improvement to make would probably be to stick some more .trim()s around the place just to take out some whitespace. It would also be nice to have a unit test for each of the combinators, which would be quite easy and satisfying to do.

Serde, JSON and Binary

Because it's easy to, I decided to serialise the resulting data structure into JSON, which comes out looking like this:

[
    {
        "number": 10,
        "chunks": [
            {
                "tag": "MECHANICAL_DESIGN_GEOMETRIC_PRESENTATION_REPRESENTATION",
                "elements": [
                    {
                        "String": ""
                    },
                    {
                        "Elements": [
                            {
                                "Reference": 21
                            }
                        ]
                    },
                    {
                        "Reference": 3284
                    }
                ]
            }
        ]
    },
    ...

Just as bad as STEP, now usable in every programming language you can think of... and 3x the file size 😝. It is however probably a little nicer to work with now that you have all the tools you'd normally have for looking at JSON at your disposal, such as JSONPath. Because of the nature of the data structure, this JSON is marshalled, in a similar way to how DynamoDB records are, which is a bit awkward. There's probably a way of dealing with this marshalling using serde, I just don't care enough to investigate it right now.
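For what it's worth, this is the kind of un-marshalled output I have in mind, sketched by hand with no serde involved (serde's `#[serde(untagged)]` attribute looks like the likely route for the real thing, but I haven't tried it). The enum is a cut-down copy of Element, `to_plain_json` is a made-up name, and the {"ref": N} shape for references is an arbitrary choice so they don't get confused with ordinary numbers.

```rust
/// Cut-down copy of the post's `Element` enum, just enough for the sketch.
enum Element<'a> {
    Reference(u64),
    Dollar,
    Elements(Vec<Element<'a>>),
    String(&'a str),
    Float(f64),
}

/// Render an element as plain JSON without the variant-name wrapper.
/// Hypothetical sketch: string escaping only covers quotes and
/// backslashes, and non-finite floats would not be valid JSON.
fn to_plain_json(element: &Element) -> String {
    match element {
        Element::Reference(id) => format!("{{\"ref\":{}}}", id),
        Element::Dollar => "null".to_string(),
        Element::Elements(items) => {
            let inner: Vec<String> = items.iter().map(|item| to_plain_json(item)).collect();
            format!("[{}]", inner.join(","))
        }
        Element::String(s) => {
            format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
        }
        Element::Float(f) => f.to_string(),
    }
}
```

With this, the elements of line #10 would come out as a flat array holding an empty string, a nested array with one reference, and another reference, rather than a pile of single-key objects.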

Also, because it's easy to do, I decided to encode and compress the resulting data structure using bincode and flate2, and ended up with a file only 44kb big 🥳. I'm not counting this as a win over FreeCAD though, I'm sure that's much more efficient and stores way more data than what I'm doing. Still, pretty cool compared to the STEP file's 254kb.


I don't know what compelled me to recommend a song in my last post, but I think it's fun, so I'm going to carry on with it.

This is Caroline Polachek's remix of Oh Yeah by A.G. Cook. Probably not for everyone this, but I think it's a really fun song. The original version feels a bit dead after listening to the remix, but you can pretend it doesn't exist. Enjoy!