technology

Using python type checking to improve security

Share

Here at Genie we care about our customer's data, and we strive to stay at the forefront of the newest technology. So it follows that we’re always looking to use new features to keep data safe - like ensuring our data is secure through a novel use of Python’s type checker, a method inspired by similar approaches in Haskell.

In this post, we’ll answer all your questions about what exactly a type checker is, how it’s used and how it can be used creatively to keep data secure.

A comparatively new addition to Python is the type annotation, which allows for adding type hints to variables, function parameters and function's return values. It conceptually changes the nature of Python and alters what it's like to use Python as a programming language. Put simply, though, the type annotation is used to document the "type" of a variable.

Type annotations allow for adding type hints to variables, parameters and functions:


Some static programming languages, like C, C++, or Java, need anotations like this for the code to work. For these languages,  a compiler is used to convert source code into machine code - a set of simple instructions that the computer's hardware can run (with some extra steps in the case of Java) - and during this process, type information is used to work out which functions, methods, or CPU instructions must be used with a particular piece of data [1].

Here is an example of the type system in C:


In contrast, Python is a dynamic language and can run without any information about the types of variables or functions - which at times is a definite benefit. Instead Python allows the programmer to add type information which can be used by third party tools, or just for the benefit of people reading the code.

Rather than requiring type information and compiling to machine code, dynamic languages like Python and Javascript use something called dynamic-binding. Here, every piece of data includes information about which functions it can be used with; and while the program is running, this information is discovered. This means that the first purpose of types, for a compiler to know which function to use, is redundant. But a more general purpose, ensuring that data "supports the right operations," is useful [2].

Type checkers

In a nutshell, Python type annotations provide information about the data that goes into and comes out of a function (and is used within it). A separate tool, called a type checker, can then be used to check that the input of functions support the "right operations" and behave in the correct way. For example, a function might require something that can be iterated, can be indexed, has a .foo() method or can be accepted as an argument to json.dump.

mypy is the unofficial "official" type checker - in that Python's former "benevolent dictator for life", Guido van Rossum, develops it. There are alternative type checkers as well, like pyre, written by Facebook.

This sort of approach exists in other dynamic languages too - like Javascript with Erlang and Typescript - though Typescript is a language that compiles both to Javascript as well as to type systems.

Type checkers as theorem provers

Moving from static to dynamic languages makes clear a more general role for type checkers: ensuring that certain facts are true about functions and the data fed into them. This is made even more clear by the existence of protocols in mypy that exist almost entirely to "state facts" about the values passed into functions.

In other words, type checkers are a means of proving certain things about your code before running it. This differs from unit testing in that it allows you to use the type checker to ensure things you want to be true about your program are always true, rather than just true for some specific examples, an idea borrowed from mathematical formal proofs.

A formal proof is a sequence of steps following certain rules to "demonstrate" that a fact is true. You can think of a type checker as automatically generating and checking this proof for you, an idea formalised in the Curry Howard Correspondence.

At this point, it is worth mentioning that while type systems allow you to prove things about your program, there are facts that are, in general, impossible for a program to prove. Turing's halting problem is a common example of such an uncheckable-by-machine fact, but there are numerous other examples.

The limits of type systems for proving things about your code

Type systems represent an oddly constrained way of expressing facts about your program, but they allow you to express facts about your program nevertheless. Consider the example below, that attaches a type to a variable, as well as to the return value and input parameter of a function. A type checker will ensure that "types" line up:


But, as a way of stating facts about your program, using types and a type checker is not very expressive: you can attach "types" to variables, and your type checker then proves that these types match up - but that's about it.  You cannot generally express that a fact is true about your code using a type. For example, saying "this line of code never gets executed" or "this number is always even" can be quite tricky to express when you can only say things like "this variable is an integer," "this variable is a floating point" or even "this variable is of custom type X." However, despite these limitations, it is often possible to carefully arrange (as we do later) that these type constraints imply some quite specific things.

There are more general tools, often associated with particular program languages, that allow you to more expressively state and prove things about your program. Why3 is one such example. But these come with a downside: you can express things that cannot easily be shown, but to help your proof tool prove things about your program, you'll have to provide hints.

These expressive constraints of type systems represent not only a historic legacy from static languages, but also a trade-off between ease of verification and expressive power.

What can this theory actually be used for?

But that's a lot of talk and theory. How can it help you?

One particularly useful application is improving the security of a system by ensuring that certain facts about a program's execution and the flow of information through the system are true.

This has found applications in Haskell, see for example: Ensuring Information Security by Using Haskell's Advanced Type System. Though, Haskell has a very complete type system, and also has a succinct way of "emulating the execution of code" using a rather odd construction called a monad.

Here I'll share a similar application we used in Python:

The problem in question was ensuring the security of confidential data while using this data to produce output. To do this we placed the "sensitive" data in a box as a class, created data types to represent anonymised data and then used the type system to limit the conversion of "sensitive" data to anonymous data.

At some level, if you are are writing a program yourself, it is impossible for you to prove it is secure. In other words, if you make a mistake writing it, you could also make a mistake checking it. But this approach allows you first to create a minimal number of primitives that you know maintain a certain property carefully verified by hand, and second to have the type system limit how you can compose these primitives.

Below is an illustration of this approach. This can be used together with --warn-return-any in mypy to ensure that all types (potentially in certain models) are checked.


This process allowed us to keep that confidential data just that: confidential, which is important for companies like Genie AI.

There are plenty of other uses cases on how Python type checking can be used for innovative purposes.  If you're looking at how you can implement type checking and need some inspiration, get in touch!

Notes

[1] Depending on implementation, static languages may still include and use runtime information in objects to work out what code to use some of the time. See, for example, run-time polymorphism.

[2] Language interpreters sometimes provide optimisations that make a program run "as if" dynamic binding were taking place; but, in reality, certain optimisations are used to make the program run faster. See, for example, PyPy and V8.

Photo by Brook Anderson on Unsplash