Starting with the basics - Part I

A foreword of sorts…

Some months ago, I had a conversation with a colleague of mine about cloud applications and why Kubernetes workloads need to scale out (more individual pods, or horizontal scaling) and not just up (more powerful pods, or vertical scaling). The discussion went on a tangent about distributed systems and, eventually, about why creating a large number of threads (or even processes) for a processor to handle eventually hits a ceiling. Whether that ceiling is truly relevant in the context of modern hardware and software is an entirely separate topic.

During our conversation, it became apparent just how disconnected modern software development is from the underlying hardware, or even from the Operating Systems the application processes run on. Coincidentally, I was getting ready to finish up my first SwiftUI introductory book. I decided to push the release of that book back a bit and capture the information I wanted to convey in that conversation in a better articulated manner, through the lens of Apple’s systems.

In this post, I will cover the first two sections of that introductory chapter. The rest will be included in a few other posts.

About (Swift) Applications and (Apple) Operating Systems

Today, Software Development as a profession is often done on top of very high level software frameworks, which usually hide away many of the more complex tasks.

Some frameworks are developed by large companies, such as Apple, Google or Microsoft, to help developers create software for their ecosystems, or to solve very specific problems they encounter in their business. Others are created by individual developers or small teams, to solve a problem they encounter (such as Laravel, Ruby on Rails, Vue.js etc.). This model works very well because it allows framework users to focus on their application’s core logic, while framework developers handle the underlying technical challenges.

Applications work well when both sides understand their roles: users must grasp the framework’s rules, and developers must understand the types of applications their users are building. However, this mutual understanding sometimes breaks down. Sometimes, the framework evolves in a direction that excludes specific types of applications. Other times, we use the wrong framework to solve a particular problem, influenced by hype or by an incomplete understanding of the framework’s purpose. More often still, framework creators lock users into their particular vision of how applications should function. The degree to which framework maintainers establish and enforce their own views (opinions) over how their users should build applications determines whether a framework is opinionated or not. Apple’s Combine framework, together with the SwiftUI framework, for example, are opinionated frameworks. They expect applications to be constructed by declaring how data is transformed over time and how the User Interface should use the data, without requiring the users of the frameworks to specifically call a render or an update function, for example. They are declarative, reactive frameworks which, by their very nature, require a specific structure.

Since frameworks hide complex lower-level interactions, developers often make assumptions that are partially or completely incorrect, leading to various issues. Meanwhile, framework maintainers, focused on their own development priorities, may create constraints that make certain application requirements difficult to implement. When these constraints are not properly documented, the resulting applications are more likely to behave incorrectly.

With experience, and as frameworks evolve, users become familiar enough with the architecture of the frameworks they use to form mental models in line with the framework maintainers’. At the same time, the framework documentation improves, leaving less room for incorrect assumptions.

While this model generally works well and allows application developers to move faster, it also allows them to become completely disconnected from mechanisms and notions that are not made visible (or usable) by their frameworks of choice.

Eventually, some developers (hopefully most of them) reach a point where they wonder what actually lies beneath the surface exposed by those frameworks. They may, for example, wonder how a touch on their screen (or a click of their mouse) results in the actions they see on their screen - whether it’s their favorite application starting up or their favorite game character moving. Or, perhaps, they would start wondering how macOS works and why it’s built the way it is - not just in terms of design and polish, but the internal mechanisms.

As a developer, you can spend most of your career writing applications for the Apple platform without knowing what the Mach kernel is or ever seeing an IPC message. You can write fast and safe concurrent multithreaded applications without knowing anything about POSIX Threads. In fact, since Swift 5.5, you may not even need to know what Grand Central Dispatch is.
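To make that last point concrete, here is a minimal, hedged sketch contrasting the two styles - a Grand Central Dispatch queue hop versus Swift 5.5 structured concurrency. The loadGreeting function is purely illustrative, not part of any Apple API.

    import Foundation

    // Pre-Swift 5.5 style: hop onto a background queue with Grand Central Dispatch,
    // then hop back to the main queue to publish the result.
    DispatchQueue.global(qos: .userInitiated).async {
        let greeting = "Hello"                // stand-in for expensive work
        DispatchQueue.main.async {
            print(greeting)
        }
    }

    // Swift 5.5+ style: the same idea expressed with async/await - no explicit queues.
    func loadGreeting() async -> String {
        "Hello"                               // stand-in for expensive work
    }

    Task { @MainActor in
        print(await loadGreeting())
    }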

However, if and when you do, you suddenly start to understand why Operating Systems, Application Frameworks and Programming Languages work the way they do. You start understanding the techniques others used to solve problems - and you gain the ability to apply those notions independently. You begin to realize just how complex and vast the technology space is, even in areas that seem specialized, such as the development of applications for the iPhone.

The purpose of this section is to explore how the Swift source code we write fits into the wider and more complex context of Apple Operating Systems running on Apple devices. This will help form a mental model of the way your own application’s code fits within this entire ecosystem. Then, we are going to follow the events that take place when an end-user touches the screen of their phone (or clicks a button on their mouse) to activate a button in an application’s user interface.

I find this helpful, though not necessarily mandatory, for a few key reasons:

  • You gain the vocabulary you need to express concepts and implementations. This is particularly useful when collaborating with other developers, or during your own research. Many technical terms carry specific implementation details, so a shared understanding between you and your interlocutors can go a long way. Additionally, the richer your vocabulary becomes, the more precisely you can ask the questions you need to get the answers you seek.

  • All Operating Systems essentially solve the same problem: they act as a bridge between the end-user, the applications they need to use and the hardware those applications run on. This is all orchestrated in a safe (relative to the system itself, not your data) and hopefully intuitive manner. The mechanisms they use to accomplish these tasks differ in certain implementation details, but the general ideas and concepts remain the same across all ecosystems (partly for convenience, partly because the ideas were that good to begin with). For example, all Operating Systems use drivers to describe devices (such as a mouse or printer) in ways that are useful to the software running on the operating system. However, each OS may (and usually does) have its own systems used to build those drivers.

  • Generally, low level systems do not change in fundamental ways, and if they do, they rarely change quickly or suddenly. Therefore, if you know how a specific kernel worked 5 years ago, it will likely work the same now and 5 years from now. The same goes for kernel extensions. As this section will show, there are portions of the macOS Operating System that were written in the 1980s and 1990s. They are still relevant and important, decades later. This is why some of the references provided throughout this chapter are sourced from Apple’s archives website.

When referring to software components and concepts, we can generally use the terms “low level” and “high level”. For example, SwiftUI is the higher level framework, while UIKit is the lower level framework. Even lower level examples would be CoreAnimation or CoreGraphics. Regardless of the framework, the direction remains consistent: lower level indicates that a concept is closer to the device’s hardware space, while higher level indicates that the concept is closer to the end-user space. This is a useful way to describe frameworks and programming languages because, the closer we get to the hardware, the fewer protection mechanisms we generally have. As a result, the impact of potential programming errors is generally higher in lower level environments.

When analyzing complex software systems, it’s important to set a context or abstraction domain and then ground the analysis to that context. This is especially useful because a single term can mean multiple things, depending on the context.

For example, the diagram below presents a Button in various contexts. While “Button” conceptually means an interactive control, its technical definition and characteristics vary, based on the domain (or even frameworks of the same domain). Specifically:

  • In SwiftUI, the button exists as a Button view built upon the UIControl class from UIKit, which itself derives from the UIResponder class.

  • At a lower level, the button is expressed in the context of the screen (the display device) and it can be represented by a Touch Area and a CoreAnimation Layer.

  • In Memory, it simply becomes a block of data, representing its state and, potentially, references to other objects (such as the associated functions to be executed when touched).

All of these perspectives are valid and individually accurate, but no single one tells the whole story.

 

The concept of a Button, represented in various contexts

 

When referring to a button in UI Frameworks, you are referring to the control, which is a high level abstraction that contains a visual element (the rendered button) and a logical element (what should happen when the button’s action is called). These concepts exist entirely in the Application’s User Interface domain. In the context of the screen’s display component (the OLED), you have an area of pixels. In the context of the screen’s digitizer (the assembly that converts physical interactions into digital signals), you have various signals from a sensor array (changes in the electrical/magnetic field, recorded for each sensor), which the Hardware Controller registers and processes to extract information about touched areas. In the context of Memory, you have the frame buffer, which contains the information related to the visual representation of the Button, so that it can be drawn on the screen - and so on.
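To anchor the “control” perspective in code, here is a minimal SwiftUI sketch (the GreetingView type and its state are made up for illustration): the label is the visual element, the closure is the logical element, and nothing in it references pixels, sensors or frame buffers.

    import SwiftUI

    struct GreetingView: View {
        @State private var message = "Waiting…"

        var body: some View {
            VStack {
                Button("Say hello") {        // visual element: the rendered label
                    message = "Hello!"       // logical element: the action to run when activated
                }
                Text(message)
            }
        }
    }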

 

The terms framework and library are sometimes used interchangeably, especially in front end development and in discussions about React. This is usually because they both provide code that you can reuse and because some developers have very specific expectations from frameworks.

However, the two terms describe different relationships between your code and the reused code - and it’s useful to clarify them here, even though you will likely also use the terms loosely.

 

Libraries, such as the ArgumentParser library, provide reusable code you can integrate into your application. Their purpose is to add functionality by allowing you to import their files and use them directly, as an integral part of your code. In other words, libraries integrate into your project.

Frameworks, such as SwiftUI, UIKit and others also provide some reusable code you can integrate into your application, but with a different purpose. The code exposed by frameworks acts as a connection point between the framework and your own code. In the case of SwiftUI, you use the View protocol to define views. When SwiftUI builds or updates views, it looks at structures conforming to that protocol. When it needs to render a view, it checks the structure’s body property, which is required by the View protocol. In other words, your code provides specialized behavior, conforming to the framework’s requirements, which the framework uses in certain cases. This is known as inversion of control.

Put simply, libraries integrate into your code, whereas your code integrates into frameworks.
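A short, hedged sketch of the two relationships (it assumes the swift-argument-parser package is available via Swift Package Manager; the Greet and ContentView types are made up for illustration):

    import ArgumentParser
    import SwiftUI

    // Library: your code stays in charge. You build the command and you decide
    // when parsing happens, by calling main() yourself.
    struct Greet: ParsableCommand {
        @Argument var name: String
        func run() throws { print("Hello, \(name)!") }
    }
    // Greet.main()   <- you invoke the library explicitly, e.g. from main.swift

    // Framework: inversion of control. You only describe the view; SwiftUI decides
    // when (and whether) to read `body` in order to build or update the UI.
    struct ContentView: View {
        var body: some View {
            Text("Rendered whenever SwiftUI asks for it")
        }
    }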

 

Frameworks vs Libraries

 

Brief introduction to Operating Systems

If you own a personal computer, it is likely running an operating system that is either Apple’s macOS, Microsoft’s Windows or a variant of Linux. Of the three, two use a kernel (a complex set of abstractions that bridges the device’s hardware with the rest of the software running on it) that represents a variant of POSIX/Unix (Portable Operating System Interface / UNIplexed Information and Computing Service), while the other uses the Windows NT kernel.

There are numerous criteria for classifying Operating Systems - and one of them is the way they structure Input and Output abstractions. There are others, such as the guarantees they make about Response Times (Real Time Operating Systems vs general-purpose Operating Systems).

In POSIX, the core philosophy is that everything is a stream of bytes (or a file). In Windows NT, on the other hand, everything is a specialized, securable object. This may seem like a small difference, but it essentially dictated the path of the operating systems that used them. For example:

  • As its name suggests, the POSIX approach is to ensure the conforming Kernels are portable. By abstracting everything as a stream of bytes, you can effectively configure anything in a file (in Apple’s case, two files, because you typically include a .plist file as well). With Windows, on the other hand, you use the Windows Registry, and the configuration is not as easily ported to another system.

  • On POSIX, you use Native, Lower Level Abstractions. You can write native C tools to interact with the POSIX APIs, or you can use Apple’s C-level Core APIs - all the way up to Objective-C or Swift code. On the NT Kernel, however, since it is proprietary from the ground up, you usually interact with Higher Level Abstractions provided by Microsoft (.NET Core, UWP/WPF). This does not necessarily make one approach safer or easier to use than the other; they just require different mindsets.

Overall, both ecosystems encourage developers to use the highest-level APIs possible and have invested heavily in toolsets that help developers build more complex, stable, and safe applications faster.

Despite these differences in approach, every Operating System (macOS, iOS, watchOS etc. for Apple or Windows for Microsoft) comprises, among many other complex systems, the same main components: an OS Kernel, a Service Manager, one or several Event Manager(s) and a User Interface Manager:

  • The OS Kernel is the piece of software that acts as the bridge between the Hardware and the Operating System’s higher order components (daemons, UI and end-user facing applications). It is a collection of services (often C functions, objects and data structures) that provide fundamental capabilities (assigning work to the CPU, allocating memory, reading from memory etc.) to higher level components. A key concept to grasp is that kernel services are loaded into memory when the operating system boots up and, unlike higher order constructs (like applications or web services), kernel services never stop. If an error is encountered in the kernel functions, it has the potential to bring the whole machine down, requiring a complete reboot and reload of the kernel. Apple uses the XNU kernel, which is a combination of FreeBSD and Mach. Together with other core components in Apple’s ecosystem, the XNU kernel is part of the Darwin Operating System. Over the years, Apple developed numerous Kernel Extensions to enhance its functionality and organized them in various ways, to be bundled with its various devices.

  • The Service Manager is a process (a workload container in which a program’s code runs, identified by a process ID - PID, for short) which starts after the OS completes its loading process. For Apple’s ecosystem it’s launchd, and for Linux it’s init. Sometimes it’s called a System Manager. It starts with the Operating System, as the very first User Mode process created, which is why it receives the Process ID (PID) value of 1. It runs as a daemon process (a background process which runs in an infinite loop and performs tasks without user interaction) and it starts the applications that should load up after the Operating System boots. Finally, once the startup level applications are started, the system manager daemon process continues running, waiting for instructions for as long as the Operating System runs (a short sketch after this list shows how a process can inspect its own PID and its parent’s).

  • The Event Manager is responsible for handling various types of events. There are usually multiple event handlers, each handling specific types of interactions - and in various activity spaces (file system events, system events, touch events and so on). For macOS, mouse clicks are registered via WindowServer, while on iOS, touches are handled by backboardd (which was introduced in iOS 6, as a break-out daemon from SpringBoard). They all receive the events from the low-level IOKit kernel extensions.

  • The User Interface Server is responsible for managing the User Interface elements. For macOS, it is WindowServer, whose main function is to open CGXServer (Core Graphics X Server). For iOS, it’s SpringBoard. Additionally, there is a WindowManager component as well - and it is responsible for grouping and managing window positions in various configurations (for example, on multiple virtual desktops). There is also a RenderServer component, which ensures that the correct data is processed by the GPU at the correct time, so that the connected displays can show the UI correctly.
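As referenced in the Service Manager bullet above, a user-space process can observe this hierarchy directly. A minimal sketch (macOS; the exact parent depends on how the process was launched):

    import Darwin

    // Every process has its own PID; its parent chain eventually leads back to
    // launchd (PID 1). For an app launched from the Finder, the parent is usually 1.
    print("my PID:     ", getpid())
    print("parent PID: ", getppid())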

In most scenarios, you will rarely (if at all) interact with these components directly - especially the OS Kernel. Instead, you would typically write code that integrates into Apple’s frameworks which, in turn, would interact with the OS Kernel on your behalf.

Since the Operating System is responsible for the management of hardware resources and all applications that try to use them, providing higher level frameworks for lower level interactions is not sufficient to ensure its integrity and performance. The OS is not inherently secure simply by following the distinction between high level or low level, nor is it secure by dividing work among more processes. For this reason, as a bare minimum, the software running on a device is divided into two main spaces:

  • The User Space, where applications external to the OS, such as eBook Readers, browsers, games etc. run. Software running in this space needs to be protected from other software running in the same space. This is accomplished by ensuring that each application receives its own memory (memory address space) that it can read from and write into. Generally, one application cannot read another application’s memory directly.

  • The Kernel Space, where the Kernel runs. The software running in this space can potentially access any location in memory, can run any operations and can control all input/output address spaces. Since it essentially has unrestricted control everywhere, this space needs to be separated from the User Space - and access to its resources is tightly controlled.

This separation ensures that an application running in the user space cannot take up more resources than the Operating System considers to be safe - and that an application crash does not take the whole operating system down. There are cases where these issues do still occur - but the separation between the User Space and the Kernel Space aims to lower the number of occurrences.

In general, User Space applications use System Calls (wrapper functions within the libSystem dynamic library) to prepare the instructions they request from the software running in the Kernel Space. The Kernel (in the case of XNU, either the BSD components or the Mach components) can then request the CPU to execute the commands. Lastly, the CPU executes kernel-issued commands only when it receives a specific trap, which signals its transition from User Mode (Ring 3 on x86/64 architectures or EL0 - Exception Level 0 - on ARM) to Kernel Mode (Ring 0 on x86/64 architectures or EL1 on ARM).
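As a small illustration of those wrappers, the sketch below calls write(2) from Swift through the Darwin module (which re-exports libSystem); the wrapper prepares the arguments and performs the trap into the kernel on our behalf.

    import Darwin

    let message = "Hello from user space\n"
    let bytes = Array(message.utf8)
    _ = bytes.withUnsafeBytes { buffer in
        // write(2): a libSystem wrapper - file descriptor 1 is standard output.
        // The wrapper places the arguments where the kernel expects them and
        // issues the trap that switches the CPU into kernel mode.
        write(1, buffer.baseAddress, buffer.count)
    }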

Besides the logical separation between the kernel and end-user-facing applications, the design of an operating system includes other concerns. To function effectively, applications need to interact with memory. From the application’s executable code, to dynamic libraries, to runtime variables, every useful piece of an application is stored somewhere in memory. As such, memory use, management and security are critical concerns for any operating system and its surrounding components.

When any application starts up, it goes through a Setup Phase, where the OS Kernel assigns a dedicated Virtual Memory Address Space for internal use. This serves as the application’s addressable memory space. During the application’s normal operation (or the Frequent Access Phase), its instructions access addresses from the Virtual Memory Address Space.

This system enables true multitasking and the parallel execution of processes. Since every process operates within its own virtual memory address space, two processes could store different data, at the same virtual address. When each of the two processes needs to access the address, the MMU converts the virtual address to a unique, real memory location. Since the physical locations are unique, the two processes can run without interfering with each other.

Access to physical memory is orchestrated through a separate hardware component known as the Memory Management Unit (MMU), which translates between the virtual memory space and the physical memory space using a page table. Because memory access goes through this separate orchestrator, a given process cannot access the data of another process, unless the MMU itself becomes compromised.

The diagram below showcases the separation between Exception Levels, as well as the way memory is allocated on setup and used during the frequent access state of applications.

 

Separation between Application Memory Addresses

 

To further improve security, the Operating System (and the dynamic loader) use a mechanism called ASLR (Address Space Layout Randomization), which randomizes the virtual memory addresses where critical program components are loaded when a process starts up. Rather than always using the same addresses to load executables, libraries or other data, ASLR introduces controlled randomness to the memory layout. For example, the configuration data of a process may be loaded at address 0x01020304 during one program execution, but at address 0x04023010 during the next execution. Similarly, the stack could begin at 0x7ffdb93e0000 in one instance and at 0x8ffe12345678 in another. This makes it significantly more difficult for attackers to predict where a specific piece of data is stored by a certain application. ASLR has a direct consequence on compilers, because the Object files they generate need to be expressed as Position-Independent Machine Code. In most cases, addresses are expressed relative to the current instruction pointer (program counter), rather than as absolute locations.
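A small way to observe this behaviour from Swift (compile and run the binary a few times and compare the printed values; the exact addresses are, of course, system- and run-specific):

    import Darwin

    var localValue = 42
    withUnsafePointer(to: &localValue) { pointer in
        // The address of a stack variable changes between executions because of ASLR.
        print("stack variable at:", pointer)
    }

    // The address at which a libSystem symbol was mapped also changes between runs.
    if let symbol = dlsym(dlopen(nil, RTLD_NOW), "getpid") {
        print("getpid mapped at: ", symbol)
    }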

Perhaps equally important, because virtual memory abstracts away the physical memory layer, it can essentially create the illusion of an extended memory space. For example, a 32 bit application can use up to 4 GB of addressable space, while a 64 bit application could theoretically address up to about 18 exabytes (roughly 18 billion GB). This also allows operating systems to use various types of non-RAM storage (such as the Hard Disk) as extensions to RAM.
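A quick back-of-the-envelope check of those figures (purely illustrative; real processes are capped well below the theoretical maximum):

    import Foundation

    let bytes32 = pow(2.0, 32)   // 4,294,967,296 bytes  ≈ 4 GB
    let bytes64 = pow(2.0, 64)   // ≈ 1.8 × 10^19 bytes  ≈ 18.4 EB, i.e. roughly 18 billion GB
    print(bytes32, bytes64)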


About Memory and Data Transfer over various mediums

Computers (and phones) function by processing electrical signals as data. Any piece of information is represented as a sequence of 1s and 0s, or a Binary Sequence. To illustrate this concept, we can take the letter F as a simple example. To represent it in a way that can be expressed as a sequence of 1s and 0s, we need to choose an encoding. The encoding is the protocol that both the sender and the receiver need to use in order to understand the binary sequence. The sender uses it to encode the letter F into a binary sequence, and the receiver then decodes that sequence using the same protocol. As long as the two parties use the same encoding, they can communicate with each other. There are several encodings for characters (String Runes):

  • ASCII is the foundational character set for the English language. It uses 7 bits to represent 128 characters, but the data is almost always stored in a standard 8-bit byte.

  • UTF-8 is the dominant encoding for the web. It's a variable-length encoding designed to represent every character in the Unicode standard. It is fully backward-compatible with ASCII.

  • UTF-16 is another variable-length encoding for Unicode. Its basic unit is a 16-bit (2-byte) chunk.

  • UTF-32 is a fixed-length encoding for Unicode. Its basic unit is a 32-bit (4-byte) chunk, and every character occupies exactly one unit.

Wherever an encoding uses more than 1 byte, we usually specify whether the sequence is read with the most significant byte first (Big Endian) or the least significant byte first (Little Endian). Essentially, endianness represents the order in which the sequence is read. Big Endian represents the “natural order”, whereas Little Endian represents the “reverse order”. Depending on endianness, the same sequence can mean two different things. For example, the UTF-16 character represented by the binary sequence 00000000 01000110, can be read as:

  • The Uppercase English Letter ‘F’, when using Big Endian, because 00000000 01000110 is read as 0x0046 in Hexadecimal

  • The CJK Unified Ideograph at code point U+4600, when using Little Endian, because 00000000 01000110 is read as 0x4600 in Hexadecimal

Because the order in which bytes are read matters, it’s useful to know that, in network protocols, Big Endian is the default. For file formats (especially text), we can specify the endianness using a BOM (Byte Order Mark) - U+FEFF (ZERO WIDTH NO-BREAK SPACE) - an invisible character placed at the beginning of the file. For example, if a UTF-16 file starts with the byte sequence FE FF, the file was saved as Big Endian. If it starts with FF FE, it was saved as Little Endian.
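The whole discussion can be reproduced in a few lines of Swift (using Foundation on a Mac; the plain .utf16 case prepends a BOM in the host’s byte order, so the exact output depends on the machine):

    import Foundation

    let f = "F"   // U+0046
    let hex: (Data) -> String = { $0.map { String(format: "%02X", $0) }.joined(separator: " ") }

    print(Array(f.utf8))                             // [70] - 0x46, identical to ASCII
    print(hex(f.data(using: .utf16BigEndian)!))      // 00 46
    print(hex(f.data(using: .utf16LittleEndian)!))   // 46 00
    print(hex(f.data(using: .utf16)!))               // BOM first: FF FE 46 00 on a little-endian host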

Since any type of information can be expressed in binary, as long as both the sender and the receiver agree on the encoding (or format, or protocol, or standard, depending on the context) and the endianness, we can save information for later use, in some type of memory - or we can transfer it to other systems.

 

Although encoding can slightly obfuscate the data being converted, it is not a security mechanism. For example, you can easily take a base64 encoded string and then decode it. To properly secure data, you would need to use encryption, which requires cryptographic algorithms.
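For example, with Foundation the round trip takes only a couple of lines - which is exactly why Base64 on its own protects nothing:

    import Foundation

    let original = "my password"
    let encoded = Data(original.utf8).base64EncodedString()        // "bXkgcGFzc3dvcmQ="
    let decoded = Data(base64Encoded: encoded)
        .flatMap { String(data: $0, encoding: .utf8) }

    print(encoded, decoded ?? "")                                   // the original comes straight back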

 

In the context of storing data for later use, one of the simpler examples is the common Dynamic Random Access Memory (DRAM), which stores data by using billions of small constructs known as memory cells. Each memory cell stores a single bit and consists of a capacitor (which can store charge) and a transistor (which acts as a switch). In a nutshell, when the capacitor is charged, it represents the value 1 - and when it’s discharged, it represents the value 0. The diagram below represents a single DRAM memory cell. DRAM cells are arranged in a matrix. This type of memory is wiped when the system is powered off, since the capacitors are very small and cannot hold a charged state for long. For this reason, this type of memory is known as Volatile Memory.

 

DRAM Memory Cell (Left) and DRAM Memory SubArray (Right)

 

To read from DRAM, the system applies a small voltage to a specific Word Line, which closes the transistors connected to it. As a result, the circuit between each transistor’s capacitor and the associated Bit Line is closed as well. This activates the cell. If the cell’s capacitor is full, the Sense Amplifier on the Bit Line detects a small increase in the Bit Line’s current, which signifies a 1. On the other hand, if the capacitor is empty, the Sense Amplifier detects a small decrease in the Bit Line’s electrical current, signifying a 0. When a capacitor is discharged, it needs to be recharged to show the same value on the next read operation. For this reason, read operations on DRAM are destructive and they usually result in an immediate write operation (essentially, the controller reads the memory, then writes it again, while also sending the information to the requestor). Since DRAM is based on capacitors, which leak current, this type of memory needs to be periodically refreshed, by reading and rewriting the data. Most modern DRAM refreshes each row at least once every 64ms.

There are many other storage mechanisms, such as SRAM (Static RAM), based on a transistor feedback loop, which does not require a refresh (hence the name static), Magnetic Disks (the classic HDD), Flash memory and so on - each with their own requirements and implementations. At a high enough level, though, all memory storage mechanisms serve the same function: they store binary data, which can be retrieved or modified.

 

All operations (reading, writing, refreshing) are highly dependent on a well synchronized clock signal. For example, it’s important to ensure that the data is read by the controller exactly when the sense amplifiers would detect the discharge. For this reason, nearly everything in computing is, at the lowest possible level, synchronous. Clock speeds are measured in Hertz (Hz); 1 Hz is the equivalent of one tick per second. To put this in perspective, a 2GHz clock (typical for a modern processor) would “tick” 2 billion times every second.

 

Lastly, data can be transferred from one medium to another. This is accomplished by passing electrical current either through independent (floating) wires or through conductive traces printed on a Printed Circuit Board. Generally, the individual wires or traces are called lines, and a grouping of these lines is known as a Bus. Conceptually, there are 3 main types of buses (though at a very high level, the Bus represents the whole group). If the bus is used to transfer the address for which the read or write operation should be performed (Read from Address or Write to Address), it’s an Address Bus. If it’s used to send a clock signal, which acts as a synchronization mechanism (metronome) for the system, it’s a Clock Bus. Finally, if the bus is used to transfer the actual binary data, it’s a Data Bus.

Besides their purpose, buses are characterized by a width, or a number of lanes, which represents the number of wires the bus uses to transfer signals. Generally, Address Buses and Data Buses have a higher width, while the Clock Bus is usually 1 line wide (it only has one wire).

Not all standards separate the address from the data lines. In many cases (I2C for example), both data and address information are sent on the same wire, but in different frames (a data frame and an address frame).

Data Buses can be serial (they transmit the value of one bit at a time) or parallel (they transmit the values of several bits at a time). The advantage of parallel buses is that they can send data faster, because they can send more bits in a single clock cycle. However, when the traces curve, the lines end up with slightly different lengths, so it’s more difficult to ensure that all bits reach their destination at the same time.

The diagram below showcases the example of how the uppercase letter F is transmitted, in UTF-8 binary encoding, over a serial and over a parallel bus. As a reminder, the binary value for this character is 01000110.

Data can be transferred either in a Little Endian or in a Big Endian order (the Least Significant Bit is transferred first, or the Most Significant Bit is transferred first). In this case, we are looking at a Little Endian system, so the least significant bit is transferred first. The example is simplified, as parallel buses are usually much wider. For example, a parallel bus might use 32 wires to transmit 32 bits per clock cycle, instead of the 4 bit wide data bus in the example.
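A toy model of the serial case (one data line, one bit per clock tick, least significant bit first; the loop merely stands in for the shared clock line):

    let byte: UInt8 = 0b0100_0110   // 0x46, "F" in UTF-8

    for tick in 0..<8 {
        let bit = (byte >> tick) & 1
        print("clock tick \(tick): data line carries \(bit)")
    }
    // Prints 0, 1, 1, 0, 0, 0, 1, 0 - the least significant bit of 01000110 goes out first.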

 

Serial (left) and Parallel (right) Data Buses with a dedicated global clock line

 

Over long distances (more than the usual distance between components on a single circuit board), this mechanism with a dedicated clock line becomes less and less feasible. Due to the physical characteristics of the wires, the timing signals can drift slightly between the communicating devices. This is known as clock skew or timing skew. To address this, most hardware protocols used to transfer data outside the realm of a single computer board (such as SATA, USB, Ethernet etc.) use various mechanisms to transfer the clock information within the data signal itself. This is colloquially known as Clock Embedding or Self-Clocking. For example, the Ethernet protocol specifies that, at the beginning of each transferred frame (more specifically, in the Preamble of each frame, or the first 8 bytes of the frame), there must be a sequence of alternating high and low bits (10101010…), which the receiving Network Controller uses to synchronize its own internal clock. This mechanism is known as Clock and Data Recovery.

To be Continued…
