Author: Doug, 10+ year veteran of embedded development.

Official account: [IOT town], covering C/C++, the Linux operating system, application design, the Internet of Things, microcontrollers, embedded development, and related fields. Reply [books] to the official account to get classic books on Linux and embedded development.

Reprints: you are welcome to reprint this article; please credit the source.

Several important segment registers

Linear address ranges in Linux 2.6

A “complete” 8086 assembly program

In the first two articles, we studied the basic usage of the CPU and memory in the 8086 processor, focusing on the segment registers and memory addressing.

Some readers may be skeptical: we now have modern multi-core processors and very powerful operating systems, so why still learn such antique knowledge?

Here is my answer:

“We all want to learn what’s new and useful, but the learning process is objective.”

“Any reasonable learning process (eliminating detours, blind exploration and disorganization as much as possible) is a gradual process.”

“We must first learn general laws and methods by studying something that is simple enough to grasp in full.”

Take the Linux operating system as an example. In a long-term learning plan, you are unlikely to start by reading the code of the latest version, Linux 5.13.

You are more likely to study version 0.11 first, understand some of its principles and ideas, and then gradually explore later versions.

So in this “Linux learn from scratch” series, I want to take the longer route: start from the underlying hardware mechanisms and driver principles, move from simple to complex, and step by step chew through the tough bone that is the Linux operating system.

So today we’ll continue with 8086 and look at the basic structure of a relatively “complete” program.

Several important segment registers

The segment addressing mechanism and associated registers are so important in x86 systems that I can’t resist summarizing a few segment registers here.

Code segment: used to store code. The base address of the segment is held in register CS, and the instruction pointer register IP holds the offset address of the next instruction within the segment.

Data segment: used to store the data processed by the program. The base address of the segment is held in register DS. When operating on data in the data segment, the offset address is given directly in the assembly code by an immediate value or a register.

Stack segment: also used to store data, but accessed in a special way: through the PUSH and POP instructions. The base address of the segment is held in register SS, and the offset address of the top-of-stack unit is held in register SP.

The segment here is essentially a contiguous chunk of memory dedicated to a particular type of data.

We can do this because the CPU makes this arrangement possible through the above registers.

In a nutshell: the CPU treats the contents of a segment of memory as code because CS:IP points to it; the CPU treats a segment as a stack because SS:SP points to it.

In a previous article, “The compiled, linked foundation of Linux systems - ELF files: Peel back its layers and Explore bytecode granularity”, we showed exactly which segments an ELF executable file contains:

Although the segment structure described in this diagram is more complex, it is essentially the same as the segment structure described in 8086!

Linear address ranges in Linux 2.6

In a modern operating system, the address space used by a process is called a virtual address (also known as a logical address).

The virtual address is first translated by the segmentation mechanism into a linear address. The linear address then goes through paging translation to produce the final physical address.

A digression: books use many different names for memory addresses, each following its author's own habits.

I understand it as described above: the compiler generates an address called a virtual address, also known as a logical address, which then goes through a two-stage translation to reach the final physical address.

In the Linux 2.6 code, Linux treats the entire 4 GB address space as one “flat” space (the segment base address is 0x0000_0000 and the maximum offset is 4 GB), so virtual addresses (logical addresses) are numerically equal to linear addresses.

Let’s combine this picture from last time to understand:

In Linux 2.6, the user code segment starts at address 0 and has a maximum range of 4 GB; the user data segment also starts at address 0 and has a maximum range of 4 GB. The same goes for the kernel's code and data segments.

Why are virtual addresses (logical addresses) numerically equal to linear addresses?

Linear address = segment base address + virtual address (offset). Because the segment base address is 0, the linear address is numerically equal to the virtual address.

Linux does this because it does not want to rely heavily on the segmentation mechanism provided by x86 for memory address management; it prefers to use paging, which allows more flexible address management.
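The formula above can be sketched in a few lines of C. This is only an illustration: the `segment` struct, `to_linear`, and `flat_segment` are names I made up here, and a real x86 segment descriptor carries more fields than a base and a limit.

```c
#include <stdint.h>

/* Hypothetical, simplified segment description: base and limit only.
 * A real x86 descriptor also holds type and privilege bits. */
struct segment {
    uint32_t base;   /* segment base address */
    uint32_t limit;  /* highest valid offset inside the segment */
};

/* Linear address = segment base address + virtual address (offset). */
uint32_t to_linear(const struct segment *seg, uint32_t offset)
{
    return seg->base + offset;
}

/* The flat model described in the text: base 0, offsets covering 4 GB. */
const struct segment flat_segment = { 0x00000000u, 0xFFFFFFFFu };
```

With `flat_segment`, `to_linear` returns its offset argument unchanged, which is exactly why the two kinds of address are numerically equal. With a nonzero base they would differ.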

One more word of caution:

In each of the descriptions above, I point out whether a mechanism or policy is provided by the x86 platform or by the Linux operating system.

The same is true for paging, which is provided by x86 hardware, but Linux has extended it for more flexible memory address management.

Therefore, when you read some books, you should have a picture in mind: what is the context of the current description?

When we create a process, all the linear address ranges that the process has are recorded in the kernel.

The set of linear address ranges owned by a process is dynamic: it expands or shrinks at any time according to the needs of the program, for example when a file is mapped into memory or a dynamic library is loaded or unloaded.

As we know, the kernel manages physical memory in units of “page frames”.

A page frame is a fixed-size block of physical memory, typically 4 KB, that holds one page. This is the physical side of memory management.

A linear address range can map to multiple physical pages. Each linear address is eventually translated to a physical address through a multilevel page table.
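To make the "multilevel page table" a little more concrete, here is a minimal sketch of how a 32-bit linear address is split under classic two-level x86 paging with 4 KB pages (10 bits of page directory index, 10 bits of page table index, 12 bits of offset). The function names are my own, and the sketch assumes paging without PAE; it only shows the bit split, not a walk through real tables.

```c
#include <stdint.h>

/* Bits 31..22: index into the page directory (1024 entries). */
uint32_t dir_index(uint32_t linear)
{
    return linear >> 22;
}

/* Bits 21..12: index into the selected page table (1024 entries). */
uint32_t table_index(uint32_t linear)
{
    return (linear >> 12) & 0x3FFu;
}

/* Bits 11..0: byte offset inside the 4 KB page. */
uint32_t page_offset(uint32_t linear)
{
    return linear & 0xFFFu;
}
```

For example, the linear address 0x08049123 splits into directory index 0x20, table index 0x49, and page offset 0x123.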

Note: In the figure above, linear address interval 1 maps to N pages in the physical address space, which may or may not be contiguous.

Although the pages may be discontiguous in physical memory, our applications see a contiguous space, because the paging mechanism hides this from them.
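The idea can be shown with a toy model. The sketch below uses a single flat table for one linear address range instead of a real multilevel page table, and the frame numbers 7, 2, and 9 are made up purely for illustration; `toy_translate` and `TOY_NO_MAPPING` are hypothetical names.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE  4096u
#define TOY_NO_MAPPING 0xFFFFFFFFu

/* Entry i is the physical frame number backing the i-th page of the
 * range. The frames are deliberately not adjacent (made-up values). */
static const uint32_t toy_frames[] = { 7u, 2u, 9u };

/* Translate a linear address inside the range starting at range_start. */
uint32_t toy_translate(uint32_t range_start, uint32_t linear)
{
    uint32_t rel  = linear - range_start;   /* wraps if linear < start */
    uint32_t page = rel / TOY_PAGE_SIZE;
    uint32_t off  = rel % TOY_PAGE_SIZE;

    if (page >= sizeof(toy_frames) / sizeof(toy_frames[0]))
        return TOY_NO_MAPPING;              /* outside the mapped range */
    return toy_frames[page] * TOY_PAGE_SIZE + off;
}
```

Three consecutive linear pages starting at, say, 0x40000000 land in frames 7, 2, and 9, yet the program only ever sees the contiguous linear side.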

A “complete” 8086 assembly program

Let’s go back to the 8086.

On the 8086, the address produced by segment address translation is already a physical address; there is no complex page table translation.
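As a reminder of what segment address translation means on the 8086, here is a small C sketch: the physical address is the segment value shifted left by 4 bits plus the offset, kept to the 20 address lines of the chip. The function name `phys_addr_8086` is my own.

```c
#include <stdint.h>

/* Physical address on the 8086: segment * 16 + offset.
 * The chip has 20 address lines, so the sum wraps at 1 MB. */
uint32_t phys_addr_8086(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

For instance, `phys_addr_8086(0x1000, 0x001E)` gives 0x1001E.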

That’s why we use 8086 as a learning platform: to get away from the complexities of the operating system and explore the underlying stuff.

In this simplest of assembly programs, three segments are used: a code segment, a data segment, and a stack segment.

A segment is an address space. Since it is an address space, it must contain two elements: where to start and how long.

Let's go straight to the code:

    assume ds:addr1, ss:addr2, cs:addr3

    addr1 segment           ; place the data segment at this position
        db 32 dup (0)
    addr1 ends

    addr2 segment           ; place the stack segment at this position
        db 32 dup (0)
    addr2 ends

    addr3 segment           ; code segment
    start:
        mov ax, addr1
        mov ds, ax          ; set the data segment register
        mov ax, addr2
        mov ss, ax          ; set the stack segment register
        mov sp, 20h         ; set the stack-top pointer register
        ...
    addr3 ends

    end start

This is the basic structure of an assembly program, in which we have laid out three segments.

The three labels addr1, addr2, and addr3 represent the start address of each segment. At the beginning of the code segment, the address represented by the data segment label addr1 is assigned to the DS register, and the address represented by the stack segment label addr2 is assigned to the SS register.

Isn't this kind of label very similar to a goto label in C? Both represent an address.

Note that SP, the stack-top pointer register, is set to 20h.

Because the stack grows from high addresses to low addresses, the stack-top pointer of an empty stack must point one past the highest address unit of the segment. The stack segment here is 32 bytes (offsets 00h to 1Fh), so that value is 20h.

When a value such as 1234H is pushed, first SP = SP - 2, so SS:SP points to 1000:001E (assuming SS is 1000h), and then 1234H is stored at that address.
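The push sequence can be imitated in C. This is a simplified model and not how any real emulator is written: the `cpu8086` struct, the `mem8086` array, and the function names are all invented for this sketch, and SS = 1000h is the same assumption as in the text.

```c
#include <stdint.h>

/* Hypothetical minimal CPU state: only the two stack registers. */
struct cpu8086 {
    uint16_t ss;  /* stack segment register */
    uint16_t sp;  /* stack-top pointer register */
};

/* 1 MB of simulated memory, the full 8086 address space. */
uint8_t mem8086[1u << 20];

/* segment * 16 + offset, limited to 20 address bits. */
static uint32_t stack_phys(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFu;
}

/* PUSH: decrement SP by 2 first, then store the word at SS:SP.
 * The 8086 is little-endian: low byte at the lower address. */
void push16(struct cpu8086 *cpu, uint16_t value)
{
    cpu->sp = (uint16_t)(cpu->sp - 2);
    mem8086[stack_phys(cpu->ss, cpu->sp)] = (uint8_t)(value & 0xFFu);
    mem8086[stack_phys(cpu->ss, (uint16_t)(cpu->sp + 1))] = (uint8_t)(value >> 8);
}

/* POP: read the word at SS:SP first, then increment SP by 2. */
uint16_t pop16(struct cpu8086 *cpu)
{
    uint16_t lo = mem8086[stack_phys(cpu->ss, cpu->sp)];
    uint16_t hi = mem8086[stack_phys(cpu->ss, (uint16_t)(cpu->sp + 1))];
    cpu->sp = (uint16_t)(cpu->sp + 2);
    return (uint16_t)(lo | (hi << 8));
}
```

Starting from SS = 1000h and SP = 20h, `push16(&cpu, 0x1234)` leaves SP at 1Eh, with 34h at physical address 1001Eh and 12h at 1001Fh. A following `pop16` returns 1234h and restores SP to 20h.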

In addition, the last line of the code (end start) tells the assembler that the address represented by the start label in the code segment is the entry address of the program. This entry address is written into the executable file when the program is built.

After the executable is loaded into memory, the loader finds the entry address and sets CS:IP to point to it, so that execution starts at the first instruction.

Compare the entry address in the ELF executable shown in “ELF Files: Peeling away their layers and Exploring the Granularity of bytecode” with the entry address represented by the start label in the 8086 program above:




