Socket creation in SerenityOS June 2, 2020 on tomleb's blog

I have been interested in SerenityOS, a graphical Unix-like operating system developed by Andreas Kling.

This blog post will dive into the internals of SerenityOS to look at the network stack and how a listening socket works. There will be a bit of assembly as well as C++.

Network in SerenityOS

There are two major parts to the network stack of SerenityOS. The first part is the socket syscall that is called from userland. The second part is the actual receiving of packets from the NIC. I will only briefly cover the first part and focus more on the second one because I find it more interesting. We will look at the source code at commit c00076de829bfb55a64816f53257ad0e641ccd05, which is the latest at the time of writing. Note that I am not the author of the code in this blog post; the contributors of SerenityOS are.

For the rest of this blog, we will look at a very simple example: a program creating a socket to receive UDP packets.
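To make the example concrete, here is a minimal sketch of such a program, assuming the usual POSIX socket API. The helper names open_udp_listener() and receive_one() are made up for illustration, and the socket is bound to the loopback address only to keep the sketch self-contained.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Open a UDP socket bound to 127.0.0.1:port (port 0 lets the kernel pick one).
// Returns the file descriptor, or -1 on failure.
int open_udp_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    sockaddr_in address {};
    address.sin_family = AF_INET;
    address.sin_port = htons(port);
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, reinterpret_cast<const sockaddr*>(&address), sizeof(address)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Block until one datagram arrives and return its size in bytes (-1 on error).
ssize_t receive_one(int fd, char* buffer, size_t capacity)
{
    return recvfrom(fd, buffer, capacity, 0, nullptr, nullptr);
}
```

The socket() call is the one that ends up in the kernel code discussed in the next section; recvfrom() is where the program eventually picks up whatever the kernel queued for it.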

The socket syscall

When a program for SerenityOS calls the socket syscall, the function Process::sys$socket() is called from Kernel/Process.cpp. This creates a socket and adds it to the fds of the process.

int Process::sys$socket(int domain, int type, int protocol)
{
    ...
    auto result = Socket::create(domain, type, protocol);
    if (result.is_error())
        return result.error();
    auto description = FileDescription::create(*result.value());
    ...
    m_fds[fd].set(move(description), flags);
    return fd;
}

The kind of socket created depends on the values of domain and type. In the case of a UDP socket, the domain will be AF_INET and the type SOCK_DGRAM. This calls IPv4Socket::create() which itself calls UDPSocket::create() in Kernel/Net/UDPSocket.cpp. Finally, this calls a chain of constructors: UDPSocket, IPv4Socket and Socket.

IPv4Socket::IPv4Socket(int type, int protocol)
    : Socket(AF_INET, type, protocol)
{
    ...
    LOCKER(all_sockets().lock());
    all_sockets().resource().set(this);
}

In the IPv4Socket constructor, the socket is added to a global list of sockets. We will be able to find that socket from the port number later when we receive packets from the NIC. Also, among other things, the receive buffer is initialized. This is used later to give access to packets in userspace. Now that we have a socket to receive packets for the process, let’s see all the work that is done to receive packets from the NIC. The more interesting part (to me anyway)!

Receiving packets from the NIC

There are a few steps to do before we can receive packets from the NIC. The operating system needs to detect and configure the card. It must also read its MAC address and configure interrupts to handle packet reception.

SerenityOS implements a driver for the e1000 ethernet card. This card is emulated by QEMU and you can see the source code here: hw/net/e1000.c. This seems to be a common card to support, since the page Intel Ethernet i217 on osdev.org contains code to make it work as well as a link to its datasheet.¹ Note that the e1000 card uses the PCI bus to communicate with the operating system and, in our case, uses interrupt number 11.

During the first stage of boot, SerenityOS configures, among other things, the Interrupt Descriptor Table (IDT), virtual memory and the PCI bus. The init() function eventually calls idt_init().

// Kernel/init.cpp
extern "C" [[noreturn]] void init()
{
    ...
    gdt_init();
    idt_init();
    ...
}

For all possible interrupts, an Interrupt Service Routine (ISR) is defined to handle events triggered by interrupts. Basically, when the interrupt 11 is triggered, the CPU will run the interrupt_11_asm_entry routine.

// Kernel/Arch/i386/CPU.cpp
void register_interrupt_handler(u8 index, void (*f)())
{
    s_idt[index].low = 0x00080000 | LSW((f));
    s_idt[index].high = ((u32)(f)&0xffff0000) | 0x8e00;
}

...

void idt_init()
{
    ...
    register_interrupt_handler(0x5b, interrupt_11_asm_entry);
    ...
}
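The two assignments in register_interrupt_handler() pack an x86 interrupt gate. Here is a small standalone sketch of that packing, assuming that LSW() returns the low 16 bits of its argument and that the handler address fits in 32 bits. GateDescriptor, pack_interrupt_gate() and handler_address_of() are names invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

struct GateDescriptor {
    uint32_t low;
    uint32_t high;
};

// Pack a 32-bit interrupt gate the same way register_interrupt_handler() does:
//   low  = code segment selector 0x08 in the upper half, handler bits 0..15 in the lower half
//   high = handler bits 16..31 in the upper half, flags 0x8e00 in the lower half
//          (present, ring 0, 32-bit interrupt gate)
constexpr GateDescriptor pack_interrupt_gate(uint32_t handler_address)
{
    return {
        0x00080000u | (handler_address & 0xffffu),
        (handler_address & 0xffff0000u) | 0x8e00u,
    };
}

// Recover the handler address from a packed gate.
constexpr uint32_t handler_address_of(GateDescriptor gate)
{
    return (gate.high & 0xffff0000u) | (gate.low & 0xffffu);
}
```

For a handler at address 0x12345678, this gives low = 0x00085678 and high = 0x12348e00.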

This routine is defined in Kernel/Arch/i386/ISRStubs.h with a macro, like so: GENERATE_GENERIC_INTERRUPT_HANDLER_ASM_ENTRY(11, 0x5b). The macro itself is defined in Kernel/Arch/i386/Interrupts.h:

#define GENERATE_GENERIC_INTERRUPT_HANDLER_ASM_ENTRY(interrupt_vector, isr_number) \
    extern "C" void interrupt_##interrupt_vector##_asm_entry();                    \
    asm(".globl interrupt_" #interrupt_vector "_asm_entry\n"                       \
        "interrupt_" #interrupt_vector "_asm_entry:\n"                             \
        "    pushw $" #isr_number "\n"                                             \
        "    pushw $0\n"                                                           \
        "    jmp interrupt_common_asm_entry\n");

This simply pushes the ISR number and 0 for the exception code on the stack and jumps to interrupt_common_asm_entry. This routine is defined as the following. I only kept the relevant part for brevity.

asm(
    ".globl interrupt_common_asm_entry\n"
    "interrupt_common_asm_entry: \n"
    "    pusha\n"
    "    pushl %ds\n"
    "    pushl %es\n"
    "    pushl %fs\n"
    "    pushl %gs\n"
    "    pushl %ss\n"
    "    mov $0x10, %ax\n"
    "    mov %ax, %ds\n"
    "    mov %ax, %es\n"
    "    cld\n"
    "    call handle_interrupt\n"
    ...
    "    iret\n");

Again, this simply pushes the content of some registers on the stack so that they can be accessed through the struct RegisterState. It then calls handle_interrupt(), where we are finally back in C++ land.

void handle_interrupt(RegisterState regs)
{
    clac();
    ++g_in_irq;
    ASSERT(regs.isr_number >= IRQ_VECTOR_BASE && regs.isr_number <= (IRQ_VECTOR_BASE + GENERIC_INTERRUPT_HANDLERS_COUNT));
    u8 irq = (u8)(regs.isr_number - 0x50);
    ASSERT(s_interrupt_handler[irq]);
    s_interrupt_handler[irq]->handle_interrupt(regs);
    s_interrupt_handler[irq]->increment_invoking_counter();
    --g_in_irq;
    s_interrupt_handler[irq]->eoi();
}

Ignoring the clac() call, handle_interrupt() finds the correct handler to run based on the isr_number that was pushed from assembly and runs it. In our case, the ISR number is 0x5b, so irq is equal to 11 (0x5b - 0x50). But where does s_interrupt_handler[11] get initialized? This is done in stage two of the boot process. Let’s take a look.

At the end of the init() function, a kernel thread is spawned to initiate init_stage2(). Here is a small snippet of this function.

void init_stage2()
{
    ...

    PCI::initialize();

    ...

    E1000NetworkAdapter::detect();
    ...
}

The PCI bus gets initialized first, which the e1000 card depends on. Finally, the operating system attempts to detect the e1000 card. Detection is done by enumerating the available PCI devices until one matches.

void E1000NetworkAdapter::detect()
{
    static const PCI::ID qemu_bochs_vbox_id = { 0x8086, 0x100e };

    PCI::enumerate([&](const PCI::Address& address, PCI::ID id) {
        if (address.is_null())
            return;
        if (id != qemu_bochs_vbox_id)
            return;
        u8 irq = PCI::get_interrupt_line(address);
        (void)adopt(*new E1000NetworkAdapter(address, irq)).leak_ref();
    });
}

The irq number (11 in our case) is read from the PCI configuration of the device. An E1000NetworkAdapter is created, which inherits from NetworkAdapter and PCI::Device. The latter inherits from IRQHandler, which itself inherits from GenericInterruptHandler. In this chain of inheritance, the irq is passed to GenericInterruptHandler, which finally registers the handler as seen in the code below.

GenericInterruptHandler::GenericInterruptHandler(u8 interrupt_number)
    : m_interrupt_number(interrupt_number)
{
    register_generic_interrupt_handler(InterruptManagement::acquire_mapped_interrupt_number(m_interrupt_number), *this);
}
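Putting registration and dispatch side by side, the mechanism can be sketched with a plain table of handler pointers. This is a simplified model and not the kernel's code: InterruptHandler, register_handler(), dispatch() and CountingHandler are hypothetical names, and the table size of 128 is an arbitrary choice for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint8_t irq_vector_base = 0x50; // the offset subtracted from isr_number

// Hypothetical stand-in for GenericInterruptHandler.
struct InterruptHandler {
    virtual ~InterruptHandler() = default;
    virtual void handle_interrupt() = 0;
};

// One slot per IRQ line; empty slots are nullptr.
static std::array<InterruptHandler*, 128> s_handlers {};

void register_handler(uint8_t irq, InterruptHandler& handler)
{
    assert(irq < s_handlers.size());
    s_handlers[irq] = &handler;
}

// Called with the vector number that the assembly stub pushed.
// Returns false when nothing is registered for that vector.
bool dispatch(uint8_t isr_number)
{
    if (isr_number < irq_vector_base)
        return false;
    uint8_t irq = static_cast<uint8_t>(isr_number - irq_vector_base);
    if (irq >= s_handlers.size() || !s_handlers[irq])
        return false;
    s_handlers[irq]->handle_interrupt();
    return true;
}

// A handler that only counts how often it ran.
struct CountingHandler final : InterruptHandler {
    int count = 0;
    void handle_interrupt() override { ++count; }
};
```

Registering a CountingHandler for irq 11 and then calling dispatch(0x5b) runs it once, which mirrors the path from interrupt_11_asm_entry to the driver.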

Before going back to the call s_interrupt_handler[irq]->handle_interrupt(regs); where we left off above, let’s see how the card gets configured in E1000NetworkAdapter’s constructor.

Configuring the E1000 Ethernet card

E1000NetworkAdapter::E1000NetworkAdapter(PCI::Address address, u8 irq)
    ...
{
    set_interface_name("e1k");

    ...

    detect_eeprom();
    klog() << "E1000: Has EEPROM? " << m_has_eeprom;
    read_mac_address();
    const auto& mac = mac_address();
    klog() << "E1000: MAC address: " << String::format("%b", mac[0]) << ":" << String::format("%b", mac[1]) << ":" << String::format("%b", mac[2]) << ":" << String::format("%b", mac[3]) << ":" << String::format("%b", mac[4]) << ":" << String::format("%b", mac[5]);

I am skipping a few lines here because I don’t know much about the PCI specification yet. As for the rest, the simplified version is that detect_eeprom() determines if the NIC has an eeprom. It does so by using special in* and out* instructions to read and write, respectively, to IO ports.² After detecting the eeprom, SerenityOS reads the MAC address of the NIC from it.

Then, we set the link up bit of the REG_CTRL register. If we look at the datasheet of the Intel Ethernet i217 card, this bit is not found, so it seems to be specific to QEMU’s emulated card. We also configure the Interrupt Throttling Register. According to page 180 of the datasheet, this register configures the minimum inter-interrupt delay in units of 256 nanoseconds, so 6000 × 256 ns gives 1,536,000 ns, or 1.536 ms, as commented in the code.

    u32 flags = in32(REG_CTRL);
    out32(REG_CTRL, flags | ECTRL_SLU);

    out16(REG_INTERRUPT_RATE, 6000); // Interrupt rate of 1.536 milliseconds

    initialize_rx_descriptors();
    ...
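The throttling arithmetic is easy to check with a few lines of code. This is only a sketch of the calculation; the function names are made up, and the 256 ns unit is the one quoted from the datasheet above.

```cpp
#include <cassert>
#include <cstdint>

// The throttling interval is expressed in units of 256 ns.
constexpr uint64_t itr_unit_ns = 256;

// Minimum delay between two interrupts, in nanoseconds.
constexpr uint64_t throttle_interval_ns(uint16_t register_value)
{
    return itr_unit_ns * register_value;
}

// Upper bound on interrupts per second for a given register value.
// Returns 0 for a register value of 0 to avoid dividing by zero.
constexpr uint64_t max_interrupts_per_second(uint16_t register_value)
{
    return register_value == 0 ? 0 : 1'000'000'000ull / throttle_interval_ns(register_value);
}

static_assert(throttle_interval_ns(6000) == 1'536'000, "6000 * 256 ns = 1.536 ms");
```

With a value of 6000, the card would raise at most about 651 interrupts per second.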

Finally, we initialize the rx descriptors.³ The first two lines configure the base address of the receive descriptor ring, which is how the card learns where in main memory it should write the packets as they arrive. We set REG_RXDESCLEN to the number of bytes allocated for the ring. We also set the REG_RXDESCHEAD and REG_RXDESCTAIL registers to point to the first and last descriptors in the ring. You can find more information about those last three registers on pages 189 and 190 of the datasheet. The very last line enables the receive function along with a few other flags, which you can learn more about on page 184.

void E1000NetworkAdapter::initialize_rx_descriptors()
{
    ...
    out32(REG_RXDESCLO, m_rx_descriptors_region->physical_page(0)->paddr().get());
    out32(REG_RXDESCHI, 0);
    out32(REG_RXDESCLEN, number_of_rx_descriptors * sizeof(e1000_rx_desc));
    out32(REG_RXDESCHEAD, 0);
    out32(REG_RXDESCTAIL, number_of_rx_descriptors - 1);

    out32(REG_RCTRL, RCTL_EN | RCTL_SBP | RCTL_UPE | RCTL_MPE | RCTL_LBM_NONE | RTCL_RDMTS_HALF | RCTL_BAM | RCTL_SECRC | RCTL_BSIZE_8192);
}
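To get a feel for the numbers involved, here is a sketch of the values written to the ring registers. It assumes the 16-byte legacy receive descriptor layout; RxDescriptor, RingRegisters and ring_registers_for() are illustrative names, not the kernel's. Unlike the snippet above, which writes 0 to the high register, the sketch splits a 64-bit address into both halves.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of a legacy receive descriptor (16 bytes).
struct RxDescriptor {
    uint64_t buffer_address; // physical address the NIC writes the packet to
    uint16_t length;         // filled in by the NIC
    uint16_t checksum;
    uint8_t status;          // bit 0 means "descriptor done"
    uint8_t errors;
    uint16_t special;
};
static_assert(sizeof(RxDescriptor) == 16, "descriptor must be 16 bytes");

// The values a driver would write to the five ring registers.
struct RingRegisters {
    uint32_t base_low;
    uint32_t base_high;
    uint32_t length_in_bytes;
    uint32_t head;
    uint32_t tail;
};

constexpr RingRegisters ring_registers_for(uint64_t ring_physical_address, uint32_t descriptor_count)
{
    return {
        static_cast<uint32_t>(ring_physical_address & 0xffffffffu),
        static_cast<uint32_t>(ring_physical_address >> 32),
        descriptor_count * static_cast<uint32_t>(sizeof(RxDescriptor)),
        0,                    // head: first descriptor
        descriptor_count - 1, // tail: last descriptor
    };
}
```

For a ring of 32 descriptors at physical address 0x1000, the length register would hold 512 and the tail register 31.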

To end E1000NetworkAdapter’s constructor, we configure the REG_INTERRUPT_MASK_SET register to enable interrupts when packets are received. We then clear REG_INTERRUPT_CAUSE_READ by reading it. This is not intuitive, but the register is read-to-clear: reading it resets it.⁴ We will see REG_INTERRUPT_CAUSE_READ again later; it tells the operating system which kind of interrupt was triggered. Finally, we enable the irq so that our CPU will run interrupt_11_asm_entry when the NIC receives data.

    ...

    out32(REG_INTERRUPT_MASK_SET, 0x1f6dc);
    out32(REG_INTERRUPT_MASK_SET, INTERRUPT_LSC | INTERRUPT_RXT0);
    in32(REG_INTERRUPT_CAUSE_READ);

    klog() << "E1000: irqcontroller model: " << InterruptManagement::the().get_responsible_irq_controller(m_interrupt_line)->model();
    enable_irq();
}
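The read-to-clear behavior can be modeled in a few lines. This is a software model for illustration only, with a made-up class name; the real register lives in the NIC and, as footnote 4 says, its behavior is a bit more complex.

```cpp
#include <cassert>
#include <cstdint>

class ReadToClearRegister {
public:
    // Device side: latch one or more cause bits.
    void raise(uint32_t bits) { m_value |= bits; }

    // Driver side: reading returns the pending bits and clears them.
    uint32_t read()
    {
        uint32_t pending = m_value;
        m_value = 0;
        return pending;
    }

private:
    uint32_t m_value { 0 };
};
```

After the device raises 0x04 and 0x80, the first read() returns 0x84 and the second returns 0. This is why a bare in32(REG_INTERRUPT_CAUSE_READ) whose result is discarded is enough to start from a clean state.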

Now that we have seen how the network card gets configured, we can go back to handling the reception of packets.

Handling interrupts from the NIC

We previously left off where handle_interrupt() is called on the handler registered for irq 11, which is our E1000NetworkAdapter. This method directly calls the method handle_irq().

void E1000NetworkAdapter::handle_irq(const RegisterState&)
{
    out32(REG_INTERRUPT_MASK_CLEAR, 0xffffffff);

First, we disable all interrupts from the NIC to make sure that new interrupts won’t be triggered.

Then, we read from the REG_INTERRUPT_CAUSE_READ register which we mentioned previously. Again, this register is cleared on read, and describes what caused the interrupt. We handle two possible interrupts:

status & 4: This checks for a link status change and sets the Set Link Up bit (ECTRL_SLU) in that case.
status & 0x80: This is the actual check for packet reception. It looks for the Receiver Timer Interrupt. In that case, the receive() method is called.

There is a third check, for a threshold value, but nothing is done in that case.

    u32 status = in32(REG_INTERRUPT_CAUSE_READ);
    if (status & 4) {
        u32 flags = in32(REG_CTRL);
        out32(REG_CTRL, flags | ECTRL_SLU);
    }
    if (status & 0x80) {
        receive();
    }
    if (status & 0x10) {
        // Threshold OK?
    }

At the end of the handle_irq() method, we re-enable the interrupts.

    ...

    out32(REG_INTERRUPT_MASK_SET, INTERRUPT_LSC | INTERRUPT_RXT0 | INTERRUPT_RXO);
}
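The bit tests in handle_irq() can be read as decoding a small bitfield. Here is a sketch of that decoding using the same three bit values; the constant and struct names are descriptive labels chosen for this sketch, not identifiers from the driver.

```cpp
#include <cassert>
#include <cstdint>

// Bit values as tested in handle_irq().
constexpr uint32_t cause_link_status_change = 0x04;
constexpr uint32_t cause_rx_threshold = 0x10;
constexpr uint32_t cause_rx_timer = 0x80;

struct InterruptCauses {
    bool link_status_changed;
    bool rx_threshold_reached;
    bool packet_received;
};

// Turn the raw value read from the cause register into named flags.
constexpr InterruptCauses decode_causes(uint32_t status)
{
    return {
        (status & cause_link_status_change) != 0,
        (status & cause_rx_threshold) != 0,
        (status & cause_rx_timer) != 0,
    };
}
```

A status of 0x84 decodes to a link status change plus a received packet, so both of the first two branches of handle_irq() would run for a single interrupt.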

Let’s look at the receive() method that we saw above, which is called when the NIC has received a packet and triggered an interrupt. This method finds new packets in the descriptor ring that was set up earlier, using the REG_RXDESCTAIL and REG_RXDESCHEAD registers to locate them.

void E1000NetworkAdapter::receive()
{
    auto* rx_descriptors = (e1000_tx_desc*)m_rx_descriptors_region->vaddr().as_ptr();
    u32 rx_current;
    for (;;) {
        rx_current = in32(REG_RXDESCTAIL) % number_of_rx_descriptors;
        if (rx_current == (in32(REG_RXDESCHEAD) % number_of_rx_descriptors))
            return;
        rx_current = (rx_current + 1) % number_of_rx_descriptors;
        if (!(rx_descriptors[rx_current].status & 1))
            break;
        auto* buffer = m_rx_buffers_regions[rx_current].vaddr().as_ptr();
        u16 length = rx_descriptors[rx_current].length;
        ASSERT(length <= 8192);
#ifdef E1000_DEBUG
        klog() << "E1000: Received 1 packet @ " << buffer << " (" << length << ") bytes!";
#endif
        did_receive(buffer, length);
        rx_descriptors[rx_current].status = 0;
        out32(REG_RXDESCTAIL, rx_current);
    }
}
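The head and tail bookkeeping is easier to follow with a software model of the ring. The sketch below simulates the device side in plain C++ and reuses the control flow of receive() on the driver side. SimulatedRing and its members are invented for illustration, the ring size of 8 is arbitrary, and real hardware does more than this model.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t ring_size = 8;
constexpr uint8_t descriptor_done = 1;

struct Descriptor {
    uint16_t length = 0;
    uint8_t status = 0;
};

struct SimulatedRing {
    std::array<Descriptor, ring_size> descriptors {};
    uint32_t head = 0;             // next slot the device will fill
    uint32_t tail = ring_size - 1; // last slot the driver handed back

    // Device side: store a packet in the slot at head and mark it done.
    // Returns false when the device has caught up with the tail.
    bool device_deliver(uint16_t length)
    {
        if (head == tail)
            return false;
        descriptors[head].length = length;
        descriptors[head].status = descriptor_done;
        head = (head + 1) % ring_size;
        return true;
    }

    // Driver side: same control flow as receive(), collecting packet lengths.
    std::vector<uint16_t> driver_receive()
    {
        std::vector<uint16_t> lengths;
        for (;;) {
            uint32_t current = tail % ring_size;
            if (current == head % ring_size)
                return lengths;
            current = (current + 1) % ring_size;
            if (!(descriptors[current].status & descriptor_done))
                break;
            lengths.push_back(descriptors[current].length);
            descriptors[current].status = 0; // hand the slot back to the device
            tail = current;
        }
        return lengths;
    }
};
```

If the device delivers three packets before the interrupt is handled, a single driver_receive() call drains all three and leaves the tail at slot 2.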

The did_receive() method is called with a pointer to the packet received. I have ignored some code here for simplicity’s sake. The packet is copied to a KBuffer which is then added to the packet queue m_packet_queue. Finally, the callback on_receive() is called if it was defined.

void NetworkAdapter::did_receive(const u8* data, size_t length)
{
    ...
    Optional<KBuffer> buffer;

    if (m_unused_packet_buffers.is_empty()) {
        buffer = KBuffer::copy(data, length);
    } else {
        ...
    }

    m_packet_queue.append(buffer.value());

    if (on_receive)
        on_receive();
}
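The producer side of this hand-off can be sketched with standard containers. This is a simplified, single-threaded model: Adapter and Packet are hypothetical names, std::vector stands in for KBuffer, and the reuse of unused packet buffers is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Packet = std::vector<uint8_t>;

class Adapter {
public:
    // Optional callback, invoked after every queued packet.
    std::function<void()> on_receive;

    // Producer side: copy the packet out of the driver's buffer and queue it.
    void did_receive(const uint8_t* data, size_t length)
    {
        m_packet_queue.emplace_back(data, data + length);
        if (on_receive)
            on_receive();
    }

    // Consumer side: copy the oldest packet into buffer.
    // Returns the number of bytes copied, or 0 when the queue is empty.
    size_t dequeue_packet(uint8_t* buffer, size_t capacity)
    {
        if (m_packet_queue.empty())
            return 0;
        Packet packet = std::move(m_packet_queue.front());
        m_packet_queue.pop_front();
        size_t size = std::min(packet.size(), capacity);
        std::copy_n(packet.begin(), size, buffer);
        return size;
    }

private:
    std::deque<Packet> m_packet_queue;
};
```

The callback only signals that something is available; the data itself stays in the queue until a consumer asks for it.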

So far, we have seen how the packet is stored in main memory by the NIC, copied to a kernel buffer and then inserted into a queue of packets. However, nothing is done to the packet unless something consumes it. We will now see who consumes the packet queue using the on_receive() callback.

Consuming the packet queue

At the end of init_stage2(), the operating system spawns the NetworkTask kernel thread. This runs the NetworkTask_main() function found in the file Kernel/Net/NetworkTask.cpp.

In that task, each NetworkAdapter has its on_receive() callback defined. The callback wakes up threads waiting on the WaitQueue packet_wait_queue.

void NetworkTask_main()
{
    WaitQueue packet_wait_queue;
    ...
    NetworkAdapter::for_each([&](auto& adapter) {
        ...
        adapter.on_receive = [&]() {
            pending_packets++;
            packet_wait_queue.wake_all();
        };
    });

The NetworkTask then enters an infinite loop that will forever dequeue packets from the NetworkAdapters and push them to the sockets. Thus, NetworkTask is the consumer of the packet queue. Note that the NetworkTask thread will sleep whenever there are no packets in the queue and only be awakened when packets are received.

    for (;;) {
        size_t packet_size = dequeue_packet(buffer, buffer_size);
        if (!packet_size) {
            Thread::current->wait_on(packet_wait_queue);
            continue;
        }
        ...
        auto& eth = *(const EthernetFrameHeader*)buffer;

        ...

        switch (eth.ether_type()) {
        case EtherType::ARP:
            handle_arp(eth, packet_size);
            break;
        case EtherType::IPv4:
            handle_ipv4(eth, packet_size);
            break;
        case EtherType::IPv6:
            // ignore
            break;
        default:
            klog() << "NetworkTask: Unknown ethernet type 0x" << String::format("%x", eth.ether_type());
        }
    }
}
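The switch on the EtherType can be tried outside the kernel with a few lines of code. The sketch below reads the two EtherType bytes from a raw frame and classifies it; the Action enum and the function names are made up, while the EtherType values are the standard ones (0x0800 for IPv4, 0x0806 for ARP, 0x86DD for IPv6).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class EtherType : uint16_t {
    IPv4 = 0x0800,
    ARP = 0x0806,
    IPv6 = 0x86DD,
};

enum class Action {
    HandleARP,
    HandleIPv4,
    Ignore,
    Unknown,
    TooShort,
};

// Destination MAC (6 bytes) + source MAC (6 bytes) + EtherType (2 bytes).
constexpr size_t ethernet_header_size = 14;

// The EtherType is stored big-endian in bytes 12 and 13 of the frame.
uint16_t read_ether_type(const uint8_t* frame)
{
    return static_cast<uint16_t>((frame[12] << 8) | frame[13]);
}

// Decide what to do with a frame, mirroring the switch in NetworkTask_main().
Action classify_frame(const uint8_t* frame, size_t size)
{
    if (size < ethernet_header_size)
        return Action::TooShort;
    switch (static_cast<EtherType>(read_ether_type(frame))) {
    case EtherType::ARP:
        return Action::HandleARP;
    case EtherType::IPv4:
        return Action::HandleIPv4;
    case EtherType::IPv6:
        return Action::Ignore;
    default:
        return Action::Unknown;
    }
}
```

A frame carrying our UDP packet has 0x08 0x00 in bytes 12 and 13, so it takes the IPv4 branch.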

Moving packet to userland

Continuing our UDP packet example, the packet will go through handle_ipv4(), then handle_udp(). As you can see from the code below, this function finds the socket that is responsible for the UDP packet, based on the destination port, and copies the packet to that socket.

void handle_udp(const IPv4Packet& ipv4_packet)
{
    ...
    auto& udp_packet = *static_cast<const UDPPacket*>(ipv4_packet.payload());
    ...
    auto socket = UDPSocket::from_port(udp_packet.destination_port());
    if (!socket) {
        klog() << "handle_udp: No UDP socket for port " << udp_packet.destination_port();
        return;
    }

    ASSERT(socket->type() == SOCK_DGRAM);
    ASSERT(socket->local_port() == udp_packet.destination_port());
    socket->did_receive(ipv4_packet.source(), udp_packet.source_port(), KBuffer::copy(&ipv4_packet, sizeof(IPv4Packet) + ipv4_packet.payload_size()));
}
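The destination_port() and source_port() accessors read fields of the UDP header, which is four big-endian 16-bit values. Here is a standalone sketch of that parsing; UdpHeader and parse_udp_header() are illustrative names and not the kernel's UDPPacket class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct UdpHeader {
    uint16_t source_port;
    uint16_t destination_port;
    uint16_t length; // header plus payload, in bytes
    uint16_t checksum;
};

constexpr size_t udp_header_size = 8;

// Parse the 8-byte UDP header at the start of data.
// Returns false when there are not enough bytes for a header.
bool parse_udp_header(const uint8_t* data, size_t size, UdpHeader& header)
{
    if (size < udp_header_size)
        return false;
    // Every field is a 16-bit value in network (big-endian) byte order.
    auto read_be16 = [data](size_t offset) {
        return static_cast<uint16_t>((data[offset] << 8) | data[offset + 1]);
    };
    header.source_port = read_be16(0);
    header.destination_port = read_be16(2);
    header.length = read_be16(4);
    header.checksum = read_be16(6);
    return true;
}
```

For example, the header bytes 30 39 1f 90 00 0c 00 00 describe a datagram from port 12345 to port 8080 with a 4-byte payload.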

The socket object is a UDPSocket, which inherits from the IPv4Socket class in Kernel/Net/IPv4Socket.cpp. Its did_receive() method is then called, which eventually adds the packet to the socket’s receive queue. Note that the queue is protected by a lock because it can be accessed by multiple threads in parallel.

bool IPv4Socket::did_receive(const IPv4Address& source_address, u16 source_port, KBuffer&& packet)
{
    LOCKER(lock());
    ...
    if (buffer_mode() == BufferMode::Bytes) {
        ...
    } else {
        m_receive_queue.append({ source_address, source_port, move(packet) });
        m_can_read = true;
    }
    ...
    return true;
}
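The last two steps, looking up the socket by port and appending to its receive queue under a lock, can be sketched as follows. This is a model built from standard library pieces and not the kernel's implementation: DatagramSocket, PortTable and deliver() are hypothetical names, and std::mutex stands in for the kernel's lock.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

struct Datagram {
    uint32_t source_address;
    uint16_t source_port;
    std::vector<uint8_t> payload;
};

class DatagramSocket {
public:
    // Called from the network task: queue the datagram under the lock.
    void did_receive(Datagram&& datagram)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        m_receive_queue.push_back(std::move(datagram));
    }

    bool can_read()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return !m_receive_queue.empty();
    }

    // Called on behalf of the process: take the oldest datagram, if any.
    bool take(Datagram& out)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_receive_queue.empty())
            return false;
        out = std::move(m_receive_queue.front());
        m_receive_queue.pop_front();
        return true;
    }

private:
    std::mutex m_lock;
    std::deque<Datagram> m_receive_queue;
};

class PortTable {
public:
    void bind(uint16_t port, DatagramSocket& socket) { m_sockets_by_port[port] = &socket; }

    // Returns nullptr when no socket is bound to the port.
    DatagramSocket* from_port(uint16_t port) const
    {
        auto it = m_sockets_by_port.find(port);
        return it == m_sockets_by_port.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<uint16_t, DatagramSocket*> m_sockets_by_port;
};

// Deliver a datagram to whichever socket is bound to the destination port.
bool deliver(const PortTable& table, uint16_t destination_port, Datagram&& datagram)
{
    DatagramSocket* socket = table.from_port(destination_port);
    if (!socket)
        return false;
    socket->did_receive(std::move(datagram));
    return true;
}
```

A datagram for a port nobody is bound to is simply dropped, which corresponds to the "No UDP socket for port" log line in handle_udp().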

That’s finally it, our packet can now be delivered to userspace!

¹ The odd thing is that the e1000 driver does not conform exactly to the Intel Ethernet i217 card, since that card does not have an eeprom, but QEMU’s e1000 card does.
² I say special because those instructions are only accessible by the kernel when in Protected Mode.
³ The tx descriptors are also configured, but for this blog I’m only interested in the receiving end.
⁴ It’s a bit more complex but this is enough to know.
