Introduction to Operating Systems: A Hands-On Approach Using the OpenSolaris Project
Part No: 819-5580-13

March, 2008

4150 Network Circle
Santa Clara, CA 95054
U.S.A.

Copyright 2008 SMI

Sun Microsystems, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more U.S. patents or pending patent applications in the U.S. and in other countries.

U.S. Government Rights – Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.

This distribution may include materials developed by third parties.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd.

Sun, Sun Microsystems, the Sun logo, the Solaris logo, the Java Coffee Cup logo, docs.sun.com, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.

Products covered by and information contained in this publication are controlled by U.S. Export Control laws and may be subject to the export or import laws in other countries. Nuclear, missile, chemical or biological weapons or nuclear maritime end uses or end users, whether direct or indirect, are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion lists, including, but not limited to, the denied persons and specially designated nationals lists is strictly prohibited.

DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Module 1 What is the OpenSolaris Project?

Objectives

The objective of this course is to learn about operating system computing by using the Solaris™ Operating System source code that is freely available through the OpenSolaris project.


Tip - To receive a free OpenSolaris Starter Kit that includes training materials, source code, and developer tools, register online at http://get.opensolaris.org.


We'll start by showing you the user groups, portals, and documentation you will use to get started with UNIX® computing. Next, we'll show you where to go to access the code, communities, discussions, projects, and source browser for the OpenSolaris project. Then, we'll give you steps to configure zones, ZFS, networking, and the environment. Finally, we'll demonstrate debugging processes, applications, page faults, and device drivers with DTrace in the lab exercises.


Note - Go to http://opensolaris.org/os/community/edu/curriculum_development/general_test/ to download the scripts, source code, and makefile for CCtest used later in this document to demonstrate C++ debugging with DTrace.


The OpenSolaris project was launched on June 14, 2005 to create a community development effort using the Solaris OS code as a starting point. It is a nexus for a community development effort where contributors from Sun and elsewhere can collaborate on developing and improving operating system technology. The OpenSolaris source code will find a variety of uses, including being the basis for future versions of the Solaris OS product, other operating system projects, third-party products and distributions of interest to the community. The OpenSolaris project is currently sponsored by Sun Microsystems, Inc.

In the project's first two years, more than 60,000 participants became registered members. The engineering community is continually growing and changing to meet the needs of developers, system administrators, and end users of the Solaris Operating System.

Teaching with the OpenSolaris project provides the following advantages over instructional operating systems:

  • Access to code for the revolutionary technologies in the Solaris 10 operating system

  • Access to code for a commercial OS that is used in many environments and that scales to large systems

  • Superior observability and debugging tools

  • Hardware platform support including SPARC, x86 and x64 architectures

  • Leadership on 64-bit computing

  • No charge ($0.00) for an unlimited right to use

  • Free, exciting, innovative, complete, seamless, and rock-solid code base

  • Availability under the OSI-approved Common Development and Distribution License (CDDL), which allows royalty-free use, modification, and derived works

Country Portals

The Internationalization and Localization Community is helping to translate the OpenSolaris English web site into many languages. So far, eight country portals are under development.

Portals for Germany, Russia, the Czech Republic, Spain, Korea, and Mexico are planned. See the OpenSolaris Portals project to get involved, or chat in one of the seven OpenSolaris chat rooms using IRC at irc.freenode.net. See http://opensolaris.org/os/chat/.

Web Resources for OpenSolaris

You can download the OpenSolaris source, view the license terms, and access instructions for building the source and installing the pre-built archives at http://www.opensolaris.org/os/downloads.

The icons in the upper-right of the OpenSolaris web pages link you to discussions, communities, projects, downloads, and source browser resources.

In addition, the OpenSolaris web site provides search across all of the site content and aggregated blogs.

Discussions

Discussions provide you with access to the experts who are working on new open source technologies. Discussions also provide an archive of previous conversations that you can reference for answers to your questions. See http://www.opensolaris.org/os/discussions for the complete list of forums to which you can subscribe.

Communities

Communities provide connections to other participants with similar interests in the OpenSolaris project. Communities form around interest groups, technologies, support, tools, and user groups, for example:

Academic and Research

http://www.opensolaris.org/os/community/edu

DTrace

http://www.opensolaris.org/os/community/dtrace

ZFS

http://www.opensolaris.org/os/community/zfs

Networking

http://www.opensolaris.org/os/community/networking

Zones

http://www.opensolaris.org/os/community/zones

Documentation

http://www.opensolaris.org/os/community/documentation

Device Drivers

http://www.opensolaris.org/os/community/device_drivers

Tools

http://www.opensolaris.org/os/community/tools

Advocates

http://www.opensolaris.org/os/community/advocacy

Security

http://www.opensolaris.org/os/community/security

Performance

http://www.opensolaris.org/os/community/performance

Storage

http://www.opensolaris.org/os/community/storage

System Administrators

http://www.opensolaris.org/os/community/sysadmin

These are only a few of the 30 communities actively working on OpenSolaris. See http://www.opensolaris.org/os/communities for the complete list.

Sun intends to have non-Sun contributors and wants to foster collaborations across industrial and academic affiliations.

The OpenSolaris project will empower and expand the existing Solaris community.

The OpenSolaris project will allow for the creation of entirely new communities.

Projects

Projects hosted on the http://www.opensolaris.org/ web site are collaborative efforts that produce objects such as code changes, documents, graphics, or joint-authored products. Projects have code repositories and committers and may live within a community or independently.

Participants initiate new projects by posting a request to the discussions. Projects that are submitted and accepted by at least one other interested participant in the sponsoring community are given space on the projects page to get started. See http://www.opensolaris.org/os/projects for the current list of new projects.

Projects give you the opportunity to share files, disk space, and an email alias.

You can collaborate with other engineers across the globe to work on a new technology or an improvement to existing technology.

Chime Visualization Tool for DTrace

http://www.opensolaris.org/os/project/dtrace-chime

Google Summer of Code

http://www.opensolaris.org/os/project/powerPC

Indiana

http://www.opensolaris.org/os/project/indiana

OpenGrok

http://www.opensolaris.org/os/project/opengrok

Programming Contest

http://www.opensolaris.org/os/project/contest

Starter Kit

http://www.opensolaris.org/os/project/starterkit

Solaris iSCSI Target

http://www.opensolaris.org/os/project/iscsitgt

Source Repositories

Centralized and distributed source repositories are hosted on the opensolaris.org web site. The centralized source management model uses the Subversion (SVN) source control management program. Repositories managed in a distributed fashion use the Mercurial (hg) source control management program.

A Project Leader creates a source repository on opensolaris.org by using the Project web pages. Developers with commit rights access repositories through their opensolaris.org accounts. Commit rights are managed by Project Leaders. If you need an account, you can sign up for one. You will also have to provide a Secure Shell (SSH) public key. Refer to the tools community for the most recent source control information, downloads, and instructions: http://opensolaris.org/os/community/tools

OpenGrok

OpenGrok™ is the fast and usable source code search and cross reference engine used in OpenSolaris. See http://cvs.opensolaris.org/source to try it out!

The first project to be hosted on opensolaris.org was OpenGrok. See http://www.opensolaris.org/os/project/opengrok to find out about the ongoing development project.

Take an online tour of the source and you'll discover cleanly written, extensively commented code that reads like a book. If you're interested in working on an OpenSolaris project, you can download the complete codebase. If you just need to know how some features work in the Solaris OS, the source code browser provides a convenient alternative. OpenGrok understands various program file formats and version control histories like SCCS, RCS, and CVS, so that you can better understand the open source.

Module 2 OpenSolaris Advocacy

Objectives

The Advocates Community exists to help people around the world get involved in the OpenSolaris Community. We welcome participation from people of all languages and cultures and people with all levels of technical and non-technical skills. Everyone has something to contribute.

See http://opensolaris.org/os/community/advocacy/

In the Advocates community you will find independent user group projects, presentations, news, articles, blogs, technical & non-technical content, videos and podcasts, events and conferences, community metrics, swag, badges, buttons, and a variety of other promotional projects.

Why Use OpenSolaris?

This section describes practical reasons to choose OpenSolaris as your development platform.

Price

Since the availability of the Solaris 10 Operating System in January 2005, its popularity has exploded. As of July 2007, in excess of 8.7 million copies were registered, more than all of the previous versions of the Solaris OS combined. Further fueling this frenzy was the release of OpenSolaris in June 2005. Given this surge in the number of users, more developers (commercial and open-source alike) are seeing the Solaris operating system as a viable target for their software.

One of the reasons the Solaris OS enjoyed a huge popularity boost was its price: $0 for everyone, for any use (commercial and non-commercial), on any machine (using both SPARC and x86 platforms). Another reason was Sun's promise, since fulfilled, to make the Solaris source code available under an OSI-approved open-source license, the Common Development and Distribution License (CDDL).

Innovative Core Features

However, the most important reason for the popularity of the Solaris OS is arguably the vast wealth of features it offers. In no particular order, these include the following:

  • Solaris Zones – Provide the ability to partition a machine into numerous virtual machines, each of which is isolated from the others.

  • DTrace – A comprehensive dynamic tracing tool for investigating system behavior, safely on production machines.

  • New IP stack – Provides vastly increased performance.

  • ZFS – A 128-bit, state-of-the-art file system, with end-to-end error checking and correction, a simple command-line interface, and virtually limitless storage capacity.

Backward Compatibility

All of these features build on what long-time Solaris OS users have come to expect: rock-solid stability, huge scalability, high performance, and guaranteed backwards compatibility. The last of these is especially important to commercial software developers, because maintenance is usually the largest expense associated with a piece of software. With its backwards compatibility guarantee, software vendors know that (provided they use only published APIs) software built for Solaris N will run correctly on Solaris N+1 and subsequent versions. (Contrast this with some other operating systems, where incompatible changes to system components -- for example, libraries -- are made without regard to the effect on applications. The net effect is application breakage, resulting in increased maintenance costs and frustration for application vendors and users.)

Hardware Platform Neutrality

The preceding paragraphs give some reasons why we should develop for the Solaris OS, but there are additional reasons to develop on the Solaris platform. One is that Solaris is a multi-platform OS, supporting both SPARC and x86 architectures (a community-driven port to Power is in the works, too). Although there was an issue a few years ago with the Solaris OS for x86 platforms, the fact that Sun has introduced a range of AMD-based servers and workstations demonstrates the company's commitment to x86 technology.

From the developer's perspective, the Solaris versions for SPARC and x86 platforms have the same feature set and APIs. This means that developers can concentrate on the other issues endemic to cross-platform development, like CPU endianness. The SPARC platform is big-endian and x86 is little-endian, so an application that is developed and tested on both Solaris platforms has a high probability of being free from endian-related problems. The Solaris OS also supports both 32-bit and 64-bit applications on both platforms, thus helping to eliminate bugs due to assumptions about word size.
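
To make the endianness difference concrete, the following sketch reports the byte order of the machine it runs on. It is not part of the original course material: the function name host_byte_order is hypothetical, and the sketch assumes a POSIX shell with the standard printf, od, and tr utilities.

```shell
#!/bin/sh
# Sketch: report the native byte order of the host.
# host_byte_order is a hypothetical helper name, not a Solaris command.
host_byte_order() {
    # Emit the bytes 0x01 0x00 and have od decode them as a single
    # 16-bit unsigned integer in the host's native byte order.
    value=$(printf '\001\000' | od -An -tu2 | tr -d ' \t\n')
    case "$value" in
        1)   echo "little-endian" ;;   # low-order byte stored first (x86)
        256) echo "big-endian" ;;      # high-order byte stored first (SPARC)
        *)   echo "unknown" ;;
    esac
}

host_byte_order
```

Running the same script on a SPARC system and on an x86 system should print different results, which is a simple reminder that binary data written on one platform cannot be read back byte-for-byte as integers on the other.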

Perhaps the most compelling reason to develop software on the Solaris OS is the wealth of professional-grade development tools available for it.

Development Tools

One of the most important features of an OS from a developer's point of view is the variety and quality of the development tools available. Compilers and debuggers are the most obvious examples of these tools, but other examples include code checkers (to ensure that our code is free from subtle errors the compiler might not catch), cross-reference generators (to see which functions reference other functions and variables), and performance analyzers.

The Sun Studio suite is the product of choice for Solaris OS developers. Available as a free download from the http://developers.sun.com web site, Sun Studio software is a collection of professional-grade compilers and tools. It includes C, C++, and FORTRAN compilers, code analysis tools, an integrated development environment (IDE), the dbx source-level debugger, and editors. Other tools included with Studio software are cscope (an interactive source browser), ctrace (a tool to generate a self-tracing version of our programs), cxref (a C program cross-referencer), dmake (for distributed parallel makes), and lint (the C program checker).

The Solaris OS ships with the GNU C compiler, gcc, and its companion source-level debugger, gdb. The Solaris OS also ships with the very powerful modular debugger, mdb. However, mdb is not a source-level debugger. It is most useful when we are debugging kernel code, or performing post-mortem analysis on programs for which the source is not available. See the Solaris Modular Debugger Guide and Solaris Performance and Tools by McDougall, Mauro, and Gregg for more information about mdb.

Acknowledgments

The following members of the OpenSolaris Community reviewed and provided feedback on this document:

  • Boyd Adamson

  • Pradhap Devarajan

  • Alan Coopersmith

  • Brian Gupta

  • Rainer Heilke

  • Eric Lowe

  • Ben Rockwood

  • Cindy Swearingen

The following OpenSolaris community members provided excellent new content:

  • Dong-Hai Han

  • Narayana Janga

  • Shivani Khosa

  • Rich Teer

  • Sunay Tripathi

  • Yifan Xu

Many thanks also go to Steven Cogorno, David Comay, Teresa Giacomini, Stephen Hahn, Patrick Finch, and Sue Weber for their work to make the initial version possible.

To participate in future reviews of this document, use the instructions at the following URL:

http://www.opensolaris.org/os/community/documentation/reviews

Module 3 Planning the OpenSolaris Environment

Objectives

The objective of this module is to understand the system requirements, support information, and documentation available for the OpenSolaris project installation and configuration.

Additional Resources

Development Environment Configuration

There is no substitute for hands-on experience with operating system code and direct access to kernel modules. The unique challenges of kernel development and access to root privileges for a system are made simpler by the tools, forums, and documentation provided for the OpenSolaris project.


Tip - To receive an OpenSolaris Starter Kit that includes training materials, source code, and developer tools, go to http://get.opensolaris.org.


The OpenSolaris 64-bit kernel is seamless: 32-bit applications run unmodified on it.

One may alternate between the 32-bit and 64-bit kernel with only a reboot.

All of the architectures supported by the OpenSolaris project are built from the same source code base. The 64-bit kernel isn't a separate version or variant of the system.

32-bit and 64-bit applications and libraries coexist seamlessly.

Consider the following features of OpenSolaris as you plan your development environment:

Table 3-1 Configurable Lab Component Support

Configurable Component

Support From the OpenSolaris Project

Hardware

OpenSolaris supports systems that use the SPARC® and x86 families of processor architectures: UltraSPARC®, SPARC64, AMD64, Pentium, and Xeon EM64T. For supported systems, see the Solaris OS Hardware Compatibility List at http://www.sun.com/bigadmin/hcl.

Source files

See http://opensolaris.org/os/downloads for detailed instructions about how to build from source.

Install images

Pre-built OpenSolaris distributions are limited to the Solaris Express: Community Edition [DVD Version], Build 32 or newer, Solaris Express: Developer Edition, Nexenta, Schillix, Martux and Belenix.

For the OpenSolaris kernel with the GNU user environment, try http://www.gnusolaris.org/gswiki/Download-form.

BFU archives

The on-bfu-DATE.PLATFORM.tar.bz2 file is provided if you are installing from pre-built archives.

Build tools

The SUNWonbld-DATE.PLATFORM.tar.bz2 file is provided if you build from source.

Compilers and tools

Sun Studio 11 compilers and tools are freely available for use by OpenSolaris developers. See http://www.opensolaris.org/os/community/tools/sun_studio_tools/ for instructions about how to download and install the latest versions. Also, refer to http://www.opensolaris.org/os/community/tools/gcc for the gcc community.

Memory/Disk Requirements

  • Memory requirement: 256 MB minimum (text installer only), 1 GB recommended

  • Memory requirement: 768 MB minimum for the Solaris Express Developer Edition installer

  • Disk space requirement: 350 MB

Virtual OS environments

Zones and Branded Zones in OpenSolaris provide protected and virtualized operating system environments within an instance of Solaris, allowing one or more processes to run in isolation from other activity on the system.

OpenSolaris supports Xen, an open-source virtual machine monitor developed by the Xen team at the University of Cambridge Computer Laboratory. See http://www.opensolaris.org/os/community/xen/ for details and links to the Xen project.

OpenSolaris also runs as a VMware™ guest. See http://www.opensolaris.org/os/project/content for a recent article describing how to get started.

Zones address a common problem: machines are underutilized, and virtualization with Zones can increase utilization. Each zone looks, feels, and smells like its own machine; you can even reboot a zone!

Most other virtualization technologies virtualize at the hardware layer.

Zones are a new facility in OpenSolaris that instead virtualizes at the operating system layer.

Refer to Module 4 for more information about how Zones and Branded Zones enable kernel and user mode development of Solaris and Linux applications without impacting developers in separate zones.

Participation in the OpenSolaris project can improve overall performance across your network with the latest technologies. Your lab environment becomes self-sustaining when hosted on OpenSolaris because you are always running the latest and greatest environment, empowered to update it yourself.

Networking

The OpenSolaris project meets future networking challenges by radically improving your network performance without requiring changes to your existing applications.

  • Speeds application performance by about 50 percent by using an enhanced TCP/IP stack

  • Supports many of the latest networking technologies, such as 10 Gigabit Ethernet, wireless networking, and hardware offloading

  • Accommodates high-availability, streaming, and Voice over IP (VoIP) networking features through extended routing and protocol support

  • Supports current IPv6 specifications

Find out more about ongoing networking developments from the OpenSolaris Networking Community: http://www.opensolaris.org/os/community/networking.

Network Auto-Configuration Daemon

The boot process in the Solaris Express Developer Edition 5/07 release runs the nwamd daemon. This daemon implements an alternate instance of the SMF service, svc:/network/physical, which enables automated network configuration with minimal intervention.

The nwamd daemon monitors the Ethernet port and automatically enables DHCP on the appropriate IP interface. If no cable is plugged into a wired network, the nwamd daemon conducts a wireless scan and prompts the user to choose a WiFi access point to connect to.

You don't need to spend extensive amounts of time manually configuring the interfaces on your systems. Automatic configuration also aids you in administration, because you can reconfigure network addresses with minimal intervention.

To view your NWAM status, type the following command in a terminal window.

# svcs nwam
STATE          STIME    FMRI
online         11:29:50 svc:/network/physical:nwam

The OpenSolaris Network Auto-Magic Phase 0 page and nwamd man page contain further details, including instructions for turning off the nwamd daemon, if preferred. For more information and a link to the nwamd(1M) man page, see http://www.opensolaris.org/os/project/nwam.
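
If a script needs to act on the NWAM status rather than display it, the STATE column can be extracted from the svcs output. The following is a minimal sketch under that assumption; the function name service_state is hypothetical, and the sample input is the output shown above rather than a live query.

```shell
#!/bin/sh
# Sketch: extract the STATE column for one service instance from
# svcs output read on standard input.
# service_state is a hypothetical helper name.
service_state() {
    # $1 is the full FMRI to look for, e.g. svc:/network/physical:nwam.
    # Fields are whitespace-separated: STATE, STIME, FMRI.
    awk -v fmri="$1" '$3 == fmri { print $1 }'
}

# On a live system this would be:
#   svcs nwam | service_state svc:/network/physical:nwam
# Here the sample output from this section is supplied directly.
service_state svc:/network/physical:nwam <<'EOF'
STATE          STIME    FMRI
online         11:29:50 svc:/network/physical:nwam
EOF
# prints: online
```

The function prints nothing when the requested instance is not listed, so a caller can treat an empty result as "service not found".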

Zone Overview

A zone can be thought of as a container in which one or more applications run isolated from all other applications on the system. Most software that runs on OpenSolaris will run unmodified in a zone. Since zones do not change the OpenSolaris Application Programming Interfaces (APIs) or Application Binary Interface (ABI), recompiling an application is not necessary in order to run it inside a zone.

Zones Administration

Zone administration consists of the following commands:

  • zonecfg – Creates and configures zones (adds resources and properties), and stores the configuration in a private XML file under /etc/zones.

  • zoneadm – Performs administrative steps for zones such as list, install, (re)boot, and halt.

  • zlogin – Allows a user to log in to the zone to perform maintenance tasks.

  • zonename – Displays the current zone name.

The following global scope properties are used with zones:

  • zonepath – Path in the global zone to the root directory under which the zone will be installed

  • autoboot – Whether the zone boots automatically when the global zone boots

  • pool – Resource pool to which the zone should be bound

Resources may include any of the following types:

  • fs – File system

  • inherit-pkg-dir – Directory that has its associated packages inherited from the global zone

  • net – Network device

  • device – Devices

Getting Started With Zones Administration

This lab exercise will introduce you to creating zones.

Summary

This exercise uses detailed examples to help you understand the process of creating, installing, and booting a zone.


Note - This procedure does not apply to an lx branded zone.


To Create, Install, and Boot a Zone
  1. Use the following example to configure your new zone:

    Note - The following example uses a shared-IP stack, which is the default for a zone.


    # zonecfg -z Apache
    Apache: No such zone configured
    Use 'create' to begin configuring a new zone.
    zonecfg:Apache> create
    zonecfg:Apache> set zonepath=/export/home/Apache
    zonecfg:Apache> add net
    zonecfg:Apache:net> set address=192.168.0.50
    zonecfg:Apache:net> set physical=bge0
    zonecfg:Apache:net> end
    zonecfg:Apache> verify
    zonecfg:Apache> commit
    zonecfg:Apache> exit
  2. Use the following example to install your new zone:
    # zoneadm -z Apache install
    Preparing to install zone <Apache>.
    Creating list of files to copy from the global zone.
    Copying <6029> files to the zone.
    Initializing zone product registry.
    Determining zone package initialization order.
    Preparing to initialize <1038> packages on the zone.
    Initialized <1038> packages on zone.
    Zone <Apache> is initialized.
    Installation of these packages generated warnings: ....
    The file </export/home/Apache/root/var/sadm/system/logs/install_log> 
    contains a log of the zone installation. 

    The necessary directories are created. The zone is ready for booting.

  3. View the directories:
    # ls /export/home/Apache/root
    bin       etc       home      mnt       platform  sbin      tmp       var
    dev       export    lib       opt       proc      system    usr

    Packages are not reinstalled.

    # /etc/mount
    /export/home/Apache/root/lib on /lib read only
    /export/home/Apache/root/platform on /platform read only
    /export/home/Apache/root/sbin on /sbin read only
    /export/home/Apache/root/usr on /usr read only
    /export/home/Apache/root/proc on proc read/write/setuid/nodevices/zone=Apache
  4. Boot the zone.
    # ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL>
     mtu 8232 index 1 inet 127.0.0.1 netmask ff000000
    bge0: flags=1004803<UP,BROADCAST,MULTICAST,DHCP,IPv4> mtu 1500 index 2
            inet 192.168.0.4 netmask ffffff00 broadcast 192.168.0.255
            ether 0:c0:9f:61:88:c9
    # zoneadm -z Apache boot
    # ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL>
     mtu 8232 index 1 inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL>
     mtu 8232 index 1 zone Apache inet 127.0.0.1
    bge0: flags=1004803 inet 192.168.0.4 netmask ffffff00 broadcast
     192.168.0.255 ether 0:c0:9f:61:88:c9
    bge0:1: flags=1000803 mtu 1500 index 2 zone Apache inet
     192.168.0.50 netmask ffffff00 broadcast 192.168.0.255
  5. Configure the zone and log in:
    # zlogin -C Apache
    [Connected to zone 'Apache' pts/5]
    # ifconfig -a
    lo0:2: flags=2001000849  mtu 8232 index 1 inet 127.0.0.1 
    netmask ff000000
    bge0:2: flags=1000803 inet 192.168.0.50 netmask ffffff00
     broadcast 192.168.0.255
    # ping -s 192.168.0.50
    64 bytes from 192.168.0.50: icmp_seq=0. time=0.146 ms
    # exit
    [Connection to zone 'Apache' pts/5 closed]
Web Server Virtualization With Zones

Each zone has its own characteristics, such as its zone name, IP addresses, host name, naming services, and root and non-root users. By default, the OS runs in the global zone. The administrator can virtualize the execution environment by defining one or more non-global zones. Network services can then run in separate zones, which limits the damage possible in the event of a security violation. Because zones are implemented in software, they are not limited to the granularity defined by hardware boundaries. Instead, zones offer sub-CPU granularity.

Creating Non-Global Zones

This lab exercise will demonstrate how to support two different sets of web server user groups on one physical host.

Summary

Simultaneous access to both web servers will be configured so that each web server and system will be protected should one become compromised.

Creating Two Non-Global Zones
  1. Create non-global zone Apache1:
    # zonecfg -z Apache1 info
    zonepath: /export/home/Apache1
    autoboot: false
    pool:
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address: 192.168.0.100/24
            physical: bge0
  2. Create non-global zone Apache2:
    # zonecfg -z Apache2 info
    zonepath: /export/home/Apache2
    autoboot: false
    pool:
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address: 192.168.0.200/24
            physical: bge0
  3. Log in to Apache1 and install the application:
    # zlogin Apache1
    # zonename
    Apache1
    # ls /Apachedir
    apache_1.3.9    apache_1.3.9-i86pc-sun-solaris2.270.tar   
    # cd /Apachedir/apache_1.3.9 ; ./install-bindist.sh /local
    You now have successfully installed the Apache 1.3.9 HTTP server.
  4. Log in to Apache2 and install the application:
    # zlogin Apache2
    # zonename
    Apache2
    # ls /Apachedir
    httpd-2.0.50     httpd-2.0.50-i386-pc-solaris2.8.tar
    # cd /Apachedir/httpd-2.0.50; ./install-bindist.sh /local
    You now have successfully installed the Apache 2.0.50 HTTP server.
  5. Start the Apache1 application:
    # zonename
    Apache1
    # hostname
    Apache1zone
    # /local/bin/apachectl start
    /local/bin/apachectl start: httpd started
  6. Start the Apache2 application:
    # zonename
    Apache2
    # hostname
    Apache2zone
    # /local/bin/apachectl start
    /local/bin/apachectl start: httpd started
  7. In the global zone, edit the /etc/hosts file:
    #  cat /etc/hosts
    #
    # Internet host table
    #
    127.0.0.1       localhost 
    192.168.0.1     loghost
    192.168.0.100   Apache1zone
    192.168.0.200   Apache2zone
  8. Open a web browser and navigate to the following URL:

    http://apache1zone/manual/index.html

    The Apache1 web server is up and running.

  9. Open a web browser and navigate to the following URL:

    http://apache2zone/manual/

    The Apache2 web server is up and running.

Discussion

The end user sees each zone as a different system. Each web server has its own name service configuration:

  • /etc/nsswitch.conf

  • /etc/resolv.conf

An attack on one web server is contained within that zone. Port conflicts are no longer a problem, because each zone has its own IP address and its own port space.

Creating ZFS Storage Pools and File Systems

Each ZFS storage pool is composed of one or more virtual devices, which describe the layout of physical storage and its fault characteristics.

The most basic building block for a storage pool is a piece of physical storage. This can be any block device of at least 128 Mbytes in size. Typically, this is a hard drive that is visible to the system in the /dev/dsk directory.

A storage device can be a whole disk (c0t0d0) or an individual slice (c0t0d0s7). The recommended mode of operation is to use an entire disk, in which case the disk does not need to be specially formatted. ZFS formats the disk using an EFI label to contain a single, large slice.

In this module, we'll start by learning about mirrored storage pool configuration. Then we'll show you how to create a RAID-Z configuration.

ZFS uses the concept of storage pools to manage physical storage. Historically, file systems were constructed on top of a single physical device. To address multiple devices and provide for data redundancy, the concept of a volume manager was introduced to provide the image of a single device so that file systems would not have to be modified to take advantage of multiple devices. This design added another layer of complexity and ultimately prevented certain file system advances, because the file system had no control over the physical placement of data on the virtualized volumes.

The following sequence shows how a mirrored ZFS storage pool detects and repairs corrupt data:

  1. The application issues a read. The ZFS mirror tries the first disk.

  2. The checksum reveals that the block is corrupt on disk. ZFS tries the second disk.

  3. The checksum indicates that the block is good. ZFS returns good data to the application and repairs the damaged block on the first disk.
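The read path described above can be sketched as a toy model. This is only an illustration, not how ZFS is implemented: the Mirror class, its methods, and the use of SHA-256 as the checksum are assumptions made for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum; SHA-256 is chosen only for illustration."""
    return hashlib.sha256(data).hexdigest()


class Mirror:
    """Toy mirror: each 'disk' is a dict that maps a block number to bytes."""

    def __init__(self, disk_count: int = 2):
        self.disks = [{} for _ in range(disk_count)]
        self.checksums = {}  # block number -> expected checksum, kept apart from the data

    def write(self, block: int, data: bytes) -> None:
        self.checksums[block] = checksum(data)
        for disk in self.disks:
            disk[block] = data

    def read(self, block: int) -> bytes:
        expected = self.checksums[block]
        failed = []  # indexes of disks whose copy did not verify
        for index, disk in enumerate(self.disks):
            data = disk.get(block)
            if data is not None and checksum(data) == expected:
                # Good copy found: repair every copy that failed verification.
                for bad_index in failed:
                    self.disks[bad_index][block] = data
                return data
            failed.append(index)
        raise IOError(f"block {block}: no copy passed checksum verification")


mirror = Mirror()
mirror.write(7, b"kernel source")
mirror.disks[0][7] = b"kernel sourcf"  # simulate silent corruption on the first disk
data = mirror.read(7)                  # the read falls through to the second disk
repaired = mirror.disks[0][7]          # the first disk holds good data again
```

The key design point is that the expected checksum is stored separately from the block it describes, so a corrupt block cannot vouch for itself.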

Creating a Mirrored ZFS Storage Pool

The objective of this lab exercise is to create and list a mirrored storage pool using the zpool command.

For information about determining whether a ZFS mirrored storage pool configuration or a RAID-Z storage pool configuration is right for your environment, go to: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Summary

ZFS is easy, so let's get on with it! It's time to create your first pool:

To Create Mirrored Storage Pools
  1. Open a terminal window.
  2. Create a mirrored storage pool named tank. Then, display information about the pool.
    # zpool create tank mirror c1t1d0 c2t2d0
    # zpool status tank
      pool: tank
     state: ONLINE
     scrub: none requested
    config:
    
            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c1t1d0  ONLINE       0     0     0
                c2t2d0  ONLINE       0     0     0
    
    errors: No known data errors

    The capacity of the c1t1d0 and c2t2d0 disks is 36 Gbytes each. Because the disks are mirrored, the total capacity of the pool reflects the approximate size of one of the disks. Pool metadata consumes a small quantity of disk space. For example:

    # zpool list
    NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
    tank                   33.8G     89K   33.7G     0%  ONLINE     -
Creating ZFS File Systems as Home Directories

The objective of this lab exercise is to learn how to set up ZFS file systems as several home directories.

By using ZFS file system features, available in the OpenSolaris project, you might be able to simplify your kernel development environment by implementing snapshots and their rollback features.

Summary

In this lab, we'll use the zfs command to create a filesystem and set its mountpoint.

To Create ZFS File Systems as Home Directories
  1. Display the default ZFS file system that is created automatically when the storage pool is created.
    # zfs list
    NAME   USED  AVAIL  REFER  MOUNTPOINT
    tank    86K  33.2G  24.5K  /tank
  2. Create the tank/home file system:
    # zfs create tank/home
  3. Then, set the mount point for the tank/home file system:
    # zfs set mountpoint=/export/home tank/home
  4. Finally, create tank/home file systems for all of your developers:
    # zfs create tank/home/developer1
    # zfs create tank/home/developer2
    # zfs create tank/home/developer3
    # zfs create tank/home/developer4

    The mountpoint property is inherited as a pathname prefix. That is, tank/home/developer1 is automatically mounted at /export/home/developer1 because tank/home is mounted at /export/home.

  5. Confirm that the ZFS file systems are created:
    # zfs list
    NAME                   USED  AVAIL  REFER  MOUNTPOINT
    tank                   246K  33.2G  26.5K  /tank
    tank/home              128K  33.2G  29.5K  /export/home
    tank/home/developer1  24.5K  33.2G  24.5K  /export/home/developer1
    tank/home/developer2  24.5K  33.2G  24.5K  /export/home/developer2
    tank/home/developer3  24.5K  33.2G  24.5K  /export/home/developer3
    tank/home/developer4  24.5K  33.2G  24.5K  /export/home/developer4
  6. Take a recursive snapshot of the tank/home file system. Then, display the snapshot information:
    # zfs snapshot -r tank/home@today
    # zfs list
    NAME                         USED  AVAIL  REFER  MOUNTPOINT
    tank                         252K  33.2G  26.5K  /tank
    tank/home                    128K  33.2G  29.5K  /export/home
    tank/home@today                 0      -  29.5K  -
    tank/home/developer1        24.5K  33.2G  24.5K  /export/home/developer1
    tank/home/developer1@today      0      -  24.5K  -
    tank/home/developer2        24.5K  33.2G  24.5K  /export/home/developer2
    tank/home/developer2@today      0      -  24.5K  -
    tank/home/developer3        24.5K  33.2G  24.5K  /export/home/developer3
    tank/home/developer3@today      0      -  24.5K  -
    tank/home/developer4        24.5K  33.2G  24.5K  /export/home/developer4
    tank/home/developer4@today      0      -  24.5K  -

    For more information, see the zfs(1M) man page.
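The mountpoint inheritance described in step 4 can be expressed as a small lookup. The sketch below is an illustration only: the effective_mountpoint function and its dictionary argument are hypothetical names, and the model covers only the pathname-prefix behavior described in this lab.

```python
def effective_mountpoint(dataset: str, explicit: dict) -> str:
    """Return where `dataset` is mounted.

    `explicit` maps dataset names to mount points that were set
    with `zfs set mountpoint=...`; all other datasets inherit.
    """
    parts = dataset.split("/")
    # Walk from the dataset up toward the pool and use the nearest explicit setting.
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        if ancestor in explicit:
            base = explicit[ancestor].rstrip("/")
            suffix = "/".join(parts[depth:])
            return f"{base}/{suffix}" if suffix else explicit[ancestor]
    # Nothing was set explicitly: the default is the dataset name under the root.
    return "/" + dataset


settings = {"tank/home": "/export/home"}
pool_mount = effective_mountpoint("tank", settings)
home_mount = effective_mountpoint("tank/home", settings)
dev1_mount = effective_mountpoint("tank/home/developer1", settings)
```

With the single setting from step 3, the pool stays at /tank while every developer file system lands under /export/home without any further configuration.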

Creating a RAID-Z Configuration

The objective of this lab exercise is to introduce you to the RAID-Z configuration.

Summary

You might want to create a RAID-Z configuration as an alternative to a mirrored storage pool configuration if you need to maximize disk space.
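To make the space trade-off concrete, the following sketch compares the approximate usable capacity of the two layouts. It assumes that a mirror yields about the capacity of one disk, as noted in the mirrored pool lab, and that single-parity RAID-Z gives up roughly one disk's worth of space to parity. Metadata overhead is ignored, the function name is hypothetical, and the 36-Gbyte size for the five RAID-Z disks is an assumption carried over from the mirror lab.

```python
def usable_capacity_gb(layout: str, disk_sizes_gb: list) -> float:
    """Approximate usable space for one virtual device, ignoring metadata overhead."""
    if not disk_sizes_gb:
        raise ValueError("at least one disk is required")
    smallest = min(disk_sizes_gb)  # the smallest member limits every other member
    count = len(disk_sizes_gb)
    if layout == "mirror":
        # Every disk holds a full copy, so capacity is that of one disk.
        return smallest
    if layout == "raidz1":
        if count < 2:
            raise ValueError("single-parity RAID-Z needs at least two disks")
        # Roughly one disk's worth of space holds parity.
        return (count - 1) * smallest
    raise ValueError(f"unknown layout: {layout}")


mirror_gb = usable_capacity_gb("mirror", [36, 36])  # two disks, one disk of space
raidz_gb = usable_capacity_gb("raidz1", [36] * 5)   # five disks, four disks of space
```

Under these assumptions, the two-disk mirror offers about 36 Gbytes from 72 Gbytes of raw storage, while the five-disk RAID-Z device offers about 144 Gbytes from 180 Gbytes.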

To Create a RAID-Z Configuration
  1. Open a terminal window.
  2. Create a pool with a single RAID-Z device consisting of 5 disks. Then, display information about the storage pool.
    # zpool create tank raidz c1t1d0 c2t2d0 c3t3d0 c4t4d0 c5t5d0
    # zpool status tank
      pool: tank
     state: ONLINE
     scrub: none requested
    config:
    
            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              raidz1    ONLINE       0     0     0
                c1t1d0  ONLINE       0     0     0
                c2t2d0  ONLINE       0     0     0
                c3t3d0  ONLINE       0     0     0
                c4t4d0  ONLINE       0     0     0
                c5t5d0  ONLINE       0     0     0
    
    errors: No known data errors

    Disks can be specified by using their shorthand name or the full path. For example, /dev/dsk/c4t4d0 is identical to c4t4d0.

    It is possible to use disk slices for both mirrored and RAID-Z storage pool configurations, but these configurations are not recommended for production environments. For more information about using ZFS in production environments, go to: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide.

Module 4 Userland Consolidations

Objectives

The objective of this module is to introduce you to the userland consolidations of OpenSolaris. In general, you can think of userland consolidations as existing outside of the kernel and as components with which users interact. Each of the following consolidations delivers source files to the opensolaris.org web site or download center. To access each consolidation, refer to the following URL: http://opensolaris.org/os/downloads/

Userland Consolidations and Descriptions

Application Server

The Glassfish Application Server

Developer Product Tools (DevPro)

The system math library, the media library, the microtasking library, SCCS and make and C++ runtime libraries.

Documentation (Docs)

Developer and administration technical documentation.

Globalization Support (G11N)

Internationalization and localization support.

Installation Support (Install)

Installation support and packaging tools.

Java Desktop (JDS)

A secure and comprehensive enterprise desktop software solution.

Java Platform, Standard Edition (Java SE)

Binaries for the Java Development Kit (JDK) and Java Runtime Environment (JRE) are available.

Manual Pages

Source code to the SunOS Reference Manual Pages.

Message Queue

The Sun Java System Message Queue.

Network Storage (NWS)

Network storage device support.

SFW (Solaris FreeWare)

Open source software that is bundled with Solaris/OpenSolaris.

SPARC Graphics Support

The SPARC graphics consolidation has drivers available in binary form.

Test

OpenSolaris Test Suites and Test Tools.

X Window System (X11)

X11 software.

Module 5 Core Features of the Solaris OS

Objectives

The objective of this module is to describe the core features of the Solaris OS and how the features have fundamentally changed operating system computing.

Additional Resources

OpenSolaris Development Process; http://www.opensolaris.org/os/community/onnv/os_dev_process/

C Style and Coding Standards for SunOS; http://www.opensolaris.org/os/community/documentation/getting_started_docs/

Development Process and Coding Style

This section describes the development process and the coding style used by the OS/Net (ON) consolidation, which delivers the core operating system and networking components to Solaris. ON contains the source for the kernel and all platforms (on all architectures), the bulk of the drivers, file systems, core libraries, and basic commands that you'd expect to find on a Solaris system.

The development process for the OpenSolaris project follows these high-level steps:

  1. Idea

    First, someone has an idea for an enhancement or has a gripe about a defect. Search for an existing bug or file a new bug or request for enhancement (RFE) by using the http://bugs.opensolaris.org/ web page. Next, announce it to other developers on the appropriate email list. The announcement helps to:

    • Precipitate discussion of the change or enhancement

    • Determine the complexity of the proposed change(s)

    • Gauge community interest

    • Identify potential team members

  2. Design

    The Design phase determines whether a formal design review is needed. If a formal review is needed, complete the following steps:

    • Identify design and architectural reviewers

    • Write a design document

    • Write a test plan

    • Conduct design reviews and get the appropriate approvals

  3. Implementation

    The Implementation phase consists of the following:

    • Writing of the actual code in accordance with policies and standards

      Download C Style and Coding Standards for SunOS here: http://www.opensolaris.org/os/community/documentation/getting_started_docs/.

    • Writing the test suites

    • Passing various unit and pre-integration tests

    • Writing or updating the user documentation, if needed

    • Identifying code reviewers in preparation for integration

  4. Integration

    Integration happens after all reviews have been completed and permission to integrate has been granted.

    The purpose of the Integration phase is to make sure that everything that was supposed to be done has in fact been done, which means conducting reviews for code, documentation, and completeness.

    Sometimes, the integrated change needs to be communicated by sending heads-up messages to the appropriate communities and possibly by presenting a transfer of information (TOI) to a support organization to help them understand the change.

The formal process document for OpenSolaris describes the previous steps in greater detail, with flow charts that illustrate the development phases. That document also details the following design principles and core values that are to be applied to source code development for the OpenSolaris project:

  • Reliability – OpenSolaris must perform correctly, providing accurate results with no data loss or corruption.

  • Availability – Services must be designed to be restartable in the event of an application failure and OpenSolaris itself must be able to recover from non-fatal hardware failures.

  • Serviceability – It must be possible to diagnose both fatal and transient issues and wherever possible, automate the diagnosis.

  • Security – OpenSolaris security must be designed into the operating system, with mechanisms in place in order to audit changes done to the system and by whom.

  • Performance – The performance of OpenSolaris must be second to none when compared to other operating systems running on identical environments.

  • Manageability – It must allow for the management of individual components, software or hardware, in a consistent and straightforward manner.

  • Compatibility – New subsystems and interfaces must be extensible and versioned in order to allow for future enhancements and changes without sacrificing compatibility.

  • Maintainability – OpenSolaris must be architected so that common subroutines are combined into libraries or kernel modules that can be used by an arbitrary number of consumers.

  • Platform Neutrality – OpenSolaris must continue to be platform neutral and lower level abstractions must be designed with multiple and future platforms in mind.

Refer to http://www.opensolaris.org/os/community/onnv/os_dev_process/ for more detailed information about the process that is used for collaborative development of OpenSolaris code.

Like many projects, OpenSolaris enforces a coding style on contributed code, regardless of its source. This style is described in detail at http://opensolaris.org/os/community/onnv/.

This coding style is very similar to that used by the Linux kernel, BSD systems, and many other non-GNU projects (the GNU project uses its own unique coding style). Also, examine the files in usr/src/prototypes; these provide examples of the correct general layout and style for most types of source files.

Two tools for checking many elements of the coding style are available as part of the OpenSolaris distribution. These tools are cstyle(1) for verifying compliance of C code with most style guidelines, and hdrchk(1) for checking the style of C and C++ headers.

There are style mistakes that cannot be caught by any reasonable tool, and others that cannot be caught by the particular implementations.

Improving the accuracy and completeness of these tools is an ongoing task.

Overview

Now that you have considered the development environment, processes, and values that OpenSolaris developers apply to engineering, let's discuss in more depth the features of the operating system that exemplify performance, security, serviceability, and manageability:

  • Performance

    • FireEngine

    • Nemo

    • Crossbow

  • Security

    • Least Privilege

    • Packet Filtering

    • Zones

    • Branded Zones (BrandZ)

  • Serviceability

    • Predictive Self-Healing

    • Dynamic Tracing Facility (DTrace)

    • Modular Debugger (MDB)

  • Manageability

    • Services Management Facility (SMF)

    • ZFS

FireEngine

The "FireEngine" approach in Solaris 10 merges all protocol layers into one STREAMS module that is fully multithreaded. Inside the merged module, instead of per-data structure locks, a per-CPU synchronization mechanism called "vertical perimeter" is used. The "vertical perimeter" is implemented using a serialization queue abstraction called "squeue." Each squeue is bound to a CPU, and each connection is in turn bound to a squeue that provides any synchronization and mutual exclusion needed for the connection-specific data structures.
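The serialization idea behind the vertical perimeter can be illustrated with a user-level toy model. The sketch below is not the Solaris implementation: the Squeue and Connection classes, the enter method, and the tcp_input function are hypothetical names, and the drain strategy shown is just one common way to build a serialization queue. What it demonstrates is that connection state needs no lock of its own when all work for the connection passes through one queue.

```python
import threading
from collections import deque


class Squeue:
    """Toy serialization queue: at most one thread processes work at a time."""

    def __init__(self, cpu: int):
        self.cpu = cpu  # the CPU this queue is notionally bound to
        self._lock = threading.Lock()
        self._pending = deque()
        self._busy = False

    def enter(self, func, conn, packet) -> None:
        with self._lock:
            self._pending.append((func, conn, packet))
            if self._busy:
                return  # another thread is draining and will process this item
            self._busy = True
        # This thread is now the only one that runs work for this queue.
        while True:
            with self._lock:
                if not self._pending:
                    self._busy = False
                    return
                work, work_conn, work_packet = self._pending.popleft()
            work(work_conn, work_packet)


class Connection:
    def __init__(self, name: str, squeue: Squeue):
        self.name = name
        self.squeue = squeue      # each connection is bound to one squeue
        self.bytes_received = 0   # modified only from inside the squeue


def tcp_input(conn: Connection, packet: bytes) -> None:
    conn.bytes_received += len(packet)  # no per-connection lock is needed here


squeues = [Squeue(cpu) for cpu in range(2)]
conn = Connection("192.168.0.4:80", squeues[0])

threads = [
    threading.Thread(target=conn.squeue.enter, args=(tcp_input, conn, b"x" * 100))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Eight threads deliver packets concurrently, yet tcp_input never runs on two threads at once for this connection, because the queue hands all pending work to whichever thread is currently draining it.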

Synchronization

Since the stack is fully multithreaded (barring the per-CPU serialization enforced by the vertical perimeter), it uses a reference-based scheme to ensure that the connection instance is available when needed. For an established TCP connection, three references are guaranteed to be on it. Each protocol layer has a reference on the instance (one each for TCP and IP) and the classifier itself has a reference since it is an established connection. Each time a packet arrives for the connection and the classifier looks up the connection instance, an extra reference is placed, which is dropped when the protocol layer finishes processing that packet.
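The reference scheme in the preceding paragraph can be sketched as follows. This is a single-threaded illustration with hypothetical class and function names. A real implementation would update the count atomically or under a lock.

```python
class ConnectionInstance:
    """Toy connection instance with a reference count."""

    def __init__(self):
        # An established TCP connection starts with three references:
        # one for TCP, one for IP, and one for the classifier.
        self.refs = 3
        self.freed = False

    def hold(self) -> None:
        if self.freed:
            raise RuntimeError("connection instance already freed")
        self.refs += 1

    def release(self) -> None:
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # the last reference is gone; the instance can be reclaimed


class Classifier:
    def __init__(self):
        self.table = {}

    def insert(self, key, conn: ConnectionInstance) -> None:
        self.table[key] = conn

    def lookup(self, key):
        conn = self.table.get(key)
        if conn is not None:
            conn.hold()  # extra reference for the packet that is being processed
        return conn


def process_packet(classifier: Classifier, key, trace: list) -> bool:
    conn = classifier.lookup(key)
    if conn is None:
        return False
    try:
        trace.append(conn.refs)  # count while the packet is in flight
    finally:
        conn.release()           # dropped when protocol processing finishes
    return True


classifier = Classifier()
established = ConnectionInstance()
key = ("192.168.0.4", 80, "192.168.0.50", 32768)
classifier.insert(key, established)

trace = []
handled = process_packet(classifier, key, trace)
```

While the packet is being processed the count is four, and it returns to three afterward, so the instance cannot disappear underneath the protocol code that is using it.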

TCP, IP, and UDP

The Solaris 10 OS provides the same view of TCP as previous releases. That is, TCP appears as a clone device, but it is actually a composite, with the TCP and IP code merged into a single D_MP STREAMS module. The operational part of TCP is fully protected by the vertical perimeter, which is entered through the squeue primitives. FireEngine changes the interface between TCP and IP from the existing STREAMS-based message-passing interface to a function call-based interface, in both the control and data paths.

There is a fully multithreaded UDP module running under the same protection domain as IP. Though UDP and IP are running in the same protection domain, they are still separate STREAMS modules. Therefore, STREAMS plumbing is kept unchanged and a UDP module instance is always pushed above IP. The Solaris 10 platform allows for the following plumbing modes:

  • Normal – IP is first opened and later UDP is pushed directly on top. This is the default action that happens when a UDP socket or device is opened.

  • SNMP – UDP is pushed on top of a module other than IP. When this happens, only SNMP semantics will be supported.

GLDv3

Solaris 10 software introduces a new device driver framework called GLDv3 along with the new stack. Most of the major device drivers were ported to this framework, and all future and 10-Gbit device drivers will be based on this framework. This framework also provides a STREAMS-based DLPI layer for backward compatibility (to allow external, non-IP modules to continue to work). The GLDv3 architecture virtualizes layer 2 of the network stack. A one-to-one correspondence between network interfaces and devices no longer exists.

Refer to the Nemo Project hosted on opensolaris.org for more information about the framework, the MAC services module, and the Data-Link Services module.

Virtualization

Project Crossbow creates virtual stacks around any service (HTTP, HTTPS, FTP, NFS, etc.), protocol (TCP, UDP, SCTP, etc.), or Solaris Containers technology. The virtual stacks are separated by means of a H/W classification engine such that traffic for one stack does not impact other virtual stacks. Each virtual stack can be assigned its own priority and bandwidth on a shared NIC without causing performance degradation to the system or the service/container. The architecture dynamically manages priority and bandwidth resources, and can provide better defense against denial-of-service attacks directed at a particular service or container by isolating the impact to just that service or container.

Least Privilege

UNIX® has historically had an all-or-nothing privilege model that imposes the following restrictions:

  • No way to limit root user privileges

  • No way for non-root users to perform privileged operations

  • Applications needing only a few privileged operations must run as root

  • Very few are trusted with root privileges and virtually no students are so trusted

In the Solaris OS, we've developed fine-grained privileges. Fine-grained privileges allow applications and users to run with just the privileges they need. The least privilege model allows students to be granted only the privileges that they need to complete their course work, participate in research, and maintain a portion of the campus or department infrastructure.
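The difference between the all-or-nothing model and fine-grained privileges can be sketched as a simple set-membership check. The privilege names below (net_privaddr, sys_time, and the members of the basic set) are Solaris privilege names, but the Process class and the operation-to-privilege table are hypothetical and exist only for this illustration.

```python
# Privileges that an ordinary process holds by default.
BASIC = frozenset({"file_link_any", "proc_exec", "proc_fork", "proc_info", "proc_session"})

# Hypothetical table: which privilege each example operation requires.
REQUIRED = {
    "bind_port_80": "net_privaddr",  # bind to a port below 1024
    "set_clock": "sys_time",         # change the time-of-day clock
    "fork": "proc_fork",
}


class Process:
    def __init__(self, name: str, privileges):
        self.name = name
        self.privileges = frozenset(privileges)

    def can(self, operation: str) -> bool:
        return REQUIRED[operation] in self.privileges


# A web server gets the basic set plus exactly one extra privilege.
web_server = Process("httpd", BASIC | {"net_privaddr"})
student_shell = Process("sh", BASIC)

server_binds = web_server.can("bind_port_80")
server_sets_clock = web_server.can("set_clock")
student_binds = student_shell.can("bind_port_80")
```

Under the traditional model, the web server would have to run as root to bind port 80 and would then also be able to change the clock. Here it holds only the one additional privilege it needs.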

Packet Filtering

Solaris IP Filter provides stateful packet filtering and network address translation (NAT). Solaris IP Filter is derived from the open source IP Filter software. IP Filter can filter by IP address, port, protocol, or network interface according to filter rules.

IP Filter

The Packet Filtering Hooks (PFHooks) API was introduced in Solaris 10 Update 4 to replace the STREAMS-based implementation of IP Filter. With the PFHooks framework, the performance of firewall software like IP Filter is significantly improved. PFHooks also provides the ability to intercept loopback and inter-zone traffic. Third-party firewall software can be developed against the PFHooks API and registers its hooks by calling net_register_hook(info, event, hook).

Enabling Simple Packet Filters

The objective of this exercise is to learn about packet filtering. Solaris IP Filter is installed with the Solaris operating system. However, packet filtering is not enabled by default. IP Filter can filter by IP address, port, protocol, or network interface according to filter rules. Following is an example filter rule:

block in on ce0 proto tcp from 192.168.0.0/16 to any port = 23

To use Solaris IP Filter, enter your filter rules in the /etc/ipf/ipf.conf file. Then, enable and restart the svc:/network/ipfilter service by using the svcadm command.


Note - You can also use the ipf command to work with the rule sets.


Solaris IP Filter can perform network address translation (NAT) for a source address or a destination address according to NAT rules. Following is an example of a NAT rule:

map ce0 192.168.1.0/24 -> 10.1.0.0/16

To use network address translation, enter your NAT rules in the /etc/ipf/ipnat.conf file. Then, enable and restart the svc:/network/ipfilter service by using the svcadm command.


Note - You can also use the ipnat command to work with rule sets.


Sample Packet Filtering Rules

This section includes various examples of filtering rule syntax. Invoke the rules by adding them to the /etc/ipf/ipf.conf file. Then, enable Solaris IP Filter and restart the service as described in the previous section.

Log all inbound packets on le0 that have IP options present.

log in on le0 from any to any with ipopts

Block any inbound packets on le0 that are fragmented and too short to allow any meaningful comparison. This actually applies only to TCP packets, which can be missing the flags or ports (depending on which part of the fragment you see).

block in log quick on le0 from any to any with short frag

Log all inbound TCP packets with the SYN flag (only) set.

log in on le0 proto tcp from any to any flags S/SA


Note - If an inbound TCP packet has the SYN flag set and also has IP options present, both this rule and the earlier IP options rule match, so the packet is logged twice.


Block and log any inbound ICMP unreachables.

block in log on le0 proto icmp from any to any icmp-type unreach

Block and log any inbound UDP packets on le0 which are going to port 2049 (the NFS port).

block in log on le0 proto udp from any to any port = 2049

Quickly allow any packets to or from a particular pair of hosts.

pass in quick from any to 10.1.3.2/32
pass in quick from any to 10.1.0.13/32
pass in quick from 10.1.3.2/32 to any
pass in quick from 10.1.0.13/32 to any

Block (and stop matching) any packet with IP options present.

block in quick on le0 from any to any with ipopts

Allow any inbound packet through.

pass in from any to any

Block any inbound UDP packets destined for these subnets.

block in on le0 proto udp from any to 10.1.3.0/24
block in on le0 proto udp from any to 10.1.1.0/24
block in on le0 proto udp from any to 10.1.2.0/24

Block any inbound TCP packets with only the SYN flag set that are destined for these subnets.

block in on le0 proto tcp from any to 10.1.3.0/24 flags S/SA
block in on le0 proto tcp from any to 10.1.2.0/24 flags S/SA
block in on le0 proto tcp from any to 10.1.1.0/24 flags S/SA

Block any inbound ICMP packets destined for these subnets.

block in on le0 proto icmp from any to 10.1.3.0/24
block in on le0 proto icmp from any to 10.1.1.0/24
block in on le0 proto icmp from any to 10.1.2.0/24
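
The way the sample rules interact depends on evaluation order. In IP Filter, the last matching rule decides what happens to a packet, unless a matching rule carries the quick keyword, in which case evaluation stops at that rule. The sketch below models that behavior for a few of the inbound rules above. It is a simplified illustration: the Rule class and helper functions are hypothetical, logging and flag matching are not modeled, and the default of passing unmatched packets is an assumption.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    action: str                  # "pass" or "block"
    quick: bool = False          # stop evaluating when this rule matches
    iface: Optional[str] = None  # None means any interface
    proto: Optional[str] = None  # None means any protocol
    src: str = "any"
    dst: str = "any"
    dport: Optional[int] = None  # None means any destination port


def addr_matches(spec: str, addr: str) -> bool:
    return spec == "any" or ipaddress.ip_address(addr) in ipaddress.ip_network(spec)


def rule_matches(rule: Rule, pkt: dict) -> bool:
    return (
        rule.iface in (None, pkt["iface"])
        and rule.proto in (None, pkt["proto"])
        and addr_matches(rule.src, pkt["src"])
        and addr_matches(rule.dst, pkt["dst"])
        and rule.dport in (None, pkt.get("dport"))
    )


def evaluate(rules: list, pkt: dict, default: str = "pass") -> str:
    verdict = default  # assumed default when no rule matches
    for rule in rules:
        if rule_matches(rule, pkt):
            verdict = rule.action  # a later match overrides an earlier one
            if rule.quick:
                break              # quick: this rule is final
    return verdict


rules = [
    Rule("pass", quick=True, dst="10.1.3.2/32"),                  # pass in quick from any to 10.1.3.2/32
    Rule("pass"),                                                 # pass in from any to any
    Rule("block", iface="le0", proto="udp", dst="10.1.3.0/24"),   # block in on le0 proto udp from any to 10.1.3.0/24
]


def packet(proto: str, dst: str, dport: int) -> dict:
    return {"iface": "le0", "proto": proto, "src": "172.16.0.9", "dst": dst, "dport": dport}


to_trusted_host = evaluate(rules, packet("udp", "10.1.3.2", 2049))
to_subnet_udp = evaluate(rules, packet("udp", "10.1.3.9", 2049))
to_subnet_tcp = evaluate(rules, packet("tcp", "10.1.3.9", 80))
```

UDP traffic to 10.1.3.2 is passed because the quick rule ends evaluation early. UDP traffic to another host in 10.1.3.0/24 is blocked because the block rule is the last match. TCP traffic to that host is passed because only the general pass rule matches.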
Zones

A zone is a virtual operating system abstraction that provides a protected environment in which applications run. The applications are protected from each other to provide software fault isolation. To ease the labor of managing multiple applications and their environments, they co-exist within one operating system instance, and are usually managed as one entity.

A small number of applications that normally run as root or with certain privileges may not run inside a zone if they rely on being able to access or change some global resource. An example might be the ability to change the system's time-of-day clock. The few applications that fall into this category may need modifications to run properly inside a zone or, in some cases, should continue to be used within the global zone.

Here are some guidelines:

  • An application which accesses the network and files, and performs no other I/O, should work correctly.

  • Applications which require direct access to certain devices, for example, a disk partition, will usually work if the zone is configured correctly. However, in some cases this may increase security risks.

  • Applications which require direct access to certain other devices, for example, /dev/kmem or a network device, may need to be modified to work correctly. Applications should instead use one of the many IP services.

The superuser in a zone can't affect or obtain privileges in other zones. This gives students a safe sandbox in which to experiment.

Zones can be used as an instructional tool or as an infrastructure component. For example, you can allocate each student an IP address and a zone and allow them all to safely share one machine.

Zones can be combined with the resource management facilities which are present in OpenSolaris to provide more complete, isolated environments. While the zone supplies the security, name space and fault isolation, the resource management facilities can be used to prevent processes in one zone from using too much of a system resource or to guarantee them a certain service level. Together, zones and resource management are often referred to as containers.

See http://opensolaris.org/os/community/zones/faq for answers to a large number of common questions about zones and links to the latest administration documentation.

Zones provide protected environments for Solaris applications. Separate and protected run-time environments are available through the OpenSolaris project, by using BrandZ.

Branded Zones (BrandZ)

BrandZ is a framework that extends the zones infrastructure to create Branded Zones, which are zones that contain non-native operating environments. A branded zone may be as simple as an environment where the standard Solaris utilities are replaced by their GNU equivalents, or as complex as a complete Linux user space.

BrandZ extends the Zones infrastructure in user space in the following ways:

  • A brand is an attribute of a zone, set at zone configuration time.

  • Each brand provides its own installation routine, which allows us to install an arbitrary collection of software in the branded zone.

  • Each brand may provide pre-boot and post-boot scripts that allow us to do any final boot-time setup or configuration.

  • The zonecfg and zoneadm tools can set and report a zone's brand type.

BrandZ provides a set of interposition points in the kernel:

  • These points are found in the syscall path, process loading path, thread creation path, etc.

  • These interposition points are only applied to processes in a branded zone.

  • At each of these points, a brand may choose to supplement or replace the standard behavior of the Solaris OS.

  • Fundamentally different brands may require new interposition points.

The lx brand enables Linux binary applications to run unmodified on Solaris, within zones that are running a complete Linux user space. The lx brand enables user-level Linux software to run on a machine with an OpenSolaris kernel, and includes the tools necessary to install a CentOS or Red Hat Enterprise Linux distribution inside a zone on a Solaris system. The lx brand runs on x86/x64 systems booted with either a 32-bit or 64-bit kernel. Regardless of the underlying kernel, only 32-bit Linux applications are able to run. This feature is available only for x86 and x64 architectures at this time. However, porting to SPARC might be an interesting community project because BrandZ lx is still very much a work in progress.

Refer to http://opensolaris.org/os/community/brandz/install for the installation requirements and instructions.

The OpenSolaris project addresses the unique challenges of operating system development and testing for application performance using features like zones.

Zones Networking

Solaris zones can be designated as one of the following:

  • Exclusive-IP zone

  • Shared-IP zone

Exclusive-IP zones have their own IP stacks and may have their own physical interfaces. An exclusive-IP zone may also have its own VLAN interfaces. The configuration of exclusive-IP zones is the same as that of a physical machine.

Shared-IP zones share the IP stack with the global zone, so shared-IP zones are shielded from the configuration details for devices, routing and so on. Each shared-IP zone can be assigned IPv4/IPv6 addresses. Each shared-IP zone also has its own port space. Applications can bind to INADDR_ANY and will only receive traffic for that zone.

Neither type of zone can see the traffic of other zones. Packets coming from a zone have a source address belonging to that zone. A shared-IP zone can only send packets on an interface on which it has an address. A shared-IP zone can only use a default router if it is directly reachable from the zone, so the default router has to be in the same IP subnet as the zone.

Shared-IP zones cannot change their network configuration or routing table and cannot see the configuration of other zones. /dev/ip is not present in a shared-IP zone, so SNMP agents must open /dev/arp instead. Multiple shared-IP zones can share a broadcast address and may join the same multicast group.

Shared-IP zones have the following networking limitations:

  • Cannot put a physical interface inside a zone

  • IPFilter does not work between zones

  • No DHCP for zone IP addresses

  • No dynamic routing

Exclusive-IP zones do not have the above limitations, and can change their network configuration or routing table inside the zone. /dev/ip is present in the exclusive-IP zone.

Zones Identity, CPU Visibility, and Packaging

Each zone controls its node name, timezone, and naming services like LDAP and NIS. The sysidtool can set this up. Separate /etc/passwd files mean that root privileges can be delegated to the zone. User IDs may map to different names when domains differ.

By default, all zones see all CPUs. Restricted view is enabled automatically when resource pools are enabled.

Zones can add their own packages, and patches can be applied to those packages. System patches are applied in the global zone; each non-global zone then automatically boots in single-user mode (boot -s) to apply the patch. Packages marked with the SUNW_PKG_ALLZONES attribute should be kept consistent between the global zone and all non-global zones. The SUNW_PKG_HOLLOW attribute causes the package name to appear in non-global zones (NGZ) for dependency purposes, but the package contents are not installed.

Zones Devices

Each zone has its own devices. Zones see a subset of safe pseudo devices in their /dev directory. Applications reference the logical path to a device presented in /dev. The /dev directory exists in non-global zones, the /devices directory does not. Devices like random, console, and null are safe, but others like /dev/kmem are not.

Zones can modify the permissions of their devices but cannot issue mknod(2). Physical device files, like those for raw disks, can be put in a zone with caution. Devices may be shared among zones, but consider the security implications carefully before doing so.

For example, you might have devices that you want to assign to specific zones. Allowing unprivileged users to access block devices could permit those devices to be used to cause system panic, bus resets, or other adverse effects. Placing a physical device into more than one zone can create a covert channel between zones. Global zone applications that use such a device risk the possibility of compromised data or data corruption by a non-global zone.

Predictive Self-Healing

Predictive self-healing was implemented in two ways in the Solaris 10 OS. This section describes the new Fault Management Architecture and Services Management Facility that make up the self-healing technology.

Fault Management Architecture (FMA)

The Solaris OS provides a new architecture, FMA, for building resilient error handlers, error telemetry, automated diagnosis software, response agents, and a consistent model of system failures for a management stack. Many parts of Solaris are already participating in FMA, including the CPU and Memory error handling for UltraSPARC III and IV, the UltraSPARC PCI HBAs, and Opteron. A variety of projects are underway, including full support for CPU, Memory, and I/O faults on Opteron, conversion of key device drivers, and integration with various management stacks.

When a subsystem is converted to participate in Fault Management, error handling is made resilient so that the system can continue to operate despite some underlying failure, and telemetry events are produced that drive automated diagnosis and response. The Fault Management tools and architecture enable development of self-healing content for software and hardware failures, for both microscopic and macroscopic system resources, all with a unified, simple view for administrators and system management software.

The Fault Manager associates diagnosis state with persistent identifiers corresponding to the system resources, such as hardware serial numbers. As a result, the Fault Manager automatically updates this state after most repair actions, without requiring any manual intervention.

See http://opensolaris.org/os/community/fm for information about how to participate in the Fault Management community or to download the Fault Management MIB that is currently in development.

The legacy UNIX failure model was simply to leave error handling up to each subsystem author, and simply provide the ability to emit an error message for a human to the system log in a non-standard format.

Dynamic Tracing (DTrace)

Historically, it has been difficult to look inside of complicated software systems with no way to see interactions between processes and no way to observe kernel activity.

This makes it difficult to understand even a single application.

DTrace is a new facility in the OpenSolaris project for systemic dynamic instrumentation.

By using a special-purpose control language, DTrace can give concise answers to arbitrary questions about the system.

DTrace provides a powerful infrastructure to permit administrators, developers, and service personnel to concisely answer arbitrary questions about the behavior of the operating system and user programs. DTrace enables you to do the following:

  • Dynamically enable and manage thousands of probes

  • Dynamically associate predicates and actions with probes

  • Dynamically manage trace buffers and probe overhead

  • Examine trace data from a live system or from a system crash dump

  • Implement new trace data providers that plug into DTrace

  • Implement trace data consumers that provide data display

  • Implement tools that configure DTrace probes

Find the DTrace community pages at http://opensolaris.org/os/community/dtrace.


Note - Go to http://opensolaris.org/os/community/edu/curriculum_development/general_test/ to download the scripts, source code, and makefile for CCtest used later in this document to demonstrate C++ debugging with DTrace.


In addition to DTrace, the OpenSolaris project provides debugging facilities for low-level types of development, for example, device driver development.

This gives students the means to take software apart and understand its inner workings.

It enables computer science educators to show principles from the classroom on a real, production machine.

It allows researchers to better and more quickly understand and improve software systems.

Modular Debugger (MDB)

MDB is a debugger designed to facilitate analysis of problems that require low-level debugging facilities, examination of core files, and knowledge of assembly language to diagnose and correct. Generally, kernel and device developers rely on mdb to determine why and where their code went wrong.

MDB is available as two commands that share common features: mdb and kmdb. You can use the mdb command interactively or in scripts to debug live user processes, user process core files, kernel crash dumps, the live operating system, object files, and other files. You can use the kmdb command to debug the live operating system kernel and device drivers when you also need to control and halt the execution of the kernel.

There is an active community for MDB, where you can ask the experts or review previous conversations and common questions. See http://opensolaris.org/os/community/mdb

ZFS File System

ZFS filesystems are not constrained to specific devices, so they can be created easily and quickly like directories. They grow automatically within the space allocated to the storage pool.

Checksumming and Data Recovery

With ZFS, all data and metadata is checksummed using a user-selectable algorithm. All checksumming and data recovery is done at the file system layer, and is transparent to applications. In addition, ZFS provides for self-healing data. ZFS supports storage pools with varying levels of data redundancy, including mirroring and a variation on RAID-5. When a bad data block is detected, ZFS fetches the correct data from another redundant copy, and repairs the bad data, replacing it with the good copy.

Traditional file systems that do provide checksumming have performed it on a per-block basis, out of necessity due to the volume management layer and traditional file system design. The traditional design means that certain failure modes, such as writing a complete block to an incorrect location, can result in properly checksummed data that is actually incorrect. ZFS checksums are stored in a way such that these failure modes are detected and can be recovered from gracefully.
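The misdirected-write failure mode described above can be illustrated with a small, hedged sketch in C. It assumes a toy in-memory "disk" and a toy checksum (not an algorithm that ZFS uses), and it models the idea of keeping a block's checksum in the parent's pointer to that block rather than next to the data. All names here (struct block, struct blkptr, toy_cksum, and so on) are hypothetical and are introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLKSZ   16
#define NBLOCKS 4

struct block {
    uint8_t  payload[BLKSZ];
    uint32_t self_cksum;        /* checksum stored alongside the data */
};

struct blkptr {                 /* hypothetical parent pointer */
    int      addr;              /* where the child block should live */
    uint32_t cksum;             /* checksum the child block should have */
};

static struct block disk[NBLOCKS];

/* Toy checksum for illustration only. */
static uint32_t
toy_cksum(const uint8_t *p, size_t n)
{
    uint32_t a = 0, b = 0;

    for (size_t i = 0; i < n; i++) {
        a += p[i];
        b += a;
    }
    return ((b << 16) | (a & 0xffff));
}

/* Writes text to a block and stores a matching checksum next to it. */
static void
write_block(int addr, const char *text)
{
    assert(addr >= 0 && addr < NBLOCKS);
    memset(disk[addr].payload, 0, BLKSZ);
    strncpy((char *)disk[addr].payload, text, BLKSZ - 1);
    disk[addr].self_cksum = toy_cksum(disk[addr].payload, BLKSZ);
}

/* Block 1 should change from "old" to "new", but the write lands at 3. */
static void
misdirected_write(struct blkptr *bp)
{
    uint8_t intended[BLKSZ] = { 0 };

    write_block(1, "old");
    memcpy(intended, "new", 3);
    bp->addr = 1;
    bp->cksum = toy_cksum(intended, BLKSZ); /* parent expects "new" */
    write_block(3, "new");                  /* wrong location */
}

/* Per-block checksum: the stale block 1 still verifies against itself. */
int
stale_block_passes_self_check(void)
{
    struct blkptr bp;

    misdirected_write(&bp);
    return (toy_cksum(disk[1].payload, BLKSZ) == disk[1].self_cksum);
}

/* Parent-held checksum: the stale block 1 does not match what is expected. */
int
stale_block_passes_parent_check(void)
{
    struct blkptr bp;

    misdirected_write(&bp);
    return (toy_cksum(disk[bp.addr].payload, BLKSZ) == bp.cksum);
}
```

In this sketch, stale_block_passes_self_check() returns 1 because the old data and its neighboring checksum are still mutually consistent, while stale_block_passes_parent_check() returns 0, so the error is detected and a redundant copy could be consulted.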

Old way: create one filesystem, such as /export/home, to manage many user subdirectories.

ZFS way: just create one filesystem per user.

ZFS presents a pooled storage model that eliminates the concept of volumes and the associated problems of partitions, provisioning, wasted bandwidth, and stranded storage.

Thousands of file systems can draw from a common storage pool, each one consuming only as much space as it actually needs.

The combined I/O bandwidth of all devices in the pool is available to all filesystems at all times.

Each storage pool is composed of one or more virtual devices, which describe the layout of physical storage and its fault characteristics. See http://opensolaris.org/os/community/zfs/demos/basics for 100 Mirrored Filesystems in 5 Minutes, a demonstration of administering ZFS file systems.

RAID-Z

In addition to pooled storage, ZFS provides mirrored and RAID-Z data redundancy configurations. A RAID-Z configuration is a virtual device that stores data and parity on multiple disks, similar to RAID-5.

All traditional RAID-5-like algorithms, including RAID-4, RAID-5, RAID-6, RDP, and EVEN-ODD, suffer from a problem known as the "RAID-5 write hole": if only part of a RAID-5 stripe is written, and power is lost before all blocks have made it to disk, the parity will remain out of sync with data.

The parity is therefore useless forever, unless a subsequent full-stripe write overwrites it.
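The write hole can be made concrete with a minimal sketch in C. It assumes single XOR parity over a tiny three-block stripe, as in classic RAID-5; it does not model how RAID-Z itself lays out data. The stripe layout, the block size, and all function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NDATA   3       /* data blocks per stripe (hypothetical layout) */
#define BLKSZ   8       /* tiny block size, for illustration only */

struct stripe {
    uint8_t data[NDATA][BLKSZ];
    uint8_t parity[BLKSZ];
};

/* Recomputes parity as the XOR of all data blocks. */
void
stripe_update_parity(struct stripe *s)
{
    memset(s->parity, 0, BLKSZ);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLKSZ; i++)
            s->parity[i] ^= s->data[d][i];
}

/* Rebuilds data block 'lost' from parity and the surviving data blocks. */
void
stripe_reconstruct(const struct stripe *s, int lost, uint8_t out[BLKSZ])
{
    assert(lost >= 0 && lost < NDATA);
    memcpy(out, s->parity, BLKSZ);
    for (int d = 0; d < NDATA; d++) {
        if (d == lost)
            continue;
        for (int i = 0; i < BLKSZ; i++)
            out[i] ^= s->data[d][i];
    }
}

/*
 * Overwrites data block 'written'. If parity_written is 0, this simulates
 * power loss before the parity update reaches disk. Returns 1 if a later
 * reconstruction of block 'lost' yields that block's true contents.
 */
int
reconstruct_ok(int written, int lost, int parity_written)
{
    struct stripe s;
    uint8_t truth[BLKSZ], rebuilt[BLKSZ];

    for (int d = 0; d < NDATA; d++)
        memset(s.data[d], 0x10 + d, BLKSZ);
    stripe_update_parity(&s);               /* consistent stripe */

    memset(s.data[written], 0xEE, BLKSZ);   /* partial-stripe write */
    if (parity_written)
        stripe_update_parity(&s);           /* otherwise parity is stale */

    memcpy(truth, s.data[lost], BLKSZ);
    stripe_reconstruct(&s, lost, rebuilt);
    return (memcmp(truth, rebuilt, BLKSZ) == 0);
}
```

With these definitions, reconstruct_ok(0, 1, 1) returns 1, while reconstruct_ok(0, 1, 0) returns 0: once parity is stale, losing a disk later silently rebuilds the wrong data.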

In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are full-stripe writes. This feature is only possible because ZFS integrates filesystem and device management in such a way that the filesystem's metadata has enough information about the underlying data replication model to handle variable-width RAID stripes. RAID-Z is the world's first software-only solution to the RAID-5 write hole.

Services Management Facility (SMF)

SMF creates a supported, unified model for management of an enormous number of services, such as email delivery, ftp requests, and remote command execution in the OpenSolaris project. The smf(5) framework replaces (in a compatible manner) the existing init.d(4) startup mechanism and includes an enhanced inetd(1M). SMF gives developers the following:

  • Automated restart, in dependency order, of services that fail due to administrative errors, software bugs, or uncorrectable hardware errors

  • A single API for service management, configuration, and observation

  • Access to service-based resource management

  • Simplified boot-process debugging

See http://opensolaris.org/os/community/smf/scfdot for a graph of the SMF services and their dependencies on an x86 system freshly installed with the Solaris OS Nevada.

Module 6 Programming Concepts

Objectives

This module provides a high-level description of the fundamental concepts of the OpenSolaris programming environment, as follows:

  • Process and System Management

  • Threaded Programming

  • Kernel Overview

  • CPU Scheduling

  • Process Debugging

Additional Resources
  • Solaris Internals (2nd Edition), Prentice Hall PTR (May 12, 2006) by Jim Mauro and Richard McDougall

  • Solaris Systems Programming, Prentice Hall PTR (August 19, 2004), by Rich Teer

  • Multithreaded Programming Guide. Sun Microsystems, Inc., 2005.

  • STREAMS Programming Guide. Sun Microsystems, Inc., 2005.

  • Solaris 64-bit Developer’s Guide. Sun Microsystems, Inc., 2005.

Process and System Management

The basic unit of workload is the process. Process IDs (PIDs) are numbered sequentially throughout the system. By default, each user is assigned by the system administrator to a project, which is a network-wide administrative identifier. Each successful login to a project creates a new task, which is a grouping mechanism for processes. A task contains the login process as well as subsequent child processes.

The resource pools facility brings together process-bindable resources into a common abstraction called a pool. Processor sets and other entities are configured, grouped, and labelled such that workload components are associated with a subset of a system's total resources. When the pools facility is disabled, all processes belong to the same pool, pool_default, and processor sets are managed through the pset() system call. When the pools facility is enabled, processor sets must be managed by using the pools facility. New pools can be created and associated with processor sets. Processes may be bound to pools that have non-empty resource sets.

If we search OpenGrok for pool.c, we find extensive code comments that describe relationships between tasks, pools, projects, and processes, as follows:

The operation that binds tasks and projects to pools is atomic. That is, either all processes in a given task or a project will be bound to a new pool, or (in case of an error) they will be all left bound to the old pool. Processes in a given task or a given project can only be bound to different pools if they were rebound individually one by one as single processes. Threads or LWPs of the same process do not have pool bindings, and are bound to the same resource sets associated with the resource pool of that process.

Processes can optionally be run inside a zone. Zones are set up by system administrators, often for security purposes, in order to isolate groups of users or processes from one another.

Threaded Programming

Now that we've learned about processes in the context of tasks, projects, resource pools, zones, and branded zones, let's discuss processes in the context of threads. Traditional UNIX already supports the concept of threads. Each process contains a single thread, so programming with multiple processes is programming with multiple threads. But, a process is also an address space, and creating a process involves creating a new address space.

Creating a thread is less expensive than creating a new process because the newly created thread uses the current process address space.

The time that is required to switch between threads is less than the time required to switch between processes.

A switch between threads is faster because no switching between address spaces occurs.

Communication between the threads of one process is simple because the threads share everything, including a common address space and open file descriptors. So, data produced by one thread is immediately available to all the other threads.
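The difference between a new process and a new thread can be observed with a short sketch in C. It assumes a POSIX environment with fork(2) and pthreads; the variable and function names are hypothetical. A child process gets its own copy of the address space, so its write is invisible to the parent, whereas a thread writes into the address space that its creator also uses.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value;        /* lives in the process address space */

/* Thread start routine: stores *arg into the global variable. */
static void *
set_value(void *arg)
{
    shared_value = *(int *)arg;
    return (NULL);
}

/* Returns the value the parent sees after a child PROCESS stores 42. */
int
value_after_fork(void)
{
    pid_t pid;

    shared_value = 0;
    pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        shared_value = 42;      /* changes only the child's private copy */
        _exit(0);
    }
    (void) waitpid(pid, NULL, 0);
    return (shared_value);      /* the parent's copy is unchanged */
}

/* Returns the value the creator sees after a THREAD stores 42. */
int
value_after_thread(void)
{
    pthread_t tid;
    int v = 42;
    int rc;

    shared_value = 0;
    rc = pthread_create(&tid, NULL, set_value, &v);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return (shared_value);      /* same address space, so the write is seen */
}
```

Under these assumptions, value_after_fork() returns 0 and value_after_thread() returns 42.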

The libraries are libpthread for POSIX threads, and libthread for OpenSolaris threads. Multithreading provides flexibility by decoupling kernel-level and user-level resources. In OpenSolaris, multithreading support for both sets of interfaces is provided by the standard C library.

Use pthread_create(3C) to add a new thread of control to the current process.

int pthread_create(pthread_t *tid, const pthread_attr_t *tattr,
    void *(*start_routine)(void *), void *arg);

The pthread_create() function is called with tattr, an attribute object that has the necessary state behavior. start_routine is the function with which the new thread begins execution, and arg is the argument passed to it. When start_routine returns, the thread exits with the exit status set to the value returned by start_routine. pthread_create() returns zero when the call completes successfully. Any other return value indicates that an error occurred. Go to /on/usr/src/lib/libc/spec/threads.spec in OpenGrok for the complete list of pthread functions and declarations.
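As a minimal sketch of the call described above, the following C code creates one thread with default attributes (a NULL tattr) and collects the value returned by start_routine. It also uses pthread_join(3C), which is not discussed in the text above, to wait for the thread and read its exit status. The square() and square_in_thread() names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* start_routine: squares the integer that was passed through arg. */
static void *
square(void *arg)
{
    intptr_t n = (intptr_t)arg;

    return ((void *)(n * n));   /* becomes the thread's exit status */
}

/* Runs square() in a new thread and returns its result, or -1 on error. */
long
square_in_thread(long n)
{
    pthread_t tid;
    void *status = NULL;
    int rc;

    /* NULL attributes request the default thread behavior. */
    rc = pthread_create(&tid, NULL, square, (void *)(intptr_t)n);
    if (rc != 0)
        return (-1);            /* a nonzero return value means an error */

    rc = pthread_join(tid, &status);
    assert(rc == 0);
    return ((long)(intptr_t)status);
}
```

Passing a small integer through the void * argument keeps the sketch short; real code usually passes a pointer to a structure whose lifetime exceeds that of the thread.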

Thread synchronization enables you to control program flow and access to shared data for concurrently executing threads. The four synchronization objects are mutex locks, read/write locks, condition variables, and semaphores.

  • Mutex locks allow only one thread at a time to execute a specific section of code, or to access specific data.

  • Read/write locks permit concurrent reads and exclusive writes to a protected shared resource. To modify a resource, a thread must first acquire the exclusive write lock. An exclusive write lock is not permitted until all read locks have been released.

  • Condition variables block threads until a particular condition is true.

  • Counting semaphores typically coordinate access to resources. The count is the limit on how many threads can have access to the resource at the same time. When the count is reached, the next thread that tries to access the resource blocks.
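Of the four synchronization objects listed above, the mutex lock is the simplest to demonstrate. The following C code is a hedged sketch, assuming POSIX threads: several threads increment one shared counter, and a mutex ensures that only one thread at a time executes the increment. The thread count, iteration count, and function names are arbitrary choices for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS    4
#define NINCREMENTS 100000

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Each thread adds NINCREMENTS to the shared counter, one step at a time. */
static void *
increment_many(void *arg)
{
    (void) arg;
    for (int i = 0; i < NINCREMENTS; i++) {
        (void) pthread_mutex_lock(&counter_lock);
        counter++;              /* critical section: one thread at a time */
        (void) pthread_mutex_unlock(&counter_lock);
    }
    return (NULL);
}

/* Starts the threads, waits for all of them, and returns the final count. */
long
run_counter_demo(void)
{
    pthread_t tids[NTHREADS];

    counter = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&tids[i], NULL, increment_many, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        (void) pthread_join(tids[i], NULL);
    return (counter);
}
```

Because every increment happens while the lock is held, run_counter_demo() returns NTHREADS * NINCREMENTS. Without the lock, concurrent read-modify-write sequences could lose updates.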

Synchronization

Synchronization objects are variables in memory that you access just like data. Threads in different processes can communicate with each other through synchronization objects that are placed in threads-controlled shared memory. The threads can communicate with each other even though the threads in different processes are generally invisible to each other. Synchronization objects can also be placed in files. The synchronization objects can have lifetimes beyond the life of the creating process.
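As a sketch of a synchronization object shared between processes, the following C code places a mutex and a counter in anonymous shared memory and lets a parent and a child process increment the counter under the lock. It assumes a POSIX system that supports PTHREAD_PROCESS_SHARED mutexes and MAP_ANON mappings; the shared_state structure and run_shared_counter() function are hypothetical names.

```c
#define _DEFAULT_SOURCE         /* glibc: expose MAP_ANON in strict modes */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared_state {
    pthread_mutex_t lock;       /* lives in memory both processes map */
    long counter;
};

/* Parent and child each add per_process; returns the total, or -1. */
long
run_shared_counter(int per_process)
{
    struct shared_state *s;
    pthread_mutexattr_t attr;
    pid_t pid;
    long total;
    int rc;

    s = mmap(NULL, sizeof (*s), PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_ANON, -1, 0);
    if (s == MAP_FAILED)
        return (-1);

    /* Mark the mutex as usable by threads in different processes. */
    (void) pthread_mutexattr_init(&attr);
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    assert(rc == 0);
    (void) pthread_mutex_init(&s->lock, &attr);
    (void) pthread_mutexattr_destroy(&attr);
    s->counter = 0;

    pid = fork();
    if (pid < 0) {
        (void) munmap(s, sizeof (*s));
        return (-1);
    }

    /* Both processes run this loop against the same mapped memory. */
    for (int i = 0; i < per_process; i++) {
        (void) pthread_mutex_lock(&s->lock);
        s->counter++;
        (void) pthread_mutex_unlock(&s->lock);
    }
    if (pid == 0)
        _exit(0);               /* child is done */

    (void) waitpid(pid, NULL, 0);
    total = s->counter;
    (void) pthread_mutex_destroy(&s->lock);
    (void) munmap(s, sizeof (*s));
    return (total);
}
```

The two processes cannot see each other's threads, yet they coordinate correctly because the lock itself is in shared memory.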

We can use OpenGrok to find libthread in the source code tree, and the second most relevant result is found in mutex.c, accompanied by the following code comment excerpt:

Implementation of all threads interfaces between ld.so.1 and libthread. In a non-threaded environment all thread interfaces are vectored to noops. When called via _ld_concurrency() from libthread these vectors are reassigned to real threads interfaces. Two models are supported:

TI_VERSION == 1 Under this model libthread provides rw_rwlock/rw_unlock, through which we vector all rt_mutex_lock/rt_mutex_unlock calls. Under lib/libthread these interfaces provided _sigon/_sigoff (unlike lwp/libthread that provided signal blocking via bind_guard/bind_clear).

TI_VERSION == 2 Under this model only libthreads bind_guard/bind_clear and thr_self interfaces are used. Both libthreads block signals under the bind_guard/bind_clear interfaces. Lower level locking is derived from internally bound _lwp_ interfaces. This removes recursive problems encountered when obtaining locking interfaces from libthread. The use of mutexes over reader/writer locks also enables the use of condition variables for controlling thread concurrency (allows access to objects only after their .init has completed).

Now that you understand a bit about how synchronization objects are defined in multi-threaded programming, let's learn how these objects are managed by using scheduling classes.

CPU Scheduling

Processes run in a scheduling class with a separate scheduling policy applied to each class, as follows:

  • Realtime (RT) – The highest-priority scheduling class provides a policy for those processes that require fast response and absolute user or application control of scheduling priorities. RT scheduling can be applied to a whole process or to one or more lightweight processes (LWPs) in a process. You must have the proc_priocntl privilege to use the Realtime class. See the privileges(5) man page for details.

  • System (SYS) – The middle-priority scheduling class, the system class cannot be applied to a user process.

  • Timeshare (TS) – The lowest-priority scheduling class is TS, which is also the default class. The TS policy distributes the processing resource fairly among processes with varying CPU consumption characteristics. Other parts of the kernel can monopolize the processor for short intervals without degrading the response time seen by the user.

  • Inter-Active (IA) – The IA policy distributes the processing resource fairly among processes with varying CPU consumption characteristics, while also providing good responsiveness for user interaction.

  • Fair Share (FSS) – The FSS policy distributes the processing resource fairly among projects, independent of the number of processes they own by specifying shares to control the process entitlement to CPU resources. Resource usage is remembered over time, so that entitlement is reduced for heavy usage and increased for light usage with respect to other projects.

  • Fixed-Priority (FX) – The FX policy provides a fixed priority preemptive scheduling policy for those processes requiring that the scheduling priorities do not get dynamically adjusted by the system and that the user or application have control of the scheduling priorities. This class is a useful starting point for affecting CPU allocation policies.

A scheduling class is maintained for each lightweight process (LWP). Threads have the scheduling class and priority of their underlying LWPs. Each LWP in a process can have a unique scheduling class and priority that are visible to the kernel. Thread priorities regulate contention for synchronization objects.

The RT and TS scheduling classes both call priocntl(2) to set the priority level of processes or LWPs within a process. Using OpenGrok to search the code base for priocntl, we can quickly find the rtsched.c file, where comments describe the variables that are used for the RT and TS scheduling classes, as follows:

#pragma ident    "@(#)rtsched.c    1.10    05/06/08 SMI"

#include "lint.h"
#include "thr_uberdata.h"
#include <sched.h>
#include <sys/priocntl.h>
#include <sys/rtpriocntl.h>
#include <sys/tspriocntl.h>
#include <sys/rt.h>
#include <sys/ts.h>

/*
 * The following variables are used for caching information
 * for priocntl TS and RT scheduling classs.
 */
struct pcclass ts_class, rt_class;

static rtdpent_t *rt_dptbl;    /* RT class parameter table */
static int rt_rrmin;
static int rt_rrmax;
static int rt_fifomin;
static int rt_fifomax;
static int rt_othermin;
static int rt_othermax;
...
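The names of the cached variables in the listing above (rt_rrmin, rt_rrmax, rt_fifomin, and so on) suggest per-policy minimum and maximum priorities. That reading is an assumption, not something the excerpt states. As a minimal, portable sketch of how such ranges can be queried, the following C code uses the POSIX sched_get_priority_min() and sched_get_priority_max() functions rather than priocntl(2). The prio_range structure and policy_range() helper are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <sched.h>

struct prio_range {
    int min;
    int max;
};

/*
 * Queries the priority range of one POSIX scheduling policy, such as
 * SCHED_FIFO, SCHED_RR, or SCHED_OTHER. Both fields are -1 on error.
 */
struct prio_range
policy_range(int policy)
{
    struct prio_range r;

    r.min = sched_get_priority_min(policy);
    r.max = sched_get_priority_max(policy);
    assert(r.min <= r.max);     /* also holds when both calls fail */
    return (r);
}
```

The actual numeric ranges are system-specific, so code that sets priorities should query them in this way instead of hard-coding values.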

Typing the man priocntl command in a terminal window displays the man page, which shows the details of each scheduling class and describes its attributes, usage, and command options. For example:

% man priocntl
Reformatting page.  Please Wait... done

User Commands                                         priocntl(1)

NAME
     priocntl - display or set scheduling parameters of specified
     process(es)

SYNOPSIS
     priocntl -l

     priocntl -d [-i idtype] [idlist]

     priocntl  -s  [-c class]  [  class-specific    options]   [-
     i idtype] [idlist]

     priocntl -e [-c class] [  class-specific   options]  command
     [argument(s)]

DESCRIPTION
     The priocntl command displays or sets scheduling  parameters
     of the specified process(es). It can also be used to display
     the current configuration information for the system's  pro-
     cess  scheduler or execute a command with specified schedul-
     ing parameters.

     Processes  fall  into  distinct  classes  with  a   separate
     scheduling policy applied to each class. The process classes
     currently supported are the  real-time  class,  time-sharing
     class,  interactive  class,  fair-share class, and the fixed
     priority class. The characteristics of these classes and the
     class-specific  options  they  accept are described below in
     the USAGE section under the headings Real-Time Class,  Time-
     Sharing  Class,  Inter-Active  Class,  Fair-Share Class, and
     --More--(4%)

Kernel Overview

Now that you have a high-level understanding of processes, threads, and scheduling, let's discuss the kernel and how kernel modules are different from user programs. The Solaris kernel does the following:

  • Manages the system resources, including file systems, processes, and physical devices.

  • Provides applications with system services such as I/O management, virtual memory, and scheduling.

  • Coordinates interactions of all user processes and system resources.

  • Assigns priorities, services resource requests, and services hardware interrupts and exceptions.

  • Schedules and switches threads, pages memory, and swaps processes.

The following section discusses several important differences between kernel modules and user programs.

Execution Differences Between Kernel Modules and User Programs

The following characteristics of kernel modules highlight important differences between the execution of kernel modules and the execution of user programs:

  • Kernel modules have a separate address space. A module runs in kernel space. An application runs in user space. System software is protected from user programs. Kernel space and user space have their own memory address spaces.

  • Kernel modules have higher execution privilege. Code that runs in kernel space has greater privilege than code that runs in user space.

  • Kernel modules do not execute sequentially. A user program typically executes sequentially and performs a single task from beginning to end. A kernel module does not execute sequentially. A kernel module registers itself in order to serve future requests.

  • Kernel modules can be interrupted. More than one process can request your kernel module at the same time. For example, an interrupt handler can request your kernel module at the same time that your kernel module is serving a system call. In a symmetric multiprocessor (SMP) system, your kernel module could be executing concurrently on more than one CPU.

  • Kernel modules must be preemptable. You cannot assume that your kernel module code is safe just because your driver code does not block. Design your driver assuming your module might be preempted.

  • Kernel modules can share data. Different threads of an application program need not share data. By contrast, the data structures and routines that constitute a driver are shared by all threads that use the driver. Your driver must be able to handle contention issues that result from multiple requests. Design your driver data structures carefully to keep multiple threads of execution separate.

Structural Differences Between Kernel Modules and User Programs

The following characteristics of kernel modules highlight important differences between the structure of kernel modules and the structure of user programs:

  • Kernel modules do not define a main program. Kernel modules, including device drivers, have no main() routine. Instead, a kernel module is a collection of subroutines and data.

  • Kernel modules are linked only to the kernel. Kernel modules do not link in the same libraries that user programs link in. The only functions a kernel module can call are functions that are exported by the kernel.

  • Kernel modules use different header files. Kernel modules require a different set of header files than user programs require. The required header files are listed in the man page for each function. Kernel modules can include header files that are shared by user programs if the user and kernel interfaces within such shared header files are defined conditionally using the _KERNEL macro.

  • Kernel modules should avoid global variables. Avoiding global variables in kernel modules is even more important than avoiding global variables in user programs. As much as possible, declare symbols as static. When you must use global symbols, give them a prefix that is unique within the kernel. Using this prefix for private symbols within the module also is a good practice.

  • Kernel modules can be customized for hardware. Kernel modules can dedicate processor registers to specific roles, and kernel code can be optimized for a specific processor. Libraries can be customized as well, as OpenSolaris does for some of the more recent x86/x64 and UltraSPARC platforms. So, although the kernel can dedicate certain registers to certain roles, customized code can otherwise be written for both the kernel and user libraries.

  • Kernel modules can be loaded and unloaded on demand. The collection of subroutines and data that constitute a device driver can be compiled into a single loadable module of object code. This loadable module can then be statically or dynamically linked into the kernel and unlinked from the kernel. You can add functionality to the kernel while the system is up and running. You can test new versions of your driver without rebooting your system.
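The idea that a module has no main() routine, keeps its symbols static, and registers entry points to serve future requests can be sketched in ordinary user-space C. This is only an analogy under stated assumptions: the "framework" below is a hypothetical stand-in for the kernel, and none of the names (module_ops, framework_register, zero_module_init, and so on) are real Solaris kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entry points that a hypothetical module exports instead of main(). */
struct module_ops {
    const char *name;
    int (*mod_open)(void);
    size_t (*mod_read)(char *buf, size_t len);
};

#define MAX_MODULES 8
static const struct module_ops *registry[MAX_MODULES];
static int nmodules;

/* Framework side: remember the module so later requests can reach it. */
int
framework_register(const struct module_ops *ops)
{
    assert(ops != NULL && ops->name != NULL);
    if (nmodules == MAX_MODULES)
        return (-1);
    registry[nmodules++] = ops;
    return (0);
}

/* Framework side: dispatch a read request to the named module. */
size_t
framework_read(const char *name, char *buf, size_t len)
{
    for (int i = 0; i < nmodules; i++) {
        if (strcmp(registry[i]->name, name) == 0 &&
            registry[i]->mod_open() == 0)
            return (registry[i]->mod_read(buf, len));
    }
    return (0);                 /* no such module registered */
}

/* Module side: private (static) subroutines and one ops table. */
static int
zero_open(void)
{
    return (0);
}

static size_t
zero_read(char *buf, size_t len)
{
    memset(buf, 0, len);        /* this toy module supplies zero bytes */
    return (len);
}

static const struct module_ops zero_ops = { "zero", zero_open, zero_read };

/* The only global symbol of the module, with a module-unique prefix. */
int
zero_module_init(void)
{
    return (framework_register(&zero_ops));
}
```

After zero_module_init() runs once, the module does nothing until the framework calls one of its entry points, which mirrors how a kernel module waits to serve requests rather than executing from beginning to end.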

Process Debugging

Debugging processes at all levels of the development stack is a key part of writing kernel modules.

A full search for libthread in OpenGrok reveals the following code comments in the mdb_tdb.c file, which describe the connection between multi-threaded debugging and how mdb works. The comments are excerpted here:

In order to properly debug multi-threaded programs, the proc target must be able to query and modify information such as a thread's register set using either the native LWP services provided by libproc (if the process is not linked with libthread), or using the services provided by libthread_db (if the process is linked with libthread). Additionally, a process may begin life as a single-threaded process and then later dlopen() libthread, so we must be prepared to switch modes on-the-fly. There are also two possible libthread implementations (one in /usr/lib and one in /usr/lib/lwp) so we cannot link mdb against libthread_db directly; instead, we must dlopen the appropriate libthread_db on-the-fly based on which libthread.so the victim process has open. Finally, mdb is designed so that multiple targets can be active simultaneously, so we could even have *both* libthread_db's open at the same time. This might happen if you were looking at two multi-threaded user processes inside of a crash dump, one using /usr/lib/libthread.so and the other using /usr/lib/lwp/libthread.so. To meet these requirements, we implement a libthread_db "cache" in this file. The proc target calls mdb_tdb_load() with the pathname of a libthread_db to load, and if it is not already open, we dlopen() it, look up the symbols we need to reference, and fill in an ops vector which we return to the caller. Once an object is loaded, we don't bother unloading it unless the entire cache is explicitly flushed. This mechanism also has the nice property that we don't bother loading libthread_db until we need it, so the debugger starts up faster.

The following mdb commands can be used to access the LWPs of a multi-threaded program:

  • $l Prints the LWP ID of the representative thread if the target is a user process.

  • $L Prints the LWP IDs of each LWP in the target if the target is a user process.

  • pid::attach Attaches to process by using the pid, or process ID.

  • ::release Releases the previously attached process or core file. The process can subsequently be continued by prun(1) or it can be resumed by applying MDB or another debugger.

  • address::context Context switch to the specified process.

The following commands for setting and deleting breakpoints are often useful:

  • [ addr ] ::bp [+/-dDestT] [-c cmd] [-n count] sym ... Set a breakpoint at the specified locations.

  • addr ::delete [id | all] Delete the event specifiers with the given ID number.

DTrace probes are constructed in a manner similar to MDB queries. We'll continue the hands-on lab exercises with DTrace and then add MDB when the debugging becomes more complex.

Module 7 Getting Started With DTrace

Objectives

The objective of this lab is to introduce you to DTrace by using a probe script for a system call.

The dtrace provider includes three probes: BEGIN, END, and ERROR.

BEGIN is the first probe to fire. All BEGIN clauses will fire before any other probe fires. BEGIN is typically used to initialize.

END will fire after all other probes are completed and can be used to output results.

ERROR fires under an error condition and is used for error handling.

Additional Resources
  • Solaris Dynamic Tracing Guide. Sun Microsystems, Inc., 2007.

  • DTrace User Guide. Sun Microsystems, Inc., 2006.

Enabling Simple DTrace Probes

Completion of the lab exercise will result in a basic understanding of DTrace probes.

Summary

We're going to start learning DTrace by building some very simple requests using the probe named BEGIN, which fires once each time you start a new tracing request. You can use the dtrace(1M) utility's -n option to enable a probe using its string name.

To Enable a Simple DTrace Probe
  1. Open a terminal window.
  2. Enable the probe:
    # dtrace -n BEGIN

    After a brief pause, dtrace tells you that one probe was enabled, and you see a line of output indicating that the BEGIN probe fired. Once you see this output, dtrace remains paused, waiting for other probes to fire. Because you haven't enabled any other probes and BEGIN fires only once, nothing more is printed.

  3. Return to your shell prompt by pressing Control-C:
    # dtrace -n BEGIN
    dtrace: description 'BEGIN' matched 1 probe
    CPU     ID              FUNCTION:NAME
      0      1                  :BEGIN
    ^C
    #

    The output tells you that the probe named BEGIN fired once and both its name and integer ID, 1, are printed. Notice that by default, the integer name of the CPU on which this probe fired is displayed. In this example, the CPU column indicates that the dtrace command was executing on CPU 0 when the probe fired.

    You can construct DTrace requests using arbitrary numbers of probes and actions. Let's create a simple request using two probes by adding the END probe to the previous example command. The END probe fires once when tracing is completed.

  4. Add the END probe:
    # dtrace -n BEGIN -n END
    dtrace: description 'BEGIN' matched 1 probe
    dtrace: description 'END' matched 1 probe
    CPU     ID              FUNCTION:NAME
      0      1                  :BEGIN
    ^C
      0      2                    :END
    #

    As you can see, pressing Control-C to exit DTrace triggers the END probe. DTrace reports this probe firing before exiting.

Listing Traceable Probes

The objective of this lab is to explore probes in more detail and to show you how to list the probes on a system.

Summary

In the preceding examples, you learned to use two simple probes named BEGIN and END. But where did these probes come from? DTrace probes come from a set of kernel modules called providers, each of which performs a particular kind of instrumentation to create probes. For example, the syscall provider provides probes in every system call and the fbt provider provides probes into every function in the kernel.

When you use DTrace, each provider is given an opportunity to publish the probes it can provide to the DTrace framework. You can then enable and bind your tracing actions to any of the probes that have been published.

To List Traceable Probes
  1. Open a terminal window.
  2. Type the following command:
    # dtrace

    The dtrace command options are printed to the output.

  3. Type the dtrace command with the -l option:
    # dtrace -l | more
      ID   PROVIDER            MODULE          FUNCTION NAME
       1     dtrace                                     BEGIN
       2     dtrace                                     END
       3     dtrace                                     ERROR
       4   lockstat           genunix       mutex_enter adaptive-acquire
       5   lockstat           genunix       mutex_enter adaptive-block
       6   lockstat           genunix       mutex_enter adaptive-spin
       7   lockstat           genunix       mutex_exit  adaptive-release
    --More--

    The probes that are available on your system are listed with the following five pieces of data:

    • ID - Internal ID of the probe listed.

    • Provider - Name of the Provider. Providers are used to classify the probes. This is also the method of instrumentation.

    • Module - The name of the Unix module or application library of the probe.

    • Function - The name of the function in which the probe exists.

    • Name - The name of the probe.

  4. Pipe the previous command to wc to find the total number of probes in your system:
    # dtrace -l | wc -l
    30122

    The number of probes that your system is currently aware of is listed in the output. The number will vary depending on your system type.

  5. Add one of the following options to filter the list:
    • -P for provider

    • -m for module

    • -f for function

    • -n for name

    Consider the following examples:

    # dtrace -l -P lockstat
    ID   PROVIDER            MODULE          FUNCTION NAME
     4   lockstat           genunix       mutex_enter adaptive-acquire
     5   lockstat           genunix       mutex_enter adaptive-block
     6   lockstat           genunix       mutex_enter adaptive-spin
     7   lockstat           genunix       mutex_exit  adaptive-release

    Only the probes that are available in the lockstat provider are listed in the output.

    # dtrace -l -m ufs
      ID  PROVIDER            MODULE          FUNCTION NAME
      15   sysinfo               ufs     ufs_idle_free ufsinopage
      16   sysinfo               ufs ufs_iget_internal ufsiget
     356       fbt               ufs            allocg entry

    Only the probes that are in the UFS module are listed in the output.

    # dtrace -l -f open
     ID   PROVIDER            MODULE          FUNCTION NAME
      4   syscall                                 open entry
      5   syscall                                 open return
    116   fbt                genunix              open entry
    117   fbt                genunix              open return

    Only the probes with the function name open are listed.

    # dtrace -l -n start
      ID   PROVIDER            MODULE          FUNCTION NAME     
     506       proc              unix   lwp_rtt_initial start
    2766         io           genunix    default_physio start
    2768         io           genunix           aphysio start
    5909         io               nfs          nfs4_bio start

    The above command lists all the probes that have the probe name start.

Programming in D

Now that you understand a little bit about naming, enabling, and listing probes, you're ready to write the DTrace version of everyone's first program, "Hello, World."

Summary

This lab demonstrates that, in addition to constructing DTrace experiments on the command line, you can also write them in text files using the D programming language.

To Write a DTrace Program
  1. Open a terminal window.
  2. In a text editor, create a new file called hello.d.
  3. Type in your first D program:
    BEGIN
    {
        trace("hello, world");
        exit(0);
    }
  4. Save the hello.d file.
  5. Run the program by using the dtrace -s option:
    # dtrace -s hello.d
    dtrace: script 'hello.d' matched 1 probe
    CPU     ID              FUNCTION:NAME
      0        1                  :BEGIN   hello, world
    #

    As you can see, dtrace printed the same output as before followed by the text “hello, world”. Unlike the previous example, you did not have to wait and press Control-C, either. These changes were the result of the actions you specified for your BEGIN probe in hello.d. Let's explore the structure of your D program in more detail in order to understand what happened.

Discussion

Each D program consists of a series of clauses, each clause describing one or more probes to enable, and an optional set of actions to perform when the probe fires. The actions are listed as a series of statements enclosed in braces { } following the probe name. Each statement ends with a semicolon (;).

Your first statement uses the function trace() to indicate that DTrace should record the specified argument, the string “hello, world”, when the BEGIN probe fires, and then print it out. The second statement uses the function exit() to indicate that DTrace should cease tracing and exit the dtrace command.

DTrace provides a set of useful functions like trace() and exit() for you to call in your D programs. To call a function, you specify its name followed by a parenthesized list of arguments. The complete set of D functions is described in Solaris Dynamic Tracing Guide.

By now, if you're familiar with the C programming language, you've probably realized from the name and our examples that DTrace's D programming language is very similar to C and awk(1). Indeed, D is derived from a large subset of C combined with a special set of functions and variables to help make tracing easy.

If you've written a C program before, you will be able to immediately transfer most of your knowledge to building tracing programs in D. If you've never written a C program before, learning D is still very easy. But first, let's take a step back from language rules and learn more about how DTrace works, and then we'll return to learning how to build more interesting D programs.

Module 8 Debugging Applications With DTrace

Objectives

The objective of this module is to use DTrace to monitor application events.

Additional Resources

Application Packaging Developer’s Guide. Sun Microsystems, Inc., 2005.

Enabling User Mode Probes

DTrace allows you to dynamically add probes to user-level functions. The user code does not need recompilation, special flags, or even a restart. You turn on these probes simply by enabling them through the pid provider.

The pid provider is extremely flexible and allows you to instrument any instruction in user land, including function entry and return.

The pid provider creates probes on the fly when they are needed. This is why they do not appear in the dtrace -l listing.

You can use the pid provider to trace function boundaries or any arbitrary instruction in a given function.

A probe description has the following syntax:

pid:mod:function:name
 
  • pid:           the string pid followed by the process ID (for example, pid5234)

  • mod:           name of the library or a.out (executable)

  • function:      name of the function

  • name:          entry for function entry, or return for function return
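To make the four-field layout concrete, here is a minimal C++ sketch that splits a fully written probe description into its parts. The ProbeDescription type and the parse_probe() function are names invented for this illustration and are not part of DTrace. The sketch assumes that all three colons are present. The dtrace command itself also accepts shorter forms, which this sketch does not model.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper type: the four fields of a probe description.
struct ProbeDescription {
    std::string provider;  // for the pid provider, "pid" plus the process ID
    std::string module;    // library name or a.out; empty matches any module
    std::string function;  // function name; empty matches any function
    std::string name;      // "entry" or "return"
};

// Split "provider:module:function:name" on the first three colons.
// Assumption: the caller passes the full four-field form.
ProbeDescription parse_probe(const std::string& text) {
    std::array<std::string, 4> fields;
    std::size_t index = 0;
    for (char c : text) {
        if (c == ':' && index < 3) {
            ++index;               // move on to the next field
        } else {
            fields[index] += c;    // accumulate the current field
        }
    }
    return {fields[0], fields[1], fields[2], fields[3]};
}
```

For example, parse_probe("pid5234:libc:malloc:entry") yields the provider pid5234, the module libc, the function malloc, and the name entry. An empty field, as in pid5234:::entry, is left as an empty string, which corresponds to "match anything" in a probe description.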

DTracing Applications

In this exercise we will learn to use DTrace on user applications.

Summary

This lab builds on the use of a process ID in the probe description to trace the associated application. The steps increase in complexity to the end of the exercise, increasing the amount and depth of information about the application behavior that is output.

To DTrace gcalctool
  1. From the Application or Program menu, start the calculator.
  2. Find the process ID of the process you just started:
    # pgrep gcalctool      
    8198

    This number is the process ID of the gcalctool process. We will call it procid.

  3. Follow the steps below to create a D-script that counts the number of times any function in the gcalctool is called.
    1. In a text editor, create a new file called proc_func.d.
    2. Use pid$1:::entry as the probe-description.

      $1 is the first argument that you will send to your script. Leave the predicate part empty.

    3. In the action section, add an aggregate to count the number of times the function is called using the aggregate statement @[probefunc]=count().
      pid$1:::entry
      {
          @[probefunc] = count();
      }
    4. Run the script that you just wrote.
      # dtrace -qs proc_func.d procid

      Replace procid with the process ID of your gcalctool.

    5. Perform a calculation on the calculator.
    6. Press Control+C in the window where you ran the D-script.

    Note - The DTrace script collects data and waits for you to stop the collection by pressing Control+C. You do not need to print the aggregation you collected. DTrace will print it for you.


  4. Now, modify the script to only count functions from the libc library.
    1. Copy the proc_func.d to proc_libc.d.
    2. Modify the probe description in the proc_libc.d file to the following:
      pid$1:libc::entry
    3. Your new script should look like the following:
      pid$1:libc::entry
      {
          @[probefunc] = count();
      }
  5. Now run the script.
    # dtrace -qs  proc_libc.d  procid

    Replace procid with the process ID of your gcalctool.

    1. Perform a calculation on the calculator.
    2. Press Control+C in the window where you ran the D-script to see the output.
  6. Finally, modify the script to find how much time is spent in each function.
    1. Create a file and name it func_time.d.

      We will use two probe descriptions in func_time.d.

    2. Write the first probe as follows:
      pid$1:::entry
    3. Write the second probe as follows:
      pid$1:::return
    4. In the action section of the first probe, save timestamp in variable ts.

      timestamp is a DTrace built-in variable that counts the number of nanoseconds from an arbitrary point in the past.

    5. In the action section of the second probe calculate nanoseconds that have passed using the following aggregation:
      @[probefunc]=sum(timestamp - ts)
    6. The new func_time.d script should match the following:
      pid$1:::entry
      {
          ts = timestamp;
      }

      pid$1:::return
      /ts/
      {
          @[probefunc] = sum(timestamp - ts);
      }
  7. Run the new func_time.d script:
    # dtrace -qs  func_time.d procid

    Replace procid with the process ID of your gcalctool.

    1. Perform a calculation on the calculator.
    2. Press Control+C in the window where you ran the D-script to see the output.
      ^C
      gdk_xid__equal                                      2468
      _XSetLastRequestRead                                2998
      _XDeq                                               3092
      
      ...

    The left column shows you the name of the function and the right column shows you the amount of wall clock time that was spent in that function. The time is in nanoseconds.
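When you read these totals, keep in mind that func_time.d stores the entry time in a single variable, ts, that every function shares. The following C++ sketch models that bookkeeping outside of DTrace so you can see its effect on nested calls. The Firing type, both function names, and the sample timestamps are invented for this illustration, and the model assumes a single thread. It is a way to reason about the script, not a description of how DTrace is implemented.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One probe firing: the function, whether the probe was entry or return,
// and a timestamp in nanoseconds. Illustrative type, not a DTrace structure.
struct Firing {
    std::string function;
    bool is_entry;
    std::uint64_t timestamp;
};

using Totals = std::map<std::string, std::uint64_t>;

// Model of func_time.d as written: one ts shared by every function.
Totals sum_with_global_ts(const std::vector<Firing>& firings) {
    Totals totals;
    std::uint64_t ts = 0;
    for (const Firing& f : firings) {
        if (f.is_entry) {
            ts = f.timestamp;                        // overwritten by nested entries
        } else if (ts != 0) {                        // models the /ts/ predicate
            totals[f.function] += f.timestamp - ts;  // models sum(timestamp - ts)
        }
    }
    return totals;
}

// Alternative model: remember one entry time per call that is still active.
Totals sum_with_call_stack(const std::vector<Firing>& firings) {
    Totals totals;
    std::vector<std::uint64_t> entries;
    for (const Firing& f : firings) {
        if (f.is_entry) {
            entries.push_back(f.timestamp);
        } else if (!entries.empty()) {
            totals[f.function] += f.timestamp - entries.back();
            entries.pop_back();
        }
    }
    return totals;
}
```

Suppose outer is entered at 100, calls inner from 150 to 170, and returns at 300. The shared-variable model charges inner 20 nanoseconds but charges outer only 150, because the entry time of inner replaced the entry time of outer. The call-stack model charges outer the full 200. Treat the numbers for functions that call other functions as approximate.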

Module 9 Debugging C++ Applications With DTrace

Objectives

The examples in this module demonstrate the use of DTrace to diagnose C++ application errors. These examples are also used to compare DTrace with other application debugging tools, including Sun Studio 10 software and mdb.

The dtrace command compiles the D language script, and DTrace instructs the provider to enable the probes.

The intermediate code is checked for safety (like Java).

The compiled code is executed in the kernel by DTrace.

As soon as the D program exits, all instrumentation is removed.

Using DTrace to Profile and Debug A C++ Program

A sample program CCtest was created to demonstrate an error common to C++ applications -- the memory leak. In many cases, a memory leak occurs when an object is created, but never destroyed, and such is the case with the program contained in this module.

There is no limit (except system resources) on the number of D scripts that can be run simultaneously.

Different users can debug the system simultaneously without causing data corruption or collision issues.

When debugging a C++ program, you may notice that your compiler converts some C++ names into mangled, semi-intelligible strings of characters and digits. This name mangling is an implementation detail required for support of C++ function overloading, to provide valid external names for C++ function names that include special characters, and to distinguish instances of the same name declared in different namespaces and classes.

For example, using nm to extract the symbol table from a sample program named CCtest produces the following output:

# /usr/ccs/bin/nm CCtest
...
[61] | 134549248| 53|FUNC |GLOB |0 |9  |__1cJTestClass2T5B6M_v_
[85] | 134549301| 47|FUNC |GLOB |0 |9  |__1cJTestClass2T6M_v_
[76] | 134549136| 37|FUNC |GLOB |0 |9  |__1cJTestClass2t5B6M_v_
[62] | 134549173| 71|FUNC |GLOB |0 |9  |__1cJTestClass2t5B6Mpc_v_
[64] | 134549136| 37|FUNC |GLOB |0 |9  |__1cJTestClass2t6M_v_
[89] | 134549173| 71|FUNC |GLOB |0 |9  |__1cJTestClass2t6Mpc_v_
[80] | 134616000| 16|OBJT |GLOB |0 |18 |__1cJTestClassG__vtbl_
[91] | 134549348| 16|FUNC |GLOB |0 |9  |__1cJTestClassJClassName6kM_pc_
...

Note - Go to http://opensolaris.org/os/community/edu/curriculum_development/general_test/ to download the scripts, source code, and makefile for CCtest.


From this output, you may correctly assume that a number of these mangled symbols are associated with a class named TestClass, but you cannot readily determine whether these symbols are associated with constructors, destructors, or class functions.

The Sun Studio compiler includes the following three utilities that can be used to translate the mangled symbols to their C++ counterparts: nm -C, dem, and c++filt.


Note - Sun Studio 10 software is used here, but the examples were tested with both Sun Studio 9 and 10.


If your C++ application was compiled with gcc/g++, you have an additional choice for demangling your application -- in addition to c++filt, which recognizes both Sun Studio and GNU mangled names, the open source gc++filt found in /usr/sfw/bin can be used to demangle the symbols contained in your g++ application.

Examples: Sun Studio symbols without c++filt:

# nm CCtest | grep TestClass
[65] | 134549280| 37|FUNC |GLOB |0 |9 |__1cJTestClass2t6M_v_
[56] | 134549352| 54|FUNC |GLOB |0 |9 |__1cJTestClass2t6Mi_v_
[92] | 134549317| 35|FUNC |GLOB |0 |9 |__1cJTestClass2t6Mpc_v_
...

Sun Studio symbols with c++filt:

# nm CCtest | grep TestClass | c++filt
[65] | 134549280| 37|FUNC |GLOB |0 |9 |TestClass::TestClass()
[56] | 134549352| 54|FUNC |GLOB |0 |9 |TestClass::TestClass(int)
[92] | 134549317| 35|FUNC |GLOB |0 |9 |TestClass::TestClass(char*)
...

g++ symbols without gc++filt:

# nm gCCtest | grep TestClass

[86]  | 134550070| 41|FUNC |GLOB |0 |12 |_ZN9TestClassC1EPc
[110] | 134550180| 68|FUNC |GLOB |0 |12 |_ZN9TestClassC1Ei
[114] | 134549984| 43|FUNC |GLOB |0 |12 |_ZN9TestClassC1Ev
    ...

g++ symbols with gc++filt:

# nm gCCtest | grep TestClass | gc++filt
[86]  | 134550070| 41|FUNC |GLOB |0 |12 |TestClass::TestClass(char*)
[110] | 134550180| 68|FUNC |GLOB |0 |12 |TestClass::TestClass(int)
[114] | 134549984| 43|FUNC |GLOB |0 |12 |TestClass::TestClass()
    ...

And finally, displaying symbols with nm -C:

[64] | 134549344| 71|FUNC |GLOB |0 |9 |TestClass::TestClass()
                                       [__1cJTestClass2t6M_v_]
[87] | 134549424| 70|FUNC |GLOB |0 |9 |TestClass::TestClass(const char*)
                                       [__1cJTestClass2t6Mpkc_v_]
[57] | 134549504| 95|FUNC |GLOB |0 |9 |TestClass::TestClass(int)
                                       [__1cJTestClass2t6Mi_v_]
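If you want to demangle names from inside a program instead of piping through a filter, the C++ runtime that ships with g++ exposes a demangler. The following sketch wraps abi::__cxa_demangle() from the cxxabi.h header. The wrapper name demangle_gnu() is invented for this example. This interface understands the GNU mangling scheme shown in the g++ listings above. It is not expected to understand Sun Studio names, so use dem or c++filt for those.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle a GNU-style mangled C++ symbol name.
// Returns the input unchanged if __cxa_demangle reports a failure.
std::string demangle_gnu(const std::string& mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);  // the demangler allocates its result with malloc()
    return result;
}
```

For example, demangle_gnu("_ZN9TestClassC1EPc") returns TestClass::TestClass(char*), which matches the gc++filt output shown earlier.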

Let's use this information to create a DTrace script to perform an aggregation on the object calls associated with our test program. We can use the DTrace pid provider to enable probes associated with our mangled C++ symbols.

To test our constructor/destructor theory, let's start by counting the following:

  • The number of objects created -- calls to new()

  • The number of objects destroyed -- calls to delete()

Use the following script to extract the symbols corresponding to the new() and delete() functions from the CCtest program:

# dem `nm CCtest | awk -F\| '{ print $NF; }'` | egrep "new|delete"
__1c2k6Fpv_v_ == void operator delete(void*)
__1c2n6FI_pv_ == void*operator new(unsigned)

The corresponding DTrace script is used to enable probes on new() and delete() (saved as CCagg.d):

#!/usr/sbin/dtrace -s

/* count calls to operator new */
pid$1::__1c2n6FI_pv_:entry
{
    @n[probefunc] = count();
}

/* count calls to operator delete */
pid$1::__1c2k6Fpv_v_:entry
{
    @d[probefunc] = count();
}

END
{
    printa(@n);
    printa(@d);
}

Start the CCtest program in one window, then execute the script we just created in another window as follows:

# dtrace -s ./CCagg.d `pgrep CCtest` | c++filt

The DTrace output is piped through c++filt to demangle the C++ symbols, with the following caution.


Caution - You can't exit the DTrace script with a ^C as you would do normally because c++filt will be killed along with DTrace and you're left with no output. To display the output of this command, go to another window on your system and type:

# pkill dtrace

Use this sequence of steps for the rest of the exercises:

Window 1:

# ./CCtest

Window 2:

# dtrace -s scriptname `pgrep CCtest` | c++filt

Window 3:

# pkill dtrace

The output of our aggregation script in window 2 should look like this:

void*operator new(unsigned)                          12
void operator delete(void*)                           8

So, we may be on the right track with the theory that we are creating more objects than we are deleting.

Let's check the memory addresses of our objects and attempt to match the instances of new() and delete(). The DTrace argument variables are used to display the addresses associated with our objects. Since a pointer to the object is contained in the return value of new(), we should see the same pointer value as arg0 in the call to delete(). With a slight modification to our initial script, we now have the following script, named CCaddr.d:

#!/usr/sbin/dtrace -s

#pragma D option quiet
/*
__1c2k6Fpv_v_ == void operator delete(void*)
__1c2n6FI_pv_ == void*operator new(unsigned)
*/

/* return from new() */
pid$1::__1c2n6FI_pv_:return
{
    printf("%s: %x\n", probefunc, arg1);
}

/* call to delete() */
pid$1::__1c2k6Fpv_v_:entry
{
    printf("%s: %x\n", probefunc, arg0);
}

Execute this script:

# dtrace -s ./CCaddr.d `pgrep CCtest` | c++filt

Wait for a bit, then type this in window 3:

# pkill dtrace

Our output looks like a repeating pattern of three calls to new() and two calls to delete():

void*operator new(unsigned): 809e480
void*operator new(unsigned): 8068a70
void*operator new(unsigned): 809e4a0
void operator delete(void*): 8068a70
void operator delete(void*): 809e4a0

As you inspect the repeating output, a pattern emerges. It seems that the first new() of the repeating pattern does not have a corresponding call to delete(). At this point we have identified the source of the memory leak!
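Matching addresses by eye works for five lines of output, but a longer trace is easier to check mechanically. The following C++ sketch shows one way to do the pairing: keep a set of addresses returned by new(), and remove an address when it is passed to delete(). Whatever remains was never freed. The Op type and the unmatched_allocations() function are names invented for this illustration. The sketch assumes that you have already parsed the DTrace output into (operation, address) pairs.

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Which operator a trace line came from.
enum class Op { New, Delete };

// Return the addresses that new() returned but that were never passed to
// delete(). An address that is freed and later reused is handled correctly,
// because it is erased on delete and inserted again on the next new.
std::set<std::uintptr_t> unmatched_allocations(
        const std::vector<std::pair<Op, std::uintptr_t>>& log) {
    std::set<std::uintptr_t> live;
    for (const auto& entry : log) {
        if (entry.first == Op::New) {
            live.insert(entry.second);
        } else {
            live.erase(entry.second);
        }
    }
    return live;
}
```

Given the five lines shown above (new at 809e480, 8068a70, and 809e4a0, followed by delete at 8068a70 and 809e4a0), the function returns a set that contains only 809e480.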

Let's continue with DTrace and see what else we can learn from this information. We still do not know what type of class is associated with the object created at address 809e480. Including a call to ustack() on entry to new() provides a hint. Here's the modification to our previous script, renamed CCstack.d:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
__1c2k6Fpv_v_ == void operator delete(void*)
__1c2n6FI_pv_ == void*operator new(unsigned)
*/

pid$1::__1c2n6FI_pv_:entry
{
    ustack();
}

pid$1::__1c2n6FI_pv_:return
{
    printf("%s: %x\n", probefunc, arg1);
}

pid$1::__1c2k6Fpv_v_:entry
{
    printf("%s: %x\n", probefunc, arg0);
}

Execute CCstack.d in Window 2, then type pkill dtrace in Window 3 to print the following output:

# dtrace -s ./CCstack.d `pgrep CCtest` | c++filt


             libCrun.so.1`void*operator new(unsigned)
             CCtest`main+0x19
             CCtest`0x8050cda
void*operator new(unsigned): 80a2bd0

             libCrun.so.1`void*operator new(unsigned)
             CCtest`main+0x57
             CCtest`0x8050cda
void*operator new(unsigned): 8068a70

             libCrun.so.1`void*operator new(unsigned)
             CCtest`main+0x9a
             CCtest`0x8050cda
void*operator new(unsigned): 80a2bf0
void operator delete(void*): 8068a70
void operator delete(void*): 80a2bf0

The ustack() data tells us that new() is called from main+0x19, main+0x57, and main+0x9a -- we're interested in the object associated with the first call to new(), at main+0x19.

To determine the type of constructor called at main+0x19, we can use mdb as follows:

# gcore `pgrep CCtest`
gcore: core.1478 dumped
# mdb core.1478
Loading modules: [ libc.so.1 ld.so.1 ]
> main::dis
main:          pushl  %ebp
main+1:        movl   %esp,%ebp
main+3:        subl   $0x38,%esp
main+6:        movl   %esp,-0x2c(%ebp)
main+9:        movl   %ebx,-0x30(%ebp)
main+0xc:      movl   %esi,-0x34(%ebp)
main+0xf:      movl   %edi,-0x38(%ebp)
main+0x12:     pushl  $0x8
main+0x14:     call   -0x2e4   <PLT=libCrun.so.1`__1c2n6FI_pv_>
main+0x19:     addl   $0x4,%esp
main+0x1c:     movl   %eax,-0x10(%ebp)
main+0x1f:     movl   -0x10(%ebp),%eax
main+0x22:     pushl  %eax
main+0x23:     call   +0x1d5   <__1cJTestClass2t5B6M_v_>
    ...

Our constructor is called after the call to new, at offset main+0x23. So, we have identified an object, built by the constructor __1cJTestClass2t5B6M_v_, that is never destroyed. Using dem to demangle this symbol produces:

# dem __1cJTestClass2t5B6M_v_
__1cJTestClass2t5B6M_v_ == TestClass::TestClass #Nvariant 1()

Thus, a call to new TestClass() at main+0x19 is the cause of the memory leak. Examining the CCtest.cc source file reveals:

...
t = new TestClass();
cout << t->ClassName();

t = new TestClass((const char *)"Hello.");
cout << t->ClassName();

tt = new TestClass((const char *)"Goodbye.");
cout << tt->ClassName();

delete(t);
delete(tt);
...

It's clear that the first use of the variable t = new TestClass(); is overwritten by the second use: t = new TestClass((const char *)"Hello.");. The memory leak has been identified and a fix can be implemented.
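One possible fix, sketched below, is to let a smart pointer own each object so that reassigning the pointer destroys the object it previously held. This is only one option. Adding a delete(t) before the second assignment in CCtest.cc would also work. The Tracked class is a hypothetical stand-in for TestClass that counts live instances, so that the two loop bodies can be compared without a debugger. It is not part of the CCtest sources.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for TestClass that counts live instances.
class Tracked {
public:
    explicit Tracked(const std::string& name) : name_(name) { ++live; }
    ~Tracked() { --live; }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
    const std::string& ClassName() const { return name_; }
    static int live;  // number of objects constructed but not yet destroyed
private:
    std::string name_;
};

int Tracked::live = 0;

// One pass through the loop body with each object owned by a unique_ptr.
void loop_body_fixed() {
    std::unique_ptr<Tracked> t(new Tracked("empty."));
    t.reset(new Tracked("Hello."));  // the first object is deleted here
    std::unique_ptr<Tracked> tt(new Tracked("Goodbye."));
}   // the remaining two objects are deleted when t and tt go out of scope

// The original pattern, for comparison: the first object is never deleted.
void loop_body_leaky() {
    Tracked* t = new Tracked("empty.");
    t = new Tracked("Hello.");       // the pointer to the first object is lost
    Tracked* tt = new Tracked("Goodbye.");
    delete t;
    delete tt;
}
```

After loop_body_fixed() returns, Tracked::live is 0. Each call to loop_body_leaky() leaves it one higher, which mirrors the three calls to new() and two calls to delete() seen in the DTrace output.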

The DTrace pid provider allows you to enable a probe at any instruction associated with a process that is being examined. This example is intended to model the DTrace approach to interactive process debugging. DTrace features used in this example include: aggregations, displaying function arguments and return values, and viewing the user call stack. The dem and c++filt commands in Sun Studio software and the gc++filt in gcc were used to extract the function probes from the program symbol table and display the DTrace output in a source-compatible format. Source files created for this example:

Example 9-1 TestClass.h
class TestClass
{
    public:
        TestClass();
        TestClass(const char *name);
        TestClass(int i);
        virtual ~TestClass();
        virtual char *ClassName() const;
    private:
        char *str;
};

TestClass.cc:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include "TestClass.h"

TestClass::TestClass() {
    str=strdup("empty.");
}

TestClass::TestClass(const char *name) {
    str=strdup(name);
}

TestClass::TestClass(int i) {
    str=(char *)malloc(128);
    sprintf(str, "Integer = %d", i);
}

TestClass::~TestClass() {
    if ( str )
        free(str);
}

char *TestClass::ClassName() const {
    return str;
}
Example 9-2 CCtest.cc
#include <iostream.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "TestClass.h"

int main(int argc, char **argv)
{
    TestClass *t;
    TestClass *tt;

    while (1) {
        t = new TestClass();
        cout << t->ClassName();

        t = new TestClass((const char *)"Hello.");
        cout << t->ClassName();

        tt = new TestClass((const char *)"Goodbye.");
        cout << tt->ClassName();

        delete(t);
        delete(tt);
        sleep(1);
    }
}
Example 9-3 Makefile
OBJS=CCtest.o TestClass.o
PROGS=CCtest

CC=CC

all: $(PROGS)
    echo "Done."

clean:
    rm $(OBJS) $(PROGS)

CCtest: $(OBJS)
    $(CC) -o CCtest $(OBJS)

.cc.o:
    $(CC) $(CFLAGS) -c $<
Module 10 Managing Memory with DTrace and MDB

Objectives

This module will build on what we've learned about using DTrace to observe processes by examining a page fault. Then, we'll incorporate low-level debugging with MDB to find the problem in the code.

Additional Resources

Solaris Modular Debugger Guide. Sun Microsystems, Inc., 2007.

Software Memory Management

OpenSolaris memory management uses software constructs called segments to manage virtual memory of processes as well as the kernel itself. Most of the data structures involved in the software side of memory management are defined in /usr/include/vm/*.h. In this module, we'll examine the code and data structures used to handle page faults.

The particular fault shown in this module is a major page fault, that is, it results in I/O on the disk.

By contrast, a minor page fault does not result in I/O.

For example, paging in a page of code for an executable is a major fault.

Faulting a new heap page is a minor fault. Heap pages can simply be allocated and zeroed out (no need to access the disk).
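You can observe the two kinds of faults from user code before turning to DTrace. The following C++ sketch reads the minor and major fault counters that getrusage() reports for the calling process, touches a freshly allocated heap buffer, and returns how much each counter changed. The FaultCounts type and both function names are invented for this illustration. How the counters behave depends on the platform, and some environments may report zero, so treat the numbers as indicative only.

```cpp
#include <sys/resource.h>
#include <cstddef>
#include <vector>

// Page fault counters for the calling process.
struct FaultCounts {
    long minor;  // ru_minflt: faults serviced without I/O
    long major;  // ru_majflt: faults that required I/O
};

// Read the current counters; returns zeros if getrusage() fails.
FaultCounts current_faults() {
    struct rusage usage = {};
    FaultCounts counts = {0, 0};
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        counts.minor = static_cast<long>(usage.ru_minflt);
        counts.major = static_cast<long>(usage.ru_majflt);
    }
    return counts;
}

// Allocate a heap buffer, write to every byte so that every page is
// touched, and return the change in the fault counters.
// Assumption: bytes is greater than zero.
FaultCounts faults_from_touching_heap(std::size_t bytes) {
    FaultCounts before = current_faults();
    std::vector<char> buffer(bytes, 1);
    volatile char sink = buffer[bytes - 1];  // keep the writes from being optimized away
    (void)sink;
    FaultCounts after = current_faults();
    FaultCounts delta = {after.minor - before.minor, after.major - before.major};
    return delta;
}
```

On a typical system, calling faults_from_touching_heap() with a buffer of several megabytes tends to raise the minor count while leaving the major count unchanged, because new heap pages are zero-filled in memory and not read from disk.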

Using DTrace and MDB to Examine Virtual Memory

The objective of this lab is to examine a page fault using DTrace and MDB.

Summary

We'll start with a DTrace script to trace the actions of a single page fault for a given process. The script prints the user virtual address that caused the fault, and then traces every function that is called from the time of the fault until the page fault handler returns. We'll use the output of the script to determine what source code needs to be examined for more detail.


Note - In this module, we've added text to the extensive code output to guide the exercise. Look for the <-- symbol to find associated text in the output.


DTracing a Page Fault for a Single Process
  1. Open a terminal window.
  2. Create a file called pagefault.d with the following script:
    #!/usr/sbin/dtrace -s
    
    #pragma D option flowindent
    
    pagefault:entry
    /execname == $$1/
    {
        printf("fault occurred on address = %p\n", args[0]);
        self->in = 1;
    }
    
    pagefault:return
    /self->in == 1/
    {
        self->in = 0;
        exit(0);
    }
    
    entry
    /self->in == 1/
    {
    }
    
    return
    /self->in == 1/
    {
    }
  3. Run the script on Mozilla.

    Note - You need to specify mozilla-bin as the executable name, as mozilla is not an exact match with the name. Also, assertions are turned on, so you'll see various calls to mutex_owner(), for instance, which is only used with ASSERT(). Assertions are turned on only for debug kernels.


    # ./pagefault.d mozilla-bin
    dtrace: script './pagefault.d' matched 42626 probes
    CPU FUNCTION                                 
      0  -> pagefault              fault occurred on address = fb985ea2
    
      0   | pagefault:entry <-- i86pc/vm/vm_machdep.c or sun4/vm/vm_dep.c
      0    -> as_fault   <-- generic address space fault common/vm/vm_as.c
      0      -> as_segat                          
      0        -> avl_find  <-- segments are in AVL tree
      0          -> as_segcompar          <-- search segments for segment
      0          <- as_segcompar          <-- containing fault address
      0          -> as_segcompar          <-- common/vm/vm_as.c
      0          <- as_segcompar                  
      0          -> as_segcompar                  
      0          <- as_segcompar                  
      0          -> as_segcompar                  
      0          <- as_segcompar                  
      0          -> as_segcompar                  
      0          <- as_segcompar                  
      0          -> as_segcompar                  
      0          <- as_segcompar                  
      0          -> as_segcompar                  
      0          <- as_segcompar                  
      0          -> as_segcompar                  
      0          <- as_segcompar                  
      0        <- avl_find                        
      0      <- as_segat                          
      0      -> segvn_fault<-- segment containing fault is found, (not SEGV)
                               <-- common/vm/seg_vn.c
      0        -> hat_probe <-- look for page table entry for page
                              <-- i86pc/vm/hat_i86.c or sfmmu/vm/hat_sfmmu.c
      0          -> htable_getpage   <-- page tables are hashed on x86
      0            -> htable_getpte  <-- i86pc/vm/htable.c
      0              -> htable_lookup             
      0              <- htable_lookup             
      0              -> htable_va2entry           
      0              <- htable_va2entry           
      0              -> x86pte_get   <-- return a page table entry
      0                -> x86pte_access_pagetable 
      0                  -> hat_kpm_pfn2va        
      0                  <- hat_kpm_pfn2va        
      0                <- x86pte_access_pagetable 
      0                -> x86pte_release_pagetable 
      0                <- x86pte_release_pagetable 
      0              <- x86pte_get                
      0            <- htable_getpte               
      0          <- htable_getpage                
      0          -> htable_release                
      0          <- htable_release                
      0        <- hat_probe                       
      0        -> fop_getpage <-- file operation to retrieve page(s)
      0          -> ufs_getpage<--file in ufs fs(common/fs/ufs/ufs_vnops.c)
      0            -> bmap_has_holes  <-- check for sparse file
      0            <- bmap_has_holes              
      0            -> page_lookup     <-- check for page already in memory
      0              -> page_lookup_create  <-- common/vm/vm_page.c
      0              <- page_lookup_create  <-- create page if needed
      0            <- page_lookup                 
      0            -> ufs_getpage_miss <-- page wasn't in memory
      0              -> bmap_read  <-- get block number of page from inode
      0                -> bread_common            
      0                  -> getblk_common         
      0                  <- getblk_common             
      0                <- bread_common                
      0              <- bmap_read                         
      0              -> pvn_read_kluster <-- read pages (common/vm/vm_pvn.c)
      0               -> page_create_va  <-- create some pages                
      0               <- page_create_va                  
      0               -> segvn_kluster                   
      0               <- segvn_kluster                   
      0              <- pvn_read_kluster                  
      0             -> pageio_setup <-- setup page(s) for io common/os/bio.c
      0             <- pageio_setup                      
      0             -> lufs_read_strategy <-- logged ufs read
      0              -> bdev_strategy <-- read device common/os/driver.c
      0               -> cmdkstrategy <-- common disk driver (cmdk(7D))
                                         <-- common/io/dktp/disk/cmdk.c
      0                -> dadk_strategy  <-- direct attached disk (dad(7D))
                              <-- for ide disks(common/io/dktp/dcdev/dadk.c)
                       <-- driver sets up dma and starts page in
      0                <- dadk_strategy               
      0               <- cmdkstrategy                  
      0              <- bdev_strategy                   
      0             -> biowait  <-- wait for pagein complete common/os/bio.c
      0              -> sema_p  <-- wakeup sema_v from completion interrupt
      0               -> swtch  <-- let someone else run(common/disp/disp.c)
      0                -> disp  <-- dispatch to next thread to run
      0                <- disp                              
      0                -> resume <-- actual switching occurs here
                                 <-- intel/ia32/ml/swtch.s 
      0                 -> savectx <-- save old context
      0                 <- savectx     
                        <-- someone else is running here...                  
      0                 -> restorectx  <-- restore context (we're awakened)
      0                 <- restorectx    
      0                <- resume                              
      0               <- swtch                                 
      0              <- sema_p                                
      0             <- biowait                               
      0             -> pageio_done <-- undo pageio_setup
      0             <- pageio_done                           
      0             -> pvn_plist_init                        
      0             <- pvn_plist_init                        
      0            <- ufs_getpage_miss <-- page is in memory
      0           <- ufs_getpage                           
      0          <- fop_getpage                           
      0       -> segvn_faultpage <-- call hat to load pte(s) for page(s)
      0        -> hat_memload                         
      0         -> page_pptonum  <-- get page frame number
      0         <- page_pptonum                      
      0         -> hati_mkpte    <-- build page table entry                  
      0         <- hati_mkpte                      
      0         -> hati_pte_map  <-- locate entry in page table
      0          -> x86_hm_enter                  
      0          <- x86_hm_enter                  
      0          -> hment_prepare                 
      0          <- hment_prepare                 
      0          -> x86pte_set   <-- fill in pte into page table
      0            -> x86pte_access_pagetable     
      0              -> hat_kpm_pfn2va            
      0              <- hat_kpm_pfn2va            
      0            <- x86pte_access_pagetable     
      0            -> x86pte_release_pagetable    
      0            <- x86pte_release_pagetable    
      0          <- x86pte_set                    
      0          -> hment_assign                  
      0          <- hment_assign                  
      0          -> x86_hm_exit                   
      0          <- x86_hm_exit                   
      0         <- hati_pte_map                    
      0        <- hat_memload                         
      0       <- segvn_faultpage                       
      0      <- segvn_fault                           
      0     <- as_fault                              
      0   <- pagefault                             
    
    # 

    Remember that the above output has been shortened. At a high level, the following has happened on the page fault:

    • The pagefault() routine is called to handle page faults.

    • The pagefault() routine calls as_fault() to handle faults on a given address space.

    • as_fault() walks an AVL tree of seg structures looking for a segment containing the faulting address. If no such segment is found, the process is sent a SIGSEGV (segmentation violation) signal.

    • If the segment is found, a segment-specific fault handler is called. For most segments, this is segvn_fault().

    • segvn_fault() looks for the faulting page already in memory. If the page already exists (but has been freed), it is "reclaimed" off the free list. If the page does not already exist, we need to page it in. Here, the page is not already in memory, so we call ufs_getpage().

    • ufs_getpage() finds the block number(s) of the page(s) within the file system by calling bmap_read().

    • Then we call a device driver strategy routine. See strategy(9E) for an overview of what the strategy routine is supposed to do.

    • While the page is being read, the thread causing the page fault blocks (i.e., switches out) via a call to swtch(). At this point, other threads will run.

    • When the paging I/O has completed, the disk driver interrupt handler wakes up the blocked mozilla-bin thread.

    • The disk driver returns through the file system code out to segvn_fault().

    • segvn_fault() then calls segvn_faultpage().

    • segvn_faultpage() calls the HAT (Hardware Address Translation) layer to load the page table entries (PTEs) for the page.

    • At this point, the virtual address that caused the page fault should now be mapped to a valid physical page. When pagefault() returns, the instruction causing the page fault will be retried and should now complete successfully.
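The segment search described in the list above can be illustrated in user space. The following is a minimal sketch, not the kernel's code: struct seg_sketch, seg_compare(), and seg_lookup() are hypothetical names, and a binary search over a sorted array stands in for the kernel's AVL tree. It only shows the range test that a comparison routine like as_segcompar() has to make: is the fault address below the segment, inside it, or past its end?

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's struct seg. */
struct seg_sketch {
	uintptr_t s_base;	/* first virtual address in the segment */
	size_t s_size;		/* length of the segment in bytes */
};

/*
 * Compare a fault address against one segment: negative if the
 * address lies below the segment, positive if it lies at or past
 * the segment's end, zero if it falls inside.
 */
int
seg_compare(const void *key, const void *elem)
{
	uintptr_t addr = *(const uintptr_t *)key;
	const struct seg_sketch *seg = elem;

	if (addr < seg->s_base)
		return (-1);
	if (addr - seg->s_base >= seg->s_size)
		return (1);
	return (0);
}

/*
 * Find the segment containing addr in an array sorted by s_base.
 * Returns NULL when no segment matches, which is the case where the
 * process would receive SIGSEGV. (Assumption: segments do not overlap.)
 */
const struct seg_sketch *
seg_lookup(const struct seg_sketch *segs, size_t nsegs, uintptr_t addr)
{
	return (bsearch(&addr, segs, nsegs, sizeof (segs[0]), seg_compare));
}
```

With a table that contains a segment of base 0xfb800000 and size 0x561000, calling seg_lookup() with the fault address 0xfb985ea2 from the trace returns that segment, while an address in a gap between segments returns NULL.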

  4. Use mdb to examine the kernel data structures and locate the page of physical memory that corresponds to the fault as follows:
    1. Open a terminal window.
    2. Find the number of segments used by mozilla by using pmap as follows:
      # pmap -x `pgrep mozilla-bin` | wc
           368    2730   23105
      #

      The output shows that there are roughly 368 segments (the line count from pmap -x also includes a few header and total lines).


      Note - The search for the segment containing the fault address found the correct segment after examining only 8 segments. See the eight calls to as_segcompar() in the DTrace output above. Using an AVL tree shortens the search!


    3. Use mdb to locate the segment containing the fault address.

      Note - If you want to follow along, you may want to use: ::log /tmp/logfile in mdb and then !vi /tmp/logfile to search. Or, you can just run mdb within an editor buffer.


      # mdb -k
      Loading modules: [ unix krtld genunix specfs dtrace 
      ufs ip sctp usba random fctl s1394
       nca lofs crypto nfs audiosup sppp cpc fcip ptm ipc ]
      > ::ps !grep mozilla-bin  <-- find the mozilla-bin process
      R   933   919   887   885   100 0x42014000 ffffffff81d6a040 mozilla-bin
      
      > ffffffff81d6a040::print proc_t p_as | ::walk seg | ::print struct seg
         <-- Lots of output has been omitted... -->
      {
          s_base = 0xfb800000  <-- the seg we want, fault addr (fb985ea2)
          s_size = 0x561000    <-- greater/equal to base and < base+size
          s_szc = 0
          s_flags = 0
          s_as = 0xffffffff828b61d0
          s_tree = {
              avl_child = [ 0xffffffff82fa7920, 0xffffffff82fa7c80 ]
              avl_pcb = 0xffffffff82fa796d
          }
          s_ops = segvn_ops
          s_data = 0xffffffff82d85070
      }
          <-- and lots more output omitted -->
      
      > ffffffff82d85070::print segvn_data_t  <-- from s_data
      {
          lock = {
              _opaque = [ 0 ]
          }
          segp_slock = {
              _opaque = [ 0 ]
          }
          pageprot = 0x1
          prot = 0xd
          maxprot = 0xf
          type = 0x2
          offset = 0
          vp = 0xffffffff82f9e480   <-- points to a vnode_t
          anon_index = 0
          amp = 0   <-- we'll look at anonymous space later
          vpage = 0xffffffff82552000
          cred = 0xffffffff81f95018
          swresv = 0
          advice = 0
          pageadvice = 0x1
          flags = 0x490
          softlockcnt = 0
          policy_info = {
              mem_policy = 0x1
              mem_reserved = 0
          }
      }
      
      > ffffffff82f9e480::print vnode_t v_path
      v_path = 0xffffffff82f71090 
      "/usr/sfw/lib/mozilla/components/libgklayout.so"
      
      > fb985ea2-fb800000=K  <-- offset within segment
                      185ea2 <-- rounding down gives 185000 (4kpage size)
      
      > ffffffff82f9e480::walk page !wc  <-- walk list of pages on vnode_t
          1236    1236   21012  <-- 1236 pages,(not all are necessarily valid)
      
      > ffffffff82f9e480::walk page | ::print page_t<-- walk pg list on vnode
          <-- lots of pages omitted in output -->
      {
          p_offset = 0x185000  <-- here is matching page
          p_vnode = 0xffffffff82f9e480
          p_selock = 0
          p_selockpad = 0
          p_hash = 0xfffffffffae21c00
          p_vpnext = 0xfffffffffaca9760
          p_vpprev = 0xfffffffffb3467f8
          p_next = 0xfffffffffad8f800
          p_prev = 0xfffffffffad8f800
          p_lckcnt = 0
          p_cowcnt = 0
          p_cv = {
              _opaque = 0
          }
          p_io_cv = {
              _opaque = 0
          }
          p_iolock_state = 0
          p_szc = 0
          p_fsdata = 0
          p_state = 0
          p_nrm = 0x2
          p_embed = 0x1
          p_index = 0
          p_toxic = 0
          p_mapping = 0xffffffff82d265f0
          p_pagenum = 0xbd62  <-- the page frame number of page
          p_share = 0
          p_sharepad = 0
          p_msresv_1 = 0
          p_mlentry = 0x185
          p_msresv_2 = 0
      }
      
          <-- and lots more output omitted -->
      
      > bd62*1000=K  <-- multiply page frame number by page size (hex)
                      bd62000  <-- here is physical address of page
      
      > bd62000+ea2,10/K  <-- dump 16 64-bit hex values at physical address
      0xbd62ea2:      2ccec81ec8b55   e8575653f0e48300 32c3815b00000000 
                      5d89d46589003ea7 840ff6850c758be0 e445c7000007df  
                      1216e8000000    dbe850e4458d5650 7d830cc483ffeeea 
                      791840f00e4     c085e8458904468b 500c498b088b2474 
                      8b17eb04c483d1ff e8458de05d8bd465 c483ffeeeac8e850 
                      458b0000074ce904 
      
      > bd62000+ea2,10/ai  <-- data looks like code, let's try dumping as code
      0xbd62ea2:      
      0xbd62ea2:      pushq  %rbp
      0xbd62ea3:      movl   %esp,%ebp
      0xbd62ea5:      subl   $0x2cc,%esp
      0xbd62eab:      andl   $0xfffffff0,%esp
      0xbd62eae:      pushq  %rbx
      0xbd62eaf:      pushq  %rsi
      0xbd62eb0:      pushq  %rdi
      0xbd62eb1:      call   +0x5     <0xbd62eb6>
      0xbd62eb6:      popq   %rbx
      0xbd62eb7:      addl   $0x3ea732,%ebx
      0xbd62ebd:      movl   %esp,-0x2c(%rbp)
      0xbd62ec0:      movl   %ebx,-0x20(%rbp)
      0xbd62ec3:      movl   0xc(%rbp),%esi
      0xbd62ec6:      testl  %esi,%esi
      0xbd62ec8:      je     +0x7e5   <0xbd636ad>
      0xbd62ece:      movl   $0x0,-0x1c(%rbp)
      
      > ffffffff81d6a040::context <--change context from kernel to mozilla-bin
      debugger context set to proc ffffffff81d6a040, the address of process 
      
      > fb985ea2,10/ai <-- and dump from faulting virtual address
      0xfb985ea2:     
      0xfb985ea2:     pushq  %rbp  <-- looks like a match
      0xfb985ea3:     movl   %esp,%ebp
      0xfb985ea5:     subl   $0x2cc,%esp
      0xfb985eab:     andl   $0xfffffff0,%esp
      0xfb985eae:     pushq  %rbx
      0xfb985eaf:     pushq  %rsi
      0xfb985eb0:     pushq  %rdi
      0xfb985eb1:     call   +0x5     <0xfb985eb6>
      0xfb985eb6:     popq   %rbx
      0xfb985eb7:     addl   $0x3ea732,%ebx
      0xfb985ebd:     movl   %esp,-0x2c(%rbp)
      0xfb985ec0:     movl   %ebx,-0x20(%rbp)
      0xfb985ec3:     movl   0xc(%rbp),%esi
      0xfb985ec6:     testl  %esi,%esi
      0xfb985ec8:     je     +0x7e5   <0xfb9866ad>
      0xfb985ece:     movl   $0x0,-0x1c(%rbp)
      
      > 0::context
      debugger context set to kernel
      
      > ffffffff81d6a040::print proc_t p_as <-- get as for mozilla-bin
      p_as = 0xffffffff828b61d0
      
      > fb985ea2::vtop -a ffffffff828b61d0  <-- check our work
      virtual fb985ea2 mapped to physical bd62ea2  <--physical address matches

      Once the segment is found, we print the segvn_data structure. In this segment, a vnode_t maps the segment data. The vnode_t contains a list of pages that "belong to" the vnode_t. We locate the page corresponding to the offset within the segment. Once the page_t is located, we have the page frame number. We then convert the page frame number to a physical address and examine some of the data at the address. It turns out this data is code. We then check the physical address by using the vtop (virtual-to-physical) mdb command.

    4. Extra credit: walk the page tables of the process to see how a virtual address gets translated into a physical one.
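
The address arithmetic performed by hand in the mdb session can also be written down as a few small C helpers. This is a sketch under stated assumptions: 4 KB pages (as in the session above), and, for pt_index(), the standard amd64 four-level page table layout with 9 index bits per level. The helper names are hypothetical and are not kernel functions. The last helper is a starting point for the extra-credit exercise, since it shows which entry would be consulted at each level of the page table walk.

```c
#include <stdint.h>

#define	SK_PAGESHIFT	12			/* assumes 4 KB pages */
#define	SK_PAGESIZE	((uint64_t)1 << SK_PAGESHIFT)
#define	SK_PAGEOFFSET	(SK_PAGESIZE - 1)

/* Byte offset of a virtual address within its segment. */
uint64_t
seg_offset(uint64_t va, uint64_t seg_base)
{
	return (va - seg_base);
}

/* Round an offset down to a page boundary (compare with p_offset). */
uint64_t
page_align(uint64_t off)
{
	return (off & ~SK_PAGEOFFSET);
}

/* Combine a page frame number with the in-page offset of va. */
uint64_t
pfn_to_pa(uint64_t pfn, uint64_t va)
{
	return ((pfn << SK_PAGESHIFT) | (va & SK_PAGEOFFSET));
}

/*
 * Index into the page table at a given level, where level 0 is the
 * lowest level. Assumes 9 index bits per level (amd64 layout).
 */
unsigned
pt_index(uint64_t va, unsigned level)
{
	return ((unsigned)((va >> (SK_PAGESHIFT + 9 * level)) & 0x1ff));
}
```

For the fault address in this lab, seg_offset(0xfb985ea2, 0xfb800000) is 0x185ea2, page_align() of that is 0x185000, and pfn_to_pa(0xbd62, 0xfb985ea2) is 0xbd62ea2, which matches the ::vtop result. Under the four-level assumption, the indices for 0xfb985ea2 from the top level down are 0, 3, 0x1dc, and 0x185.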
Module 11 Debugging Drivers With DTrace

Objectives

The objective of this module is to learn how to use DTrace to debug your driver development projects by reviewing a case study.

Porting the smbfs Driver from Linux to the Solaris OS

This case study focuses on leveraging the DTrace capability for device driver development.

Historically, debugging a device driver required that a developer use function calls like cmn_err() to log diagnostic information to the /var/adm/messages file. This cumbersome process requires guesswork, re-compilation, and system reboots to uncover software coding errors. Developers with a talent for assembly language can use adb and create custom modules in C for mdb to diagnose software errors. However, historical approaches to kernel development and debugging are quite time-consuming.

DTrace provides a diagnostic short-cut. Instead of sifting through the /var/adm/messages file or pages of truss output, DTrace can be used to capture information on only the events that you as a developer wish to view. The magnitude of the benefit provided by DTrace can best be shown through a few simple examples.

First, create an smbfs driver template based on Sun's nfs driver. After the driver compiles successfully, test that the driver can be loaded and unloaded. To do so, copy the prototype driver to /usr/kernel/fs and attempt to modload it by hand:

# modload /usr/kernel/fs/smbfs
    can't load module: Out of memory or no room in system tables

And the /var/adm/messages file contains:

genunix: [ID 104096 kern.warning] WARNING: system call missing from bind file

Searching the source for the "system call missing" message reveals that it is in the function mod_getsysent() in the file modconf.c, on a failed call to mod_getsysnum(). Instead of manually following the flow of mod_getsysnum() from source file to source file, here's a simple DTrace script that enables all entry and return events in the fbt (Function Boundary Tracing) provider once mod_getsysnum() is entered.

#!/usr/sbin/dtrace -s

    #pragma D option flowindent

    fbt::mod_getsysnum:entry
    /execname == "modload"/
    {
        self->follow = 1;
    }

    fbt::mod_getsysnum:return
    /self->follow/
    {
        self->follow = 0;
        trace(arg1);
    }

    fbt:::entry
    /self->follow/
    {
    }
    
    fbt:::return
    /self->follow/
    {
        trace(arg1);
    }

Note - trace(arg1) displays the function's return value.


Executing this script and running the modload command in another window produces the following output:

 # ./mod_getsysnum.d
    dtrace: script './mod_getsysnum.d' matched 35750 probes


    CPU FUNCTION
      0  -> mod_getsysnum
      0    -> find_mbind
      0     -> nm_hash
      0     <- nm_hash                                          41
      0     -> strcmp
      0     <- strcmp                                   4294967295
      0     -> strcmp
      0     <- strcmp                                            7
      0    <- find_mbind                                          0
      0  <- mod_getsysnum                                4294967295

Thus either find_mbind() returning '0' or nm_hash() returning '41' is the culprit. A quick look at find_mbind() reveals that a return value of 0 indicates an error state. The source of find_mbind() in /usr/src/uts/common/os/modsubr.c shows that we're searching for a character string in a hash table. Let's use DTrace to display the contents of the search string and hash table.
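
The return values in the trace are printed as unsigned integers, so a value such as 4294967295 is easier to interpret once its low 32 bits are read as a signed C int. The helper below is a small user-space sketch (as_signed32() is a hypothetical name, not part of DTrace or the kernel) that performs this reinterpretation without relying on implementation-defined conversions.

```c
#include <stdint.h>

/*
 * Reinterpret the low 32 bits of a traced value as a signed
 * 32-bit integer (two's complement).
 */
int32_t
as_signed32(uint64_t traced)
{
	uint32_t low = (uint32_t)traced;

	if (low <= (uint32_t)INT32_MAX)
		return ((int32_t)low);
	/* Values above INT32_MAX wrap around to negative numbers. */
	return ((int32_t)((int64_t)low - ((int64_t)1 << 32)));
}
```

Applied to the trace, as_signed32(4294967295) is -1. For strcmp() that simply means the strings differ, and for mod_getsysnum() a result of -1 is consistent with a lookup that found nothing.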

To view the contents of the search string we add a strcmp() trace to our previous mod_getsysnum.d script:

fbt::strcmp:entry
    {
        printf("name:%s, hash:%s", stringof(arg0),
            stringof(arg1));
    }

Here are the results of our next attempt to load our driver:

 # ./mod_getsysnum.d    
    dtrace: script './mod_getsysnum.d' matched 35751 probes
    CPU FUNCTION
      0  -> mod_getsysnum
      0    -> find_mbind
      0     -> nm_hash
      0     <- nm_hash                                          41
      0     -> strcmp
      0      | strcmp:entry            name:smbfs,
                        hash:timer_getoverrun
      0     <- strcmp                                   4294967295
      0     -> strcmp
      0      | strcmp:entry            name:smbfs, 
                        hash:lwp_sema_post
      0     <- strcmp                                            7
      0    <- find_mbind                                          0
      0  <- mod_getsysnum                                4294967295

So we're looking for smbfs in a hash table, and it's not present. How does smbfs get into this hash table? Let's return to mod_getsysnum() and observe that the hash table variable sb_hashtab is passed to the failing find_mbind() call.

A quick search of the source code reveals that sb_hashtab is initialized with a call to read_binding_file(), which takes as its arguments a config file, the hash table, and a function pointer. A few more clicks in our source code browser reveal that the config file is defined as /etc/name_to_sysnum in the file /usr/src/uts/common/os/modctl.c. It looks like we forgot to include a configuration entry for our driver. Add the following to the /etc/name_to_sysnum file and reboot.

'smbfs            177'
    (The binding file is read once, at boot time, by read_binding_file().)
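
The effect of the missing entry can be reproduced in user space with a small chained hash table. This is only a sketch of the general technique: struct bind_sketch, name_hash(), bind_insert(), bind_find(), and bind_lookup_num() are hypothetical names, the hash function is an arbitrary stand-in rather than the kernel's nm_hash(), and the table size is an assumption. The lookup hashes the name to a bucket, walks the chain with strcmp(), and reports -1 when the name was never inserted.

```c
#include <stddef.h>
#include <string.h>

#define	SK_NBUCKETS	64	/* hypothetical table size */

/* Hypothetical name-to-number binding, chained per hash bucket. */
struct bind_sketch {
	const char *b_name;
	int b_num;
	struct bind_sketch *b_next;
};

/* Arbitrary string hash; the kernel's nm_hash() may differ. */
unsigned
name_hash(const char *name)
{
	unsigned h = 0;

	while (*name != '\0')
		h = h * 31 + (unsigned char)*name++;
	return (h % SK_NBUCKETS);
}

/* Add a binding at the head of its bucket's chain. */
void
bind_insert(struct bind_sketch **tab, struct bind_sketch *b)
{
	unsigned idx = name_hash(b->b_name);

	b->b_next = tab[idx];
	tab[idx] = b;
}

/* Walk one bucket's chain; NULL means the name is not bound. */
struct bind_sketch *
bind_find(struct bind_sketch **tab, const char *name)
{
	struct bind_sketch *b;

	for (b = tab[name_hash(name)]; b != NULL; b = b->b_next) {
		if (strcmp(name, b->b_name) == 0)
			return (b);
	}
	return (NULL);
}

/* Return the bound number, or -1 when the name is missing. */
int
bind_lookup_num(struct bind_sketch **tab, const char *name)
{
	struct bind_sketch *b = bind_find(tab, name);

	return (b != NULL ? b->b_num : -1);
}
```

Before an smbfs binding is inserted, bind_lookup_num(tab, "smbfs") returns -1. After a binding with number 177 is inserted, the same call returns 177, which mirrors what happens once the entry is present in /etc/name_to_sysnum at boot.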

After rebooting, the driver can be loaded successfully.

# modload /usr/kernel/fs/smbfs

Verify that the driver is loaded with the modinfo command:

 # modinfo | grep smbfs
    160 feb21a58  351ac 177      1  smbfs (SMBFS syscall,client,comm)
    160 feb21a58  351ac  24      1  smbfs (network filesystem)
    160 feb21a58  351ac  25      1  smbfs (network filesystem version 2)
    160 feb21a58  351ac  26      1  smbfs (network filesystem version 3)

Note - Remember that this driver was based on an nfs template, which explains this output.


Let's make sure we can also unload the module:

# modunload -i 160
    can't unload the module: Device busy

This is most likely due to an EBUSY errno return value. But now, since the smbfs driver is a loaded module, we have access to all of the smbfs functions:

 # dtrace -l -n 'fbt:smbfs::' | wc -l
    1002

This is amazing! Without any special coding, we now have access to about a thousand entry and return probes in the driver (the count above includes one header line). These probes allow us to debug our work without a special 'instrumented code' version of the driver! Let's monitor all smbfs calls when modunload is called, using this simple DTrace script:

#!/usr/sbin/dtrace -s

    #pragma D option flowindent

    fbt:smbfs::entry
    {
    }

    fbt:smbfs::return
    {
        trace(arg1);
    }

Running this script while modunload executes produces no output, so it seems that the smbfs code is not being accessed by modunload. Let's use DTrace to look at modunload itself with this script:

#!/usr/sbin/dtrace -s

    #pragma D option flowindent

    fbt::modunload:entry
    {
        self->follow = 1;
        trace(execname);
        trace(arg0);
    }

    fbt::modunload:return
    {
        self->follow = 0;
        trace(arg1);
    }

    fbt:::entry
    /self->follow/
    {
    }

    fbt:::return
    /self->follow/
    {
        trace(arg1);
    }


Here's the output of this script:

    # ./modunload.d
    dtrace: script './modunload.d' matched 36695 probes
    CPU FUNCTION
      0  -> modunload                modunload             160
      0   | modunload:entry
      0    -> mod_hold_by_id
      0     -> mod_circdep
      0     <- mod_circdep                                       0
      0     -> mod_hold_by_modctl
      0     <- mod_hold_by_modctl                                0
      0    <- mod_hold_by_id                             3602566648
      0    -> moduninstall
      0    <- moduninstall                                       16
      0    -> mod_release_mod
      0     -> mod_release
      0     <- mod_release                              3602566648
      0    <- mod_release_mod                            3602566648
      0  <- modunload                                            16
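
A traced return value such as 16 can be decoded with a few lines of user-space C. This is a minimal sketch: errno_symbol() and errno_text() are hypothetical helper names, the symbol table covers only a handful of values, and the numeric values come from the errno.h of the system where the sketch is compiled, which is assumed to agree with the traced kernel (EBUSY is 16 on Solaris and on most other UNIX-like systems).

```c
#include <errno.h>
#include <string.h>

/* Map a few errno values to their symbolic names. */
const char *
errno_symbol(int err)
{
	switch (err) {
	case 0:
		return ("0 (success)");
	case ENOENT:
		return ("ENOENT");
	case EBUSY:
		return ("EBUSY");
	case EINVAL:
		return ("EINVAL");
	default:
		return ("(not covered by this sketch)");
	}
}

/* Human-readable message for an errno value, from the C library. */
const char *
errno_text(int err)
{
	return (strerror(err));
}
```

Calling errno_symbol(16) yields "EBUSY", and errno_text(16) yields the platform's own wording of the "Device busy" message that modunload printed.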

Observe that the EBUSY return value '16' is coming from moduninstall(). Let's take a look at the source code for moduninstall(). The function returns EBUSY in a few locations, so let's look at the following possibilities:

  1. if (mp->mod_prim || mp->mod_ref || mp->mod_nenabled != 0) return (EBUSY);

  2. if ( detach_driver(mp->mod_modname) != 0 ) return (EBUSY);

  3. if ( kobj_lookup(mp->mod_mp, "_fini") == NULL )

  4. A failed call to the smbfs _fini() routine

We can't directly observe all of these possibilities, but we can approach them by process of elimination. We'll use the following script to display the contents of the various structures and return values in moduninstall():

#!/usr/sbin/dtrace -s

    #pragma D option flowindent

    fbt::moduninstall:entry
    {
        self->follow = 1;
        printf("mod_prim:%d\n",
            ((struct modctl *)arg0)->mod_prim);
        printf("mod_ref:%d\n",
            ((struct modctl *)arg0)->mod_ref);
        printf("mod_nenabled:%d\n",
            ((struct modctl *)arg0)->mod_nenabled);
        printf("mod_loadflags:%d\n",
            ((struct modctl *)arg0)->mod_loadflags);
    }

    fbt::moduninstall:return
    {
        self->follow = 0;
        trace(arg1);
    }

    fbt::kobj_lookup:entry
    /self->follow/
    {
    }

    fbt::kobj_lookup:return
    /self->follow/
    {
        trace(arg1);
    }

    fbt::detach_driver:entry
    /self->follow/
    {
    }

    fbt::detach_driver:return
    /self->follow/
    {
        trace(arg1);
    }


This script produces the following output:

    # ./moduninstall.d
    dtrace: script './moduninstall.d' matched 6 probes
    CPU FUNCTION
      0  -> moduninstall
        mod_prim:0
        mod_ref:0
        mod_nenabled:0
        mod_loadflags:1
      0    -> detach_driver
      0    <- detach_driver                                       0
      0    -> kobj_lookup
      0    <- kobj_lookup                                4273103456
      0  <- moduninstall                                         16

Comparing this output to the code tells us that the failure is not due to the mp structure values or the return values from detach_driver() or kobj_lookup(). Thus, by a process of elimination, it must be the status returned via the status = (*func)(); call, which calls the smbfs _fini() routine. And here's what the smbfs _fini() routine contains:

int _fini(void)
    {
        /* don't allow module to be unloaded */
        return (EBUSY);
    }

Changing the return value to '0' and recompiling the code results in a driver that we can now load and unload, which completes the objectives of this exercise. We've used the Function Boundary Tracing provider exclusively in these examples, but note that fbt is only one of DTrace's many providers.
