Ideas, and Code I write about code and ideas that eventually result in code. Expect a lot of low-level code mixed with various miscellaneous learnings and ideas in the technology world. My opinions are exclusively my own and do not represent those of my employer. Wed, 09 Aug 2017 23:57:04 -0700 Hello,! <p>Ideas, and Code is <a href="">now on</a>!</p> <p>I intend to stream a variety of things, from coding and other content similar to this blog, to video games, to music and audio. And anything else! The sky is the limit :-)</p> <p>The first stream has already completed and the recording of the broadcast can be found <a href="">here</a>!</p> Wed, 09 Aug 2017 13:00:00 -0700 Victor Engine Progress (December video) <p>Another video of the state of Victor:</p> <iframe width="560" height="315" src="" frameborder="0" allowfullscreen=""></iframe> <p>This demonstrates some of the improvements I’ve made. Recent work since this video has been primarily focused on building a voxel geometry demonstration and implementing some performance improvements.</p> <h2 id="nanogui-support">nanogui support</h2> <p><code class="highlighter-rouge">nanogui</code> is a great little GUI framework that integrates nicely with OpenGL. It’s also quite extensible, which has let me build more complex user interfaces out of its core set of widgets.</p> <h2 id="effects">Effects</h2> <p>I’ve supported a number of effects in Victor for a while, but this video shows several of them in action, including SSAO (screen-space ambient occlusion), motion blur, and a vignette effect.</p> <h2 id="pbr">PBR</h2> <p>The last post showed a PBR-specific video, while this video shows the PBR functioning in an existing environment.</p> <p>It’s worth noting that since this video there have been further changes to the PBR support in Victor. 
Some of these are still in progress, particularly related to physically-correct lighting.</p> <h2 id="render-targets">Render Targets</h2> <p>Render targets are well-supported now, and are shown in the video by rendering an entirely new scene and showing it embedded within a nanogui window.</p> <p>I’m using these for a proof-of-concept editor that allows editing materials with a live demonstration of any changes made. It’s still in progress and the latest work to fix performance and add voxel support has broken it a bit, but once things are working properly I’ll be able to upload a video of it in action.</p> <h2 id="parallax-mapping">Parallax Mapping</h2> <p>The last video didn’t show this at all, but it’s shown in this video (though it may be difficult to see!) - the bricks on the “ground” are all parallax mapped to give an illusion of depth without requiring the extra geometry.</p> <p>The parallax mapping can be seen more obviously in the render target view, where the parallax effect stretches around the sphere.</p> <h2 id="whats-happening-now">What’s happening now?</h2> <p>My focus at the moment in Victor is primarily:</p> <ul> <li>a voxel demonstration with modifiable geometry</li> <li>performance improvements; visibility culling was a great start, but Victor needs culling via something like an octree to really start performing nicely on complex scenes (right now it still has to manipulate every object in a scene to perform the visibility cull, whereas with an octree a large number of objects could be skipped).</li> <li>fixes to PBR lighting and shadowing, which are currently a little broken</li> </ul> <p>This might take a while (months), but I’m aiming to add more videos of some of these features as they mature. 
The editor proof-of-concept is also something that I’m pretty excited about, so once that’s cleaned up a bit and has some key problems fixed I’ll be able to show that off too.</p> Sat, 17 Jun 2017 13:00:00 -0700 Victor Engine PBR Demo <p>I’m working on a game engine (the “Victor” engine), and I uploaded a video showing its physically-based rendering support.</p> <p>There are a few issues with it that I’m still working on, but for the time being you can check out how it looks below.</p> <iframe width="560" height="315" src="" frameborder="0" allowfullscreen=""></iframe> Mon, 12 Dec 2016 15:00:00 -0800 Welcome to <p>I’ve moved this blog again from to</p> <p>I used this as an opportunity to play with Jekyll and so far the experience has been positive.</p> <p>I’m working on bringing across old posts that I think have significant relevance to this site. Once that’s done, I intend to remove the old blog(s) and update links to point here.</p> Mon, 15 Feb 2016 15:48:53 -0800 Pedigree Ports Build System <p>I’ve done some work recently to put together a new build system for Pedigree’s ports, where dependencies are first-class citizens and various other modes of operation can be used. For example, it’s now fairly trivial to just dump the commands that would be run into a file.</p> <p>I hope to discuss the system more once it’s a little more tested, but for now, I’ll leave you with the latest SVG (will update after each build!) of the (build - not necessarily installation) dependency tree for all Pedigree ports.</p> <p><strong><a href="">Find the dependency SVG here!</a></strong></p> Mon, 17 Aug 2015 05:01:00 -0700 Pedigree: Progress Update & Python Debugging Post-Mortem <p>My last post on this blog covered the work so far on memory mapped files. There has been quite a bit of progress since then in this area:</p> <ul> <li>Memory mapped file cleanup works as expected. 
Programs can remove their memory mappings at runtime, and this will be successful - including ‘punching holes’ in mappings.</li> <li>Remapping an area with different permissions is now possible. The dynamic linker uses this to map memory as required for the type of segment it is loading - for example, executable, or writeable. This means it is no longer possible to execute data as code on Pedigree, on hardware that supports enforcing this.</li> <li>Anonymous memory maps are mapped to a single zeroed page, copy-on-write. Programs that never write to an anonymous memory map are therefore significantly lighter in physical memory.</li> <li>The virtual address space available for memory maps on 64-bit builds of Pedigree is now significantly larger.</li> </ul> <p>Other significant changes since the last post include:</p> <ul> <li>Implemented a VGA text-mode VT100 emulator and the necessary fallbacks for systems that cannot support a graphics mode for Pedigree. This significantly improves the experience on such systems.</li> <li>Pseudo-terminal support has improved substantially, such that the <a href="">‘Dropbear’ SSH server</a> runs and accepts numerous connections, without nasty error messages.</li> <li>POSIX job control is functional.</li> <li>I have successfully used Pedigree on my old Eee PC with a USB Mass Storage device as the root filesystem; writing files on Pedigree using Vim and reading them on a Linux system.</li> <li>The build system now uses GCC 4.8.2 and Binutils 2.24.</li> <li>Pedigree is now only 64-bit when targeting x86 hardware, in order to reduce development complexity and to acknowledge the fact that very few modern systems are 32-bit-only anymore.</li> </ul> <p>Of particular interest has been the switch to 64-bit-only when targeting x86. The following is a post-mortem from a particularly interesting side-effect of this.</p> <p>–</p> <p>Python has been a supported port in Pedigree for quite a while. 
Python entered the tree proper <a href="">in 2009</a>, version 2.5. The process of building Python for Pedigree, and the lessons learned along the way, led to the creation of the <a href="">Porting Python page</a> on the wiki. Suffice it to say, this is a port that has great significance to the project. Our build system (<a href="">SCons</a>) also uses Python, so it is critical to support Python in order to achieve the goal of building Pedigree on Pedigree. Recently I noticed that Python was consistently hitting a segmentation fault during its startup. Noting that this is probably not a great state for the Python port to be in, I decided to take a closer look.</p> <p>All code is from Python 2.7.3.</p> <p>The problem lies in moving from 32-bit to 64-bit; I am sure by now many readers will have identified precisely what the problem is, or will do so within the first paragraph or two of reading. Read on and find out if you are correct! :-)</p> <p>The first order of business was to get an idea as to where the problem was taking place. I rebuilt Python making sure that <code class="highlighter-rouge">-fno-omit-frame-pointer</code> was in the build flags, so the Pedigree kernel debugger could do a trivial backtrace for me. I removed the code that only enters the kernel debugger when a kernel thread crashes (normally, it makes more sense for a <code class="highlighter-rouge">SIGSEGV</code> to be sent to the faulting process - but I needed more debugging flexibility to fix this). 
I managed to get a backtrace and discovered that the process was crashing within the <code class="highlighter-rouge">_PySys_Init</code> function.</p> <p>With a disassembly of the Python binary in hand, and the source code available, I quickly identified that the problem line was:</p> <div class="highlighter-rouge"><pre class="highlight"><code><span class="n">PyDict_SetItemString</span><span class="p">(</span><span class="n">sysdict</span><span class="p">,</span> <span class="s">"__displayhook__"</span><span class="p">,</span> <span class="n">PyDict_GetItemString</span><span class="p">(</span><span class="n">sysdict</span><span class="p">,</span> <span class="s">"displayhook"</span><span class="p">));</span> </code></pre> </div> <p>Okay, so it turns out that somehow, the <code class="highlighter-rouge">sys</code> module’s dictionary of attributes, methods, and documentation is returning a ‘not found’. This is bad! The question is, why is the lookup failing?</p> <p>I ended up having to trace through the source with breakpoints and disassembly, which took a good 5-6 man-hours to complete. I reached a point where I could no longer isolate the issue, and realised I needed something a bit heavier than Pedigree’s built-in debugging tools. The <a href="">QEMU emulator</a> provides a GDB stub, which is perfect for debugging this kind of thing.</p> <p>I also settled on GDB after a number of test runs where I ended up inspecting the raw contents of RAM to decipher the problem at hand - while this is helpful for learning a lot about how Python data structures work and how they look, it is nowhere near a sustainable solution for debugging a problem like this.</p> <p>I linked a local GDB up to the Python binary, with a <code class="highlighter-rouge">.gdbinit</code> file that made sure to transform the file paths the binary held within it so GDB could show me source references while running. 
The file looks a little like this:</p> <div class="highlighter-rouge"><pre class="highlight"><code>file images/local/applications/python2.7 target remote localhost:1234 directory /find/source/in/misc/src/Python-2.7.3 set substitute-path /path/to/builds/build-python27-2.7.3/build/.. /real/path/to/misc/src/Python-2.7.3 set filename-display absolute set disassemble-next-line on break *0x4fcd42 </code></pre> </div> <p>The breakpoint on the final line is set to the line of code shown above.</p> <p>The key to the <code class="highlighter-rouge">.gdbinit</code> file is that it essentially specifies a list of GDB commands to run before GDB is interactive. This saves a huge amount of time when doing the same general debug process repeatedly. So the stage is set, the actors are ready!</p> <p>Up comes Pedigree, up comes GDB, everything is connected and functioning correctly. QEMU hits the breakpoint address and hands off control to GDB. At this point, I am able to print the value of various variables in scope and start tracing. First of all, I check the sysdict dictionary to make sure it actually has items…</p> <div class="highlighter-rouge"><pre class="highlight"><code>&gt; print sysdict.ma_used </code></pre> </div> <p>(number greater than zero)</p> <p>Okay, so there’s items in the dictionary. Excellent. I’ll confess at this point I became a little bit excited - I hadn’t used GDB with QEMU before, and I hadn’t realised that it would be exactly the same as debugging any other userspace application. The entire toolset is at my fingertips.</p> <p>So I trace, stepping through function after function, nested deeply. Fortunately GDB has the <code class="highlighter-rouge">finish</code> command - which basically continues execution until the current function is about to return. Many functions included things like allocating memory, interning strings, and creating Python objects. 
Jumping to the end and seeing each of these functions completed successfully indicated the issue was not in any of these particular areas of the Python source tree.</p> <p>Finally, after much stepping and moving through the call tree, I ended up at the <code class="highlighter-rouge">PyDict_GetItem</code> function. Excellent - I know I’m close now!</p> <p>I’ll confess, as soon as I saw the source dump for this function I had a bit of an a-ha moment; the first line of the function is:</p> <div class="highlighter-rouge"><pre class="highlight"><code><span class="kt">long</span> <span class="n">hash</span><span class="p">;</span> </code></pre> </div> <p>From my previous memory dumping and traversing the Python codebase, I happened to have an awareness that dictionary objects use the type <code class="highlighter-rouge">Py_ssize_t</code> for their hashes. This is defined as <code class="highlighter-rouge">ssize_t</code> normally, which is fine on most systems. I had a hunch at this point, but I continued stepping - I wanted conclusive evidence before I left the GDB session and identified a fix.</p> <p>The next few steps essentially involved tracing until finding something along the lines of:</p> <div class="highlighter-rouge"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">ep</span><span class="o">-&gt;</span><span class="n">me_hash</span> <span class="o">==</span> <span class="n">hash</span><span class="p">)</span> <span class="p">{</span> </code></pre> </div> <p>Okay, GDB, do your best!</p> <div class="highlighter-rouge"><pre class="highlight"><code>&gt; print ep-&gt;me_hash -12345678 &gt; print hash -32112774748828 </code></pre> </div> <p>Oh dear.</p> <p>I aborted the GDB session here, closed QEMU, and ran a quick test to see what the actual size of Pedigree’s <code class="highlighter-rouge">ssize_t</code> on 64-bit is… and discovered that it is in fact only 4 bytes (where <code class="highlighter-rouge">size_t</code> is 
8 bytes). Of course, a <code class="highlighter-rouge">long</code> on 64-bit is a full 8-byte integer. Matching the hash would be a true fluke; the dictionary lookup could never succeed.</p> <p><a href="">The problem has now been fixed</a> and Python now runs perfectly well on 64-bit systems. Python checks the size of <code class="highlighter-rouge">size_t</code> in its configure script but not the signed variant; nor should it need to - the two types should be the same size. Even so, <code class="highlighter-rouge">PyObject_Hash</code> returns a long; there is a comment to this effect in Python’s <code class="highlighter-rouge">dictobject.h</code>:</p> <div class="highlighter-rouge"><pre class="highlight"><code><span class="cm">/* Cached hash code of me_key.  Note that hash codes are C longs. * We have to use Py_ssize_t instead because dict_popitem() abuses * me_hash to hold a search finger. */</span> <span class="n">Py_ssize_t</span> <span class="n">me_hash</span><span class="p">;</span> </code></pre> </div> <p>I have not yet checked whether this is resolved in newer Python.</p> <p>It’s nice to be able to run Python code again in Pedigree. :-)</p> Sun, 25 May 2014 03:52:00 -0700 Memory Mapped Files in Pedigree <p>Well, it’s needed to happen for quite some time, and now I have finally begun the Great Memory Mapped File Overhaul of 2013 in Pedigree!</p> <p>The past memory mapped file support, whilst excellent and very functional, was very file-oriented and particularly complicated to make work for anonymous memory mappings or things that didn’t quite back onto a real file. I have tried to make <a href="">the new interface</a> support both anonymous and file-oriented mappings, and this is particularly helpful as anonymous memory maps in Pedigree now use a shared ‘zero’ page (as per conventional operating system kernel design). 
Previously, anonymous memory maps were mapped and allocated in full at the time of mapping.</p> <p>Rewind there a moment - you’re asking what on earth these mapping types are?</p> <ul> <li>An anonymous memory map is a mapping that is not backed by any file or disk storage. That is, it is purely within memory, and this memory is conventionally zeroed. This is often a very quick and easy way for an application to get a hold of a large amount of already-zeroed memory that will only be paged in when it is needed. With madvise() and other such system calls, the application can even inform the operating system that it is done with pages for now, allowing that memory to be released until the application traps and pages in physical pages again. Most modern userspace heap allocators use anonymous memory maps for large allocations. The term ‘anonymous’ refers to the fact that the mapping is not linked to a file.</li> <li>File-backed memory mappings are mappings that are backed by file/disk. A trap on a file-backed memory mapping brings the data in the file into memory, and if the mapping is created to be shared, writes to the memory region may be written back to the backing file. For regions of zeroed memory, where anonymous memory maps are not used (or unavailable), /dev/zero or /dev/null can be used. Memory mapped files can be particularly useful in this case to avoid the overhead of memory copies during I/O. For example, a web server might memory map the files it is serving, and pass the memory mapped region to I/O system calls directly, rather than keep the files in a local heap buffer.</li> </ul> <p>For both memory map types, there are circumstances where they can each be paged out of the physical address space, freeing memory. 
Pedigree has recently had a great deal of work done to make caches compact when the system runs out of memory - the new memory mapped files should also make it possible to free up some memory by cleaning up memory mappings.</p> <p>An example of such a circumstance might be a file that has been mapped, read, and written to. The kernel can force a memory sync back to disk on the region (ideally at a ‘high water mark’ rather than actually during an out-of-memory situation), and then start freeing physical memory in an LRU (least-recently-used) fashion. Because memory mapped files pin pages they use in the caching subsystem, being able to force a sync and release of this pin is very, very handy, and was not trivial to complete in the previous implementation.</p> <p>The latest memory mapping work has now made it possible to create a /dev/fb device for graphics, rather than requiring the use of native system calls to get a hold of a framebuffer. This is particularly useful for other developers interested in developing desktop environments for Pedigree, as it should soon be possible to expect a reasonable amount of Linux compatibility on the /dev/fb device. Once this is implemented, it will be possible to write a desktop environment for Pedigree (in combination with the rudimentary UNIX datagram socket support) that can be tested on a Linux environment, and be immediately portable to Pedigree. 
Ideally, this would greatly improve the testing cycle time for these userspace elements.</p> <p>The new support still needs some work, of course:</p> <ul> <li>The window manager currently aborts inside dlmalloc, potentially due to a failed mmap.</li> <li>Support for mmap flags is minimal at best, and this definitely needs to be resolved.</li> <li>Cleanup is relatively untested and may be leaky - the true test will be running an application multiple times and confirming that the memory usage on the system does not grow.</li> <li>No syncing of shared file mappings is done yet, and was not done in the past implementation either. This is not a huge deal, but it would be very nice to be able to trigger a write back on the memory region (assuming it had actually changed). Upon write back completion, the pages can be mapped read-only back to the file’s backing cache (rather than copies made during writes) - great for memory usage!</li> </ul> <p>Hopefully the work done here will also be of particular use in eventually implementing a swap subsystem for Pedigree, which can be used to free up memory by writing it back to disk. There are certainly a number of processes which would benefit from this, with pages that barely get touched during their execution. This would also be a great step towards being able to support suspend-to-disk (ie, hibernation) in the future.</p> <p>So there’s still much to be done, but there’s much being done as well - Pedigree is moving forwards. Once this memory mapping work is completed, things should be looking good to ramp up to another Unstable release for testing. So that’s exciting :)</p> <p>Stay tuned!</p> Sun, 15 Dec 2013 03:29:00 -0800 Recent Work <p>It has been quite a while (over a year, in fact) since the last post here. 
In that time, quite a bit has been done.</p> <p>FORGE ( has more or less been halted; I need to eventually get back around to implementing HPET support, and to get local APIC timers calibrated to a sensible timing source.</p> <p>Pedigree (, on the other hand, has moved forward quite rapidly. A general overview, likely with many gaps, is as follows:</p> <ul> <li>A new window manager has been implemented, based loosely on the i3 window manager and fully tiling.</li> <li>A port of Mesa has been completed with pure-software rendering.</li> <li>A new userspace dynamic linker has been written, which has also improved support for memory mapped files across Pedigree. Read-only code and data in executables will only be loaded into memory once and shared across every memory mapped file that references it. Writeable regions (eg, .data section) are mapped with copy-on-write. This improves memory usage and also performance, especially when programs are run multiple times.</li> <li>Pseudoterminal support has been added, and the existing text user interface has been updated to use pseudoterminals. A port of the ‘dropbear’ SSH client and server has been successful, and has been demonstrated to be usable.</li> <li>A number of TCP bugs have been resolved, including a bug where TCP segments would be provided to the userspace application out-of-order, and also issues where TCP connection termination would fail. 
TCP is generally more reliable now.</li> <li>The VFS subsystem has been extended to support the memory mapped file changes above.</li> <li>The cache subsystem has been extended and updated to better handle out-of-memory conditions, and to perform writebacks as necessary.</li> <li>A number of bugs have been fixed in the FAT filesystem driver (including one where an off-by-one error would cause long filename entries to be duplicated if the filename was precisely 13 characters long), which now makes it possible for data written to the disk to persist across reboots.</li> <li>Copy-on-write for address space cloning has been implemented, which has greatly improved efficiency in the typical cases where a clone takes place (a fork followed by running a new program immediately).</li> <li>A new preload daemon has been added, which brings commonly-used files into cache at startup, making initial use of the system faster. For example, ‘bash’ starts significantly faster because it is brought in from disk while the user is typing their username and password. Interestingly, it is rare to be able to type a username and password fast enough to beat the preload daemon to loading the shell.</li> <li>The build system has been updated slightly to be more functional when run on Pedigree.</li> </ul> <p>The final point is of particular interest - it is now possible to build Pedigree on Pedigree, or at least the kernel and initrd. 
This means it will soon be viable to do development work on Pedigree.</p> <p>A general roadmap before the next release of Pedigree looks much like this (subject to change!):</p> <ul> <li>Build the kernel and initrd on Pedigree, and reboot into the new kernel.</li> <li>Provide a way to terminate windows in the window manager, and remove them from the display.</li> <li>Provide a way to restart the window manager (this would be especially useful for making changes to the window manager).</li> <li>Resolve issues labelled with the “Unstable 0.1.3” milestone.</li> </ul> <p>It is expected that publicly available disk images and ISOs for testing Pedigree will be offered with a minimal set of software that has been well-tested and proven to work reliably. Some software is very exciting (for example, a port of the NetSurf browser), but crashing software reflects badly on the product as a whole.</p> <p>Pedigree has improved drastically in the past 6-8 months, and it is exciting to see the progress, and to consider where it will go next. Being able to use Pedigree as a development platform for future releases of Pedigree has been a goal for quite some time, and I personally consider the ability to do so a very good indicator of the stability and usability of a system.</p> <p>Additionally, I have been working on a few other projects, such as a small kernel in the Rust programming language (, and contributions to Rust itself. I have also published on GitHub the C unit test framework I put together for FORGE. My profile is at</p> <p>I’m hoping to keep this blog more up-to-date as more work continues.</p> Sat, 02 Nov 2013 03:24:00 -0700 FORGE: Clang and Buildbot <p>One of my personal goals with the <a href="">FORGE Operating System</a> is portability, both across architectures and platforms, and across toolchains. 
This goal ensures the code written remains at a reasonable quality level, and little things like “quickfix inline assembly” snippets are definitely not allowed.</p> <p>I recently updated the build system for FORGE to enable Clang and the LLVM tools as a usable toolchain for compiling FORGE. Aside from the fact that GNU binutils is still necessary (as there is not yet an LLVM replacement for GNU ld), this means the intermediate steps of FORGE’s compilation all compile to LLVM bitcode, which is converted into assembly code and linked at the very end of the compilation.</p> <p>More on clang + LLVM later though - it’s still a work in progress as I try to get the most out of the toolchain. The ‘end goal’ is to be completely free of gcc/binutils in development.</p> <p>I have also now set up a <a href="">buildbot for FORGE</a> which shows the status of each variant and target in the build system. This has let me see at a glance that I have recently broken the ARM builds, and that all X86 builds are building happily at the moment. Perhaps in the future the BuildBot can also generate nightly ISOs that can be downloaded for testing at the ‘bleeding edge’, once FORGE is somewhat usable as a general purpose operating system.</p> <p>The BuildBot automatically builds all of these targets after each set of commits, allowing immediate feedback on whether a particular change has broken another target (eg, a change in X86 that is not compatible with ARM). This kind of continuous feedback is excellent.</p> <p>The BuildBot has an IRC bot in the FORGE IRC channel on, #forgeos.</p> Tue, 18 Sep 2012 21:48:00 -0700 FORGE Scheduling <p><strong>Update: the project FORGE in this post refers to the <a href="">FORGE Operating System</a>. 
Also, when discussing feedback schedulers, note that the type of feedback discussed (thread pre-emption) is not the only possible method for feedback - just the one I’ve selected for this post.</strong></p> <p><strong>Update #2: please note that this all changes when more than one core/CPU can run threads, as various additional factors exist (CPU loads, caches, NUMA domains, etc…) that make scheduling more complex. This blog post is primarily written within the context of a uniprocessor system.</strong></p> <p>The topic of scheduling in operating system theory is one which can easily fill a chapter, if not several, of a textbook. Optimally scheduling the various tasks running during the course of normal operation is a unique challenge and one which has many solutions. One must consider the type of tasks being run, their overheads, the overhead of switching threads and processes, and various other factors. Furthermore, scheduling may change between operating system types. A real-time operating system will have different scheduling semantics to, say, a general-purpose operating system.</p> <p>I will not discuss co-operative scheduling here; this is all discussion around a pre-emptive scheduler. That is, a scheduler where threads are given a certain amount of time to run (their timeslice), and are interrupted if they exceed this time.</p> <p>In FORGE, the current scheduler is simply a round-robin scheduler that switches between threads and doesn’t care about any metadata. This means that every task in the system runs at an equal priority, and also means that threads in the same process are not ‘grouped’.</p> <p>Let’s take a quick break from discussing round-robin scheduling to discuss this grouping of threads by process, and why it is important in scheduling.</p> <p>Conceptually, a process can be thought of as an enclosing environment around a set of threads. 
This environment typically includes an address space, any data shared across threads, and metadata about the actual process itself. For the purposes of this detour, we will consider a process and address space to have a 1:1 relationship. In an environment where address space switches are cost-free (ie, every process runs in the same address space), scheduling threads out-of-order is fine. The threads from various processes can be switched to and from at will, without worrying about costly address space transitions. Consider however the case where an address space switch is in fact a costly operation (as it should be expected to be on most modern architectures): switching between a number of threads from different processes will induce a costly address space switch each time.</p> <p>Now, consider the following - three threads, two processes. Threads one and three are linked to process one. Thread two is linked to process two. Switching from thread one to two requires an address space switch into process two’s address space. Then, switching from two to three requires yet another address space switch back into process one’s address space. It would be far more sensible to queue threads from the same process together.</p> <p>Back to round-robin scheduling.</p> <p>We essentially have two first-in-first-out queues: the ‘ready’ queue, and the ‘already’ queue. Threads ready to be scheduled are queued in the ready queue, and threads that have already been scheduled are queued in the already queue. When the ready queue runs out of items, the two queues are swapped (the already queue becomes the ready queue, and vice versa). This works excellently as an ‘initial’ testing algorithm for an operating system, but does not offer any priorities or have any sort of scheduling heuristics.</p> <p>To add priorities, it is possible to create a list of queues, and there are various other improvements that can be made to round-robin scheduling as well. 
In FORGE however, I have decided to take the current round-robin scheduler and replace it with a ‘feedback’ scheduler. This is a very powerful scheduler type that can dynamically respond to the changing requirements of the system as time goes on. Consider a general purpose operating system under a reasonable load. There are a number of threads in a number of processes, and each is performing various tasks. Some threads are performing heavy I/O, while others are heavily using the CPU. Others still are mixing the two. This variety in workloads is very common and also very difficult to handle ‘well’.</p> <p>A feedback scheduler works off the feedback from the threads it runs (hence the name). If a thread is interrupted by the operating system upon completing its timeslice, the operating system can generally assume the thread is using the CPU quite a bit. If however a thread gives up its timeslice to the operating system (by yielding, or implicitly by performing I/O), the operating system can determine that the thread is probably more I/O heavy. In a sense, the feedback the threads give to the scheduler allows the scheduler to determine if the threads are using the resources effectively.</p> <p>So how does a feedback scheduler respond to the feedback? Well, if we consider that each thread has a priority (and a base priority, which is its ‘default’ and starting priority), the scheduler can both punish threads that use too much CPU and reward threads that do not. So, if a thread completes its timeslice before being rescheduled, the scheduler reduces its priority. If a thread does not complete its timeslice before being rescheduled, the scheduler increases its priority. This balances the system by prioritising I/O tasks (which end up blocking while the CPU-heavy tasks do their thing). 
Then, if a CPU-heavy task needs to block and wait for I/O, its priority can again be increased, and the system remains responsive and usable even during this common workload.</p> <p>Conceptually this is very easy to understand, but the actual implementation in code (or, perhaps, data structures) is a challenge. At this stage, this has not been implemented in FORGE, so no code is available for demonstration. However, the planned implementation will consist of a combination of binary trees and first-in-first-out queues. Remember the earlier discussion about grouping threads by process; by using a binary tree containing each <em>process</em> ID, we can traverse the tree completely and then load the queue of threads on each node (ie, process). This works for selecting the next thread to schedule.</p> <p>Now, for the priorities, a simple array suffices, with the index being the priority.</p> <p>So, for scheduling, we end up with an array, containing binary trees for each priority level, which then contain queues for each process.</p> <p>This enables a reasonably efficient lookup (assuming iterators are sensible for trees, and that lookup costs and the like are negligible for linked lists), groups threads by process, and enables the feedback system to work correctly. The outcome is a reasonably well-balanced system with priority given to I/O.</p> <p>As mentioned previously, scheduler design is a unique challenge, and this particular type of scheduler is not necessarily the ideal solution for, say, a real-time operating system. For reference, <a href="">a similar design is used in OS X, FreeBSD, NetBSD, and Windows NT-based operating systems</a>. Perhaps at a later date I will investigate and discuss an O(1) scheduler type. I’ll blog again later with the intricate details of actually implementing this algorithm.</p> Sun, 02 Sep 2012 19:05:00 -0700