
STANFORD UNIVERSITY

INFORMATION TECHNOLOGY SERVICES

xlong/xshort data-types - another solution to the "endianness" problem.

by Richard L Guertin


XLONG/XSHORT data types handle big-endian data AS BIG-ENDIAN DATA on both big-endian AND little-endian machines, without having to resort to functions like "htonl" or "ntohs".



Several years ago I wrote an Emulator in C (gcc) that executes a big-endian IBM mainframe program by reading in a binary image of the program and using it like a script.  Each IBM instruction is interpretively executed.


This worked great on big-endian platforms, such as "solaris" or "PowerPC OSX".  However, along came little-endian machines: Linux(tm) operating systems on Intel(tm) chips.  It was easy to adapt the basic Emulator that executed the fully-interpreted programs, but I had gone a step further and had begun introducing TRAP-logic which made use of C-struct definitions that overlaid the program's DSECTS.  The big-endian mainframe program still had to be emulated in its big-endian fashion, but now the Emulator was running on a little-endian machine, and its TRAP-logic referenced data created by the big-endian mainframe program.


The problem was that there were over 22,000 lines of C code involved in my TRAP-logic, and having to deal with little-endian-ness across that much code was a daunting challenge.


My solution was to switch from C to C++ (g++).  But the big breakthrough was the introduction of "xlong" and "xshort" user-defined data types.  These were base class data types, with no constructors or destructors.


I considered using the traditional "htonl" or "ntohl" functions, but I decided against that because there were thousands of big-endian cell references that would require these traditional functions and they didn't work well in generalized expressions involving subscripts and pointers.


What xlong and xshort do is define cells on little-endian systems that use "methods" to execute the operators, like add or subtract.  The "byte-swapping" needed to do the work properly is done by these methods, not by "ntohl", etc.  The advantage was that all I had to do was declare a cell as xlong or xshort, and all byte-swapping for those cells was done automatically.  I didn't have to find every occurrence of the cells in the executable code and apply functions like "htonl" to each reference.


On big-endian systems, xlong and xshort default to standard long and short, and therefore they work directly without methods.


In a traditional C program dealing with a mixture of big-endian and little-endian data, you might have coded something like this:

Now, with xlong/xshort, you can code this:

As you can see, this is much simpler, and you don't have to worry about forgetting to apply "ntohl" or "htonl" functions.  The other beauty of this is that you could have a little-endian program that deals with big-endian data, or a big-endian program that deals with little-endian data.  What matters is whether you want xlong/xshort to byte-swap, or not.
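A minimal sketch of that contrast, under stated assumptions: swapL here is a hypothetical unconditional swap standing in for "ntohl"/"htonl", std::int32_t stands in for the 32-bit long of the original (long is 64 bits on many current platforms), and the xlong class is a stripped-down illustration rather than the real xlong.h.  PlainCell, XCell, bump_plain and bump_x are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for ntohl/htonl: always reverses the four bytes.
static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}

// Simplified stand-in for xlong: the cell keeps foreign-order bytes and
// swaps them on every load and store.
class xlong {
    std::int32_t xval;
public:
    xlong& operator=(std::int32_t v) { xval = swapL(v); return *this; }
    operator std::int32_t() const { return swapL(xval); }
};

struct PlainCell { std::int32_t count; };  // caller must swap at every use
struct XCell     { xlong count; };         // swapping lives inside the type

// Traditional style: each reference is wrapped by hand.
inline void bump_plain(PlainCell& r) { r.count = swapL(swapL(r.count) + 1); }

// Wrapper style: the expression reads like ordinary arithmetic.
inline void bump_x(XCell& r) { r.count = r.count + 1; }

inline void demo() {
    const unsigned char image[4] = {0x00, 0x00, 0x00, 0x2A};
    PlainCell p;
    XCell x;
    std::memcpy(&p, image, sizeof p);
    std::memcpy(&x, image, sizeof x);
    bump_plain(p);
    bump_x(x);
    // Both styles leave identical bytes in memory.
    assert(std::memcmp(&p, &x, sizeof p) == 0);
}
```

Both functions do the same work; the difference is only where the swap calls are written.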


You can easily write "universal binary" code which compiles to do byte-swapping for xlong/xshort in one version, and treats xlong/xshort as normal long/short in the other version.  What triggers the swap logic is -DLS_BYTE_SWAP, a defined flag passed to the compiler.
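One way such a compile-time switch could be arranged is sketched below.  Only the flag name LS_BYTE_SWAP comes from the text; the class body, the swapL helper, the round_trip function and the use of std::int32_t in place of a 32-bit long are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

#ifdef LS_BYTE_SWAP
// Swapping build: xlong is a class whose loads and stores reverse the bytes.
inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}
class xlong {
    std::int32_t xval;
public:
    xlong& operator=(std::int32_t v) { xval = swapL(v); return *this; }
    operator std::int32_t() const { return swapL(xval); }
};
#else
// Non-swapping build: xlong is just another name for the plain integer.
typedef std::int32_t xlong;
#endif

// The same source compiles in both builds.
inline std::int32_t round_trip(std::int32_t v) {
    xlong cell;
    cell = v;
    return static_cast<std::int32_t>(cell);
}

inline void demo() {
    assert(round_trip(1234) == 1234);
}
```

With a hypothetical file name, "g++ -DLS_BYTE_SWAP demo.cpp" would select the swapping class and plain "g++ demo.cpp" the typedef.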


"xlong" looks a little like this:


For the sake of brevity, I left out a huge portion of the signatures.  Every possible unary and binary operator has to be represented with almost every standard type:  char, int, short, long, and unsigned.  What is shown above is just assignment and addition for long, short, and these new data types.  The entire xlong.h file is over 1400 lines.  A few of the assignment operators and the "bool" operator could be done inline.  "bool" is used for things like: if (xcell) ...
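To give a feel for the kind of signatures being described, here is a much-reduced sketch covering only assignment, addition and the "bool" conversion.  It is not the real xlong.h: the member bodies, the inline swapL helper and the choice of std::int32_t and std::int16_t for long and short are all assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unconditional 32-bit byte swap.
static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}

class xlong {
    std::int32_t xval;  // held in swapped byte order; no constructors
public:
    // Assignment from the standard types.
    xlong& operator=(std::int32_t v) { xval = swapL(v); return *this; }
    xlong& operator=(std::int16_t v) { xval = swapL(v); return *this; }

    // Conversions back to standard types.
    operator std::int32_t() const { return swapL(xval); }
    operator bool() const { return xval != 0; }  // zero is zero in any order

    xlong& operator+=(std::int32_t v) {
        xval = swapL(swapL(xval) + v);
        return *this;
    }

    // Addition with each operand combination yields a host-order value.
    friend std::int32_t operator+(const xlong& l, const xlong& r) {
        return swapL(l.xval) + swapL(r.xval);
    }
    friend std::int32_t operator+(const xlong& l, std::int32_t r) {
        return swapL(l.xval) + r;
    }
    friend std::int32_t operator+(std::int32_t l, const xlong& r) {
        return l + swapL(r.xval);
    }
};

inline void demo() {
    xlong a, b;
    a = 40;
    b = std::int16_t(2);
    assert(a + b == 42);
    assert(a + 1 == 41 && 1 + a == 41);
    a += 2;
    assert(static_cast<std::int32_t>(a) == 42);
    if (a) { a = 0; }  // the "if (xcell)" form uses operator bool
    assert(!a);
}
```

Even this fragment shows why the full header grows so large: each operator needs one overload per operand combination.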


The remaining signatures needed to be implemented.  That was handled by xlong.c, a module with over 3200 lines of code.  Here's a sample:


Again, this is just a taste, corresponding to a few of the signatures shown in the xlong.h file.  Notice the use of "swapL" or "swapS" or cast operators, like the (long)right example.  The "swap" routines are static procedures at the start of xlong.c, and they byte-swap their standard type input and return the same standard type (swapped).
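A sketch of what unconditional swap routines of this shape might look like follows.  The names swapL and swapS come from the text; the bodies are an assumption, written with shifts and masks on fixed-width types (std::int32_t and std::int16_t standing in for long and short).

```cpp
#include <cassert>
#include <cstdint>

// Reverse the four bytes of a 32-bit value, whatever the host byte order.
static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}

// Exchange the two bytes of a 16-bit value.
static inline std::int16_t swapS(std::int16_t v) {
    const std::uint16_t u = static_cast<std::uint16_t>(v);
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>((u >> 8) | (u << 8)));
}

inline void demo() {
    assert(swapL(0x01234567) == 0x67452301);
    assert(swapS(0x0123) == 0x2301);
    assert(swapL(swapL(-2)) == -2);  // swapping twice restores the value
}
```

Because the routines never consult the host byte order, they behave the same on big-endian and little-endian machines, which is what makes testing on either platform possible.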


I could just as easily have used the traditional "htonl" or "ntohl" functions, but I decided against that because I wanted to experiment on big-endian platforms with test programs using xlong and xshort.  By coding my own unconditional swap process, I guaranteed both test capability and transportability.  But, I'm trying to get across a concept here, not the details of implementation.


Surprisingly this worked!  All I had to do was change the DSECT cells in the C-struct definitions from "long" or "short" into "xlong" or "xshort".  Since the xlong and xshort classes do NOT define their own constructors or destructors, the private "long xval" and "short xval" directly overlays the INTEGER and SHORT INTEGER cells of the DSECTS.
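The overlay property can be checked at compile time.  The sketch below assumes simplified xlong and xshort classes (not the real ones) and a made-up three-field layout called DsectRaw/DsectX; it verifies that the wrapper types add no size and remain trivially copyable, so a raw image can be copied straight over them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}
static inline std::int16_t swapS(std::int16_t v) {
    const std::uint16_t u = static_cast<std::uint16_t>(v);
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>((u >> 8) | (u << 8)));
}

// Simplified stand-ins: one private data member, no constructors.
class xlong {
    std::int32_t xval;
public:
    xlong& operator=(std::int32_t v) { xval = swapL(v); return *this; }
    operator std::int32_t() const { return swapL(xval); }
};
class xshort {
    std::int16_t xval;
public:
    xshort& operator=(std::int16_t v) { xval = swapS(v); return *this; }
    operator std::int16_t() const { return swapS(xval); }
};

// Hypothetical DSECT layout, before and after the type change.
struct DsectRaw { std::int32_t total; std::int16_t len; std::int16_t flags; };
struct DsectX   { xlong total; xshort len; xshort flags; };

static_assert(sizeof(xlong) == sizeof(std::int32_t), "xlong adds no bytes");
static_assert(sizeof(DsectX) == sizeof(DsectRaw), "layouts have equal size");
static_assert(std::is_trivially_copyable<DsectX>::value, "safe to memcpy");
static_assert(std::is_standard_layout<DsectX>::value, "fields stay in order");

inline void demo() {
    const unsigned char image[8] = {0x01, 0x23, 0x45, 0x67,
                                    0x00, 0x10, 0x80, 0x00};
    DsectRaw raw;
    DsectX x;
    std::memcpy(&raw, image, sizeof raw);
    std::memcpy(&x, image, sizeof x);
    // Each wrapped field reads as the swapped form of the raw field;
    // on a little-endian host x.total therefore reads 0x01234567.
    assert(static_cast<std::int32_t>(x.total) == swapL(raw.total));
    assert(static_cast<std::int16_t>(x.len) == swapS(raw.len));
}
```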


Well, that wasn't ALL I had to do...there was considerably more.  But almost all of what remained could be traced to several restrictions imposed by C++.  xlong and/or xshort cells could NOT be included in unions.  There were several dozen of these unions in the C constructs.


Again, this is just a sample.  It isn't possible to place an xlong in the long_overlay union.  Therefore, "psds" will always be treated as a little-endian long on a little-endian machine.


But that's where "xrefl" comes to the rescue (see xlong.h).


If the original code has the following:


This would be WRONG on a little-endian machine.  The big-endian value in psds would get its bytes swapped when loaded.  But:


This is correct!  What xrefl is doing is telling the compiler to treat the "long" (ps->PS_L.l) as an xlong.  The big-endian bytes are swapped by swapL, and the result is a little-endian value that can properly be ANDed with 0xFFFFFFL and placed into the lval cell.  Remember, on little-endian machines longs are carried around in memory with bytes swapped from their big-endian equivalent.  A cell stored as 01234567 by a big-endian machine is read as 67452301 by a little-endian machine.  Here the cell holds a big-endian value, so from the little-endian machine's point of view its bytes are in the wrong order.  xrefl turns this into a little-endian value, and we proceed properly.  The nice thing about xrefl and xrefs is that they can be used on the left or the right of an assignment operator:  xrefl(a) = lv = xrefs(b);
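The real xrefl and xrefs live in xlong.h and are not reproduced here.  As an illustration of the same behaviour, the sketch below uses a different technique, small proxy objects that hold a reference to the plain cell; the class names xlong_ref and xshort_ref are invented, and xrefl/xrefs are written as functions rather than macros.

```cpp
#include <cassert>
#include <cstdint>

static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}
static inline std::int16_t swapS(std::int16_t v) {
    const std::uint16_t u = static_cast<std::uint16_t>(v);
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>((u >> 8) | (u << 8)));
}

// Proxy for a plain 32-bit cell: swaps on load and on store.
class xlong_ref {
    std::int32_t& cell;
public:
    explicit xlong_ref(std::int32_t& c) : cell(c) {}
    operator std::int32_t() const { return swapL(cell); }
    xlong_ref& operator=(std::int32_t v) { cell = swapL(v); return *this; }
};

// Proxy for a plain 16-bit cell.
class xshort_ref {
    std::int16_t& cell;
public:
    explicit xshort_ref(std::int16_t& c) : cell(c) {}
    operator std::int16_t() const { return swapS(cell); }
    xshort_ref& operator=(std::int16_t v) { cell = swapS(v); return *this; }
};

inline xlong_ref  xrefl(std::int32_t& c) { return xlong_ref(c); }
inline xshort_ref xrefs(std::int16_t& c) { return xshort_ref(c); }

inline void demo() {
    std::int32_t a = 0, lv = 0;
    std::int16_t b = 0;
    xrefs(b) = 0x0123;            // store: b now holds the swapped bytes
    xrefl(a) = lv = xrefs(b);     // usable on both sides of an assignment
    assert(lv == 0x0123);         // lv is an ordinary host-order value
    assert(a == swapL(0x0123));   // a holds the swapped representation
    assert((xrefl(a) & 0xFFFFFF) == 0x0123);  // load, then mask
}
```

The plain cells a and b could equally be members of a union, since the proxy only needs a reference to an ordinary integer.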


An alternative to using xrefl or xrefs everywhere a cell like psds is referenced is to change the definition to something like this:


Now something like ps->psds becomes psds(ps).  The advantage is that the xlong or xshort nature of psds is contained in the #define, and not in the program references.  The (p) parameter is needed to allow any pointer name, not just ps.  There could be two PASS_STACK objects with ps1 and ps2 pointers, so you could reference psds with either pointer.
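A hedged sketch of such an accessor macro follows.  The names PASS_STACK, PS_L, long_overlay and psds come from the text, but the layouts shown are invented, and xrefl is again a hypothetical proxy-returning function rather than the real macro.

```cpp
#include <cassert>
#include <cstdint>

static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}

// Hypothetical proxy: swaps on load and on store of a plain cell.
class xlong_ref {
    std::int32_t& cell;
public:
    explicit xlong_ref(std::int32_t& c) : cell(c) {}
    operator std::int32_t() const { return swapL(cell); }
    xlong_ref& operator=(std::int32_t v) { cell = swapL(v); return *this; }
};
inline xlong_ref xrefl(std::int32_t& c) { return xlong_ref(c); }

// Invented layouts: a union holding a plain cell, inside a stack record.
union long_overlay { std::int32_t l; unsigned char c[4]; };
struct PASS_STACK { long_overlay PS_L; };

// The byte-swapping nature of the field is stated once, here.
#define psds(p) xrefl((p)->PS_L.l)

inline void demo() {
    PASS_STACK s1, s2;
    PASS_STACK* ps1 = &s1;
    PASS_STACK* ps2 = &s2;
    psds(ps1) = 0x01234567;            // works with any pointer name
    psds(ps2) = psds(ps1) & 0xFFFFFF;
    assert(psds(ps2) == 0x234567);
    assert(s1.PS_L.l == swapL(0x01234567));
}
```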


Another source of error was the "memcpy" function.  Data is moved from one place to another, byte-by-byte.  If big-endian values were moved to little-endian cells, no byte-swapping occurred.  Sometimes this had to be done with huge sections of the DSECTS, or with other data that wasn't aligned on boundaries corresponding to its basic size.  There were long and short data values used within the DSECTS that were buried in streams of data, and could appear anywhere.  Just consider a short length before a string that could be odd or even in length, followed by another short length, etc.  If the first string was odd in length, the short following would NOT be aligned on a short boundary.


So the next problem to solve involved "memcpy" or other methods for moving data that might move big-endians into little-endian cells.  The simplest solution was to use xlong and xshort as recipients.
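As an illustration of using an xshort as the recipient, the sketch below copies a length field from an arbitrary (possibly odd) offset into a simplified xshort and lets the type do the swap when the value is read.  The xshort class, the read_len helper and the sample stream are assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static inline std::int16_t swapS(std::int16_t v) {
    const std::uint16_t u = static_cast<std::uint16_t>(v);
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>((u >> 8) | (u << 8)));
}

// Simplified stand-in for xshort.
class xshort {
    std::int16_t xval;
public:
    xshort& operator=(std::int16_t v) { xval = swapS(v); return *this; }
    operator std::int16_t() const { return swapS(xval); }
};

// Copy two raw bytes into an xshort; the swap happens on conversion.
// memcpy is used because p need not be aligned on a short boundary.
inline std::int16_t read_len(const unsigned char* p) {
    xshort n;
    std::memcpy(&n, p, sizeof n);
    return n;
}

inline void demo() {
    // length 3, "abc", length 2, "hi": the second length sits at offset 5.
    const unsigned char stream[9] = {0x00, 0x03, 'a', 'b', 'c',
                                     0x00, 0x02, 'h', 'i'};
    std::int16_t native;
    std::memcpy(&native, stream + 5, sizeof native);
    // The xshort recipient yields the swapped form of the native load;
    // on a little-endian host that is the intended length, 2.
    assert(read_len(stream + 5) == swapS(native));
}
```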


Next, what can be done about "&cell" when the cell is an xlong?  It was no longer permissible to pass addresses of objects unless xlong* or xshort* was in the signature of the receiving function.  But there are many system-supplied functions that don't know anything about xlong or xshort objects.  In particular, you can't pass them to functions like "printf" because they have indeterminate parms.  C++ doesn't allow user-defined types in such functions.
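For the variable-argument case, one workable pattern is to convert the cell to a standard type at the call site.  The sketch below assumes a simplified xlong and a hypothetical helper, format_cell; it uses std::snprintf so the result can be checked.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}

// Simplified stand-in for xlong.
class xlong {
    std::int32_t xval;
public:
    xlong& operator=(std::int32_t v) { xval = swapL(v); return *this; }
    operator std::int32_t() const { return swapL(xval); }
};

// The cast turns the argument into an ordinary long before it reaches
// the variable-argument list, so "%ld" sees a host-order value.
inline void format_cell(const xlong& cell, char* out, std::size_t n) {
    std::snprintf(out, n, "%ld", static_cast<long>(cell));
}

inline void demo() {
    xlong cell;
    cell = 42;
    char buf[16];
    format_cell(cell, buf, sizeof buf);
    assert(std::strcmp(buf, "42") == 0);
}
```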


In many cases, I was forced to move an xlong or xshort into a long or short, and then pass &that.  Upon return, the result in the long or short had to be re-assigned to the xlong or xshort.  Remember, the whole idea in passing &cell is to allow the receiving function the capability of changing the cell.  But it doesn't know it is xlong unless the function's signature says it is xlong*, which it wasn't in most cases, or I couldn't change it.
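The copy-in, copy-out pattern described above can be sketched as follows.  The xlong class is a simplified stand-in, and increment_in_place is a hypothetical function representing any routine whose signature takes a plain integer pointer and cannot be changed.

```cpp
#include <cassert>
#include <cstdint>

static inline std::int32_t swapL(std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    return static_cast<std::int32_t>((u >> 24) | ((u >> 8) & 0x0000FF00u) |
                                     ((u << 8) & 0x00FF0000u) | (u << 24));
}

// Simplified stand-in for xlong.
class xlong {
    std::int32_t xval;
public:
    xlong& operator=(std::int32_t v) { xval = swapL(v); return *this; }
    operator std::int32_t() const { return swapL(xval); }
};

// Hypothetical routine that only understands plain integers.
inline void increment_in_place(std::int32_t* p) { *p += 1; }

inline void demo() {
    xlong cell;
    cell = 41;
    // increment_in_place(&cell) would not compile: &cell is an xlong*.
    std::int32_t tmp = cell;    // copy in: swapped to host order
    increment_in_place(&tmp);   // callee works on the plain integer
    cell = tmp;                 // copy out: swapped back on store
    assert(static_cast<std::int32_t>(cell) == 42);
}
```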


Something else about &cell should be mentioned.  xlong and xshort cells can't be defined in a union, but xlong* and xshort* can!  That leads to an interesting capability:


The answer is: x gets the big-endian value in ps->PS_L.l without byte swapping, which is EXACTLY what we want in x, a big_endian.  So instead of using "xrefl", it is possible to retrieve values as though they had a different type.  In this case, the "long" is being treated as an xlong because *ptr.pxl says to dereference a pointer to an xlong.  Of course this trick requires setting up another union, the ptr_overlay pointers.  That becomes difficult if you use ptr for other types and addresses.  You must keep track of what address (and type) was last placed in a pointer.  Finally, there were several cases when it would have been nice to pass a long to an xlong, but although the receiving functions had the proper xlong declaration, the compiler would not allow it.


You CAN cast an xlong into a standard long, or assign a long to an xlong in an assignment statement, but you CANNOT cause an implicit cast of a long into an xlong in passing to functions.


Since you can't send xlong values to functions that don't have xlong parameters, you have to either use long or the xrefl macro to get a long parameter.  Here is a simple example:


In summary, substituting xlong and xshort for long and short solved a lot of problems.  Unions were a problem, but could be circumvented by things like xrefl or xrefs.  "memcpy" was a problem; and passing the address of an object was a problem when the object was xlong or xshort and the receiving functions couldn't have signature changes.


But the good news is that using xlong and xshort made the code conversion MUCH simpler.  Of those 22,000 lines of code, less than 15% had to be modified, and most of that was automated using scripts to apply changes to multiple items in a single pass.  For example, all the long to xlong changes in the C-struct equivalent of a DSECT could be done with one editing command.


The other nice thing was that this same code still compiles under the "gcc" compiler on a big-endian machine without change.  The xlong and xshort data types simply revert back to standard long and short, and everything is big-endian.


For those of you interested in some details, you are welcome to view a couple of the executable script files used to build the xlong.h file, both xlong and xshort signatures, and their friends.  Of course there's other glue to hold everything together.


Standard signature generator        Friends signature generator


For those of you interested in more details, you are welcome to view a sample program run on a PowerPC 750.  There is a listing of the sample program, and then disassembly output of the code.  Although this was done on a Big-Endian machine, it still demonstrates the byte swapping capabilities of the xlong and xshort data types.  Notice that xload and xstor code is identical, but at different locations.  I made them separate static functions for debugging purposes, and it doesn't hurt to keep them separate.


Sample Program


Finally, you can use these same xlong/xshort data-types on a big-endian platform where you are dealing with little-endian data.  The swap routines are coded into xlong.c, and they are activated by including xlong.h and passing a -DLS_BYTE_SWAP definition (for example, from a make-file) to the "g++" compiler, so the same mechanism works on a big-endian machine.  This may be handy.


All of this xlong/xshort code is protected by a Copyright under the title "BYTE-SWAPPING DATA-TYPES for BIG-ENDIAN and LITTLE-ENDIAN MACHINES" dated October 29, 2005 with the Registration Number: "TX 6-265-077".  The alternative short title is: "xlong / xshort data-types".  Any unauthorized use is strictly forbidden.  If you need to use this code, contact: Dick Guertin...


Last modified Monday, 08-Sep-2008 07:57:50 PM
