{"id":107343,"date":"2022-11-02T07:00:00","date_gmt":"2022-11-02T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=107343"},"modified":"2022-11-02T07:47:44","modified_gmt":"2022-11-02T14:47:44","slug":"20221102-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20221102-00\/?p=107343","title":{"rendered":"A history of the <CODE>fd_set<\/CODE>, <CODE>FD_SETSIZE<\/CODE>, and how it relates to WinSock"},"content":{"rendered":"<p>The <code>select<\/code> function takes three sets of file descriptors (<code>readfds<\/code>, <code>writefds<\/code>, <code>exceptfds<\/code>) and waits for any of the following things to happen:<\/p>\n<ul>\n<li>A file descriptor in <code>readfds<\/code> is available to read without blocking. (Usually this means that there is data available.)<\/li>\n<li>A file descriptor in <code>writefds<\/code> is available to write without blocking. (Usually this means that there is buffer space available.)<\/li>\n<li>A file descriptor in <code>exceptfds<\/code> is in an error state.<\/li>\n<li>The call times out, if a timeout is provided.<\/li>\n<\/ul>\n<p>The function signature is<\/p>\n<pre>int select(int nfds, fd_set *readfds,\r\n    fd_set *writefds, fd_set *exceptfds,\r\n    struct timeval *timeout);\r\n<\/pre>\n<p>where <code>nfds<\/code> is one greater than the highest-numbered file descriptor in any of the <code>fd_set<\/code>s.<\/p>\n<p>When originally implemented, the ABI representation of <code>fd_set<\/code> was a bit array consisting of <code>nfds<\/code> bits.\u00b9 The bits are numbered starting with zero, and the file descriptors you want to be part of the <code>fd_set<\/code> have the corresponding bits set in the array.<\/p>\n<p>For example, if you wanted the <code>fd_set<\/code> to represent file descriptors 4 and 6, you would use a bit array whose first byte was <code>(1 &lt;&lt; 4) | (1 &lt;&lt; 6) = 0x50<\/code>.<\/p>\n<p>The <code>sys\/select.h<\/code> header file provided a stock implementation of <code>fd_set<\/code>: The stock implementation represented a a bit array of <code>FD_SETSIZE<\/code> bits. You could configure the header file by defining the <code>FD_SETSIZE<\/code> preprocessor symbol to the number of bits you desire, and if you didn&#8217;t customize the value, it defaulted to 1024. (We&#8217;ll see why later.)<\/p>\n<p>The physical representation of the <code>fd_set<\/code> was hidden behind macros, and you were expected to use those macros rather than trying to build the bitmap yourself.<\/p>\n<pre>FD_ZERO(set)        \/\/ set all bits in the fd_set to zero\r\nFD_SET(set, fd)     \/\/ set bit \"fd\" in the fd_set to 1\r\nFD_CLR(set, fd)     \/\/ set bit \"fd\" in the fd_set to 0\r\nFD_ISSET(set, fd)   \/\/ returns nonzero if bit \"fd\" is set\r\n<\/pre>\n<p>Picking a value for <code>FD_SETSIZE<\/code> size is a bit of a balancing act. You don&#8217;t want the bitmap to be too big, because that&#8217;s a lot of memory for each <code>fd_set<\/code>, and the <code>select<\/code> function is going to have to scan that many bits to find the ones that are set. On the other hand, you need it to be big enough to handle all reasonable workloads.<\/p>\n<p>The operating system itself imposes a limit on the size of the bitmap, because large bitmaps require a lot of memory and are slow to scan for set bits.\u00b2 At the time that <code>select<\/code> was invented, each process was limited to 1024 simultaneous open files, which seemed like a generous allowance at the time. I mean, who&#8217;s going to write a program that opens over a thousand sockets at once?<\/p>\n<p>Since you could have at most 1024 open file descriptors at a time, file descriptors were in practice always in the range 0\u20261023, and therefore a default value for <code>FD_SETSIZE<\/code> of 1024 would be sufficient to cover all the file descriptors a program could possibly open.<\/p>\n<p>Although the <code>fd_set<\/code> was in theory an opaque data type, it was effectively immutable because the management of the <code>fd_set<\/code> was done by macros, so all of the manipulations were hard-coded into every program that used <code>fd_set<\/code>s. And of course the entire thing was a bit array at the ABI layer, and you don&#8217;t want to break the ABI.<\/p>\n<p>The 1024-file descriptor limit became a problem as computers became larger and more powerful, and the choice of representation as a bit array meant that just increasing the bit array size limit was not a good solution. If you wanted to wait for file descriptors 4000 and 8000, you&#8217;d have to allocate three kilobyte-sized bit arrays. That&#8217;s three big bit arrays just to set two bits.<\/p>\n<p>At this point, the history splits into two story lines.<\/p>\n<p>On the Unix side, instead of making the bit arrays larger and larger, POSIX invented <code>poll<\/code>, a replacement for <code>select<\/code> which expresses the file descriptor set differently. Instead of being a bit array whose size is proportional to the total number of file descriptors the process has opened, it&#8217;s an array of <code>pollfd<\/code> structures, each of which describes a file descriptor you want to wait for.<\/p>\n<p>This brings the size and complexity of the operation down to a reasonable level, since the amount of memory that is needed now grows with the number of file descriptors you are monitoring, rather than with the total number of file descriptors in the process. (Linux, meanwhile, added their own extension to <code>poll<\/code> called <code>epoll<\/code>.)<\/p>\n<p>The <code>select<\/code> function never gained the ability to pass bit arrays larger than 1024 bits. The functionality got frozen at its old level of support, with the expectation that people who need to manage thousands of file descriptors will move to <code>poll<\/code>.<\/p>\n<p>Meanwhile, WinSock wanted to maintain source compatiblity with existing Unix-based networking code, including code that used <code>select<\/code>. On the other hand, the <code>fd_set<\/code> design didn&#8217;t fit well with Windows, because Windows sockets are handles (arbitrary pointer-sized values), not guaranteed to be small integers. Fortunately, WinSock could invent its own ABI, since there were no pre-existing binaries to be compatible with. The ABI for <code>fd_set<\/code> in WinSock takes a different form:<\/p>\n<pre>struct fd_set_abi\r\n{\r\n    uint32_t count;\r\n    SOCKET   sockets[count]; \/\/ variable-length array\r\n};\r\n<\/pre>\n<p>The WinSock ABI for the <code>fd_set<\/code> consists of a count (specifying how many sockets are in the set), followed by an array of that many socket handles.<\/p>\n<p>The various <code>FD_<\/code> macros were redefined so that they preserved the programming interface but operated on this alternate representation on an <code>fd_set<\/code>.<\/p>\n<p>This alternate design avoids tying the bit array to the numeric values of the socket handles, which could in theory force an array larger than available memory: If a handle happened to be <code>0xdeadbeef<\/code>, that would require nearly a half-gigabyte-sized bit array just so you could set the 3 billionth bit (and leave all the other bits zero). It also avoids the problem of having to validate and capture very large bit arrays at the kernel boundary.<\/p>\n<p>As with the original <code>sys\/select.h<\/code>, you can define the preprocessor symbol <code>FD_SETSIZE<\/code> before including <code>winsock.h<\/code>, and that changes the size of the default implementation of <code>fd_set<\/code> as well as the behavior of the corresponding <code>fd_set<\/code>-manipulation macros.<\/p>\n<p>A caveat applies to both the Unix-style and the WinSock-style <code>fd_set<\/code> structures: Since the definition changes depending on how you set <code>FD_SETSIZE<\/code>, a program which defines <code>FD_SETSIZE<\/code> differently in different translation units is technically in violation of the C++ standard One Definition Rule, because it results in different definitions of <code>fd_set<\/code> in different translation units. In practice, you usually get away with it if you pass the <code>fd_set<\/code> objects only by address or by reference. Still, bad things will happen if one function creates an <code>fd_set<\/code> with one <code>FD_SETSIZE<\/code>, and passes it to another function that was compiled with a different <code>FD_SETSIZE<\/code>, and the recipient function tries to modify the <code>fd_set<\/code>, because the recipient will operate on the assumption that the <code>fd_set<\/code> can hold <code>FD_SETSIZE<\/code> entries, for a different value of <code>FD_SETSIZE<\/code>.<\/p>\n<p>For Windows, you can define your own ABI-compatible version of <code>fd_set<\/code> that encodes its own maximum size:<\/p>\n<pre>\/\/ Warning: Works only for WinSock fd_set.\r\n\r\ntemplate&lt;uint32_t Capacity = FD_SETSIZE&gt;\r\nstruct fd_setN\r\n{\r\n    uint32_t count = 0;\r\n    std::array&lt;SOCKET, Capacity&gt; array;\r\n\r\n    operator fd_set*() noexcept { return reinterpret_cast&lt;fd_set*&gt;(this); }\r\n\r\n    constexpr auto capacity() const noexcept { return Capacity; }\r\n    auto begin() noexcept { return &amp;array[0]; }\r\n    auto end() noexcept { return &amp;array[count]; }\r\n\r\n    bool add(SOCKET s) noexcept\r\n    {\r\n        if (count &gt;= capacity()) return false;\r\n        array[count++] = s;\r\n        return true;\r\n    }\r\n\r\n    bool remove(SOCKET s) noexcept\r\n    {\r\n        auto it = std::find(begin(), end(), s);\r\n        if (it == end()) return false;\r\n        *it = array[--count];\r\n        return true;\r\n    }\r\n\r\n    bool contains(SOCKET s) noexcept const\r\n    {\r\n        return std::find(begin(), end(), s) != end();\r\n    }\r\n\r\n    bool clear() noexcept\r\n    {\r\n        count = 0;\r\n    }\r\n};\r\n<\/pre>\n<p>You are welcome to add your own methods to this class, like say <code>push_back<\/code> to make it satisfy more of the C++ Container requirements,\u00b9 but this is the basic idea.<\/p>\n<p>Of course, passing large arrays to <code>select<\/code> generates a lot of busy-work both for you and the operating system: You have to go into a loop filling the <code>fd_set<\/code> before each call to <code>select<\/code> (because <code>select<\/code> modifies the <code>fd_set<\/code> before returning), and the operating system has to go into a loop to set up a monitor to track the status of each socket, and then another loop to cancel the monitor once one of them becomes ready.<\/p>\n<p>A better model for this is to use <code>WSASelectEvent<\/code> to associate an event with each socket. This association is independent of activity on other sockets, so you don&#8217;t have to keep re-establishing it.<\/p>\n<p>Even better would be to use an I\/O completion port to deal with each socket&#8217;s readiness.<\/p>\n<p><b>Exercise<\/b>: Armed with this information, maybe you can help this customer, who reported that <code>FD_ISSET<\/code> is unreliable on Windows:<\/p>\n<pre>struct socket_manager\r\n{\r\n    fd_set sockets;\r\n\r\n    socket_manager()\r\n    {\r\n        FD_ZERO(&amp;sockets);\r\n    }\r\n\r\n    void add(int socket)\r\n    {\r\n        FD_SET(socket, &amp;sockets);\r\n    }\r\n\r\n    bool select()\r\n    {\r\n        fd_set ready = sockets;\r\n        if (select(FD_SETSIZE, &amp;ready,\r\n                   nullptr, nullptr, nullptr) &lt; 0) {\r\n            return false;\r\n        }\r\n        \/\/ Process each ready socket.\r\n        for (int socket = 0; socket &lt; FD_SETSIZE; socket++) {\r\n            if (FD_ISSET(socket, &amp;ready)) {\r\n                DoSomething(socket);\r\n            }\r\n        }\r\n    }\r\n\r\n};\r\n<\/pre>\n<p>The customer found that the <code>FD_ISSET<\/code> always returns 0. The <code>FD_ISSET<\/code> macro cannot find any of the sockets that were put into it. What&#8217;s going on?<\/p>\n<p><b>Bonus chatter<\/b>: People unfamiliar with the history of <code>select<\/code> and <code>fd_set<\/code> will sometimes ask, &#8220;Why do I have to pass <code>nfds<\/code> as the first parameter to <code>select<\/code>? The system already knows the size of an <code>fd_set<\/code>. Why do I have to tell it?&#8221; You have to tell it because the <code>fd_set<\/code> is variable-sized structure. Your program might set a custom value of <code>FD_SETSIZE<\/code>, and the <code>nfds<\/code> tells the operating system how big those bit arrays are.<\/p>\n<p>I&#8217;ve seen arguments that <a href=\"https:\/\/unix.stackexchange.com\/questions\/7742\/whats-the-purpose-of-the-first-argument-to-select-system-call\"> <code>nfds<\/code> was a performance optimization<\/a>, but really it was just a requirement for the ABI to know how many bits to scan. The performance optimization was a side effect.<\/p>\n<p><b>Bonus reading<\/b>: <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20161221-00\/?p=94985\"> What is the maximum numeric value for a socket, and what is the maximum number of sockets a Windows program can create?<\/a><\/p>\n<p><b>Answer to exercise<\/b>: There are no macros for enumerating the contents of an <code>fd_set<\/code>, so this code tries to fake it by simply trying &#8220;every possible number&#8221; and seeing if it&#8217;s a socket. The problem is that the numeric values of the Windows sockets created by the program are all larger than <code>FD_SETSIZE<\/code>, Therefore, the loop through &#8220;all possible sockets&#8221; numbered 0 through <code>FD_SETSIZE - 1<\/code> doesn&#8217;t find them. According to <a href=\"https:\/\/pubs.opengroup.org\/onlinepubs\/9699919799\/functions\/FD_ISSET.html\"> the POSIX documentation for <code>FD_ISSET<\/code><\/a>, the file descriptor parameter to <code>FD_ISSET<\/code> must be a valid file descriptor; you can&#8217;t just loop over &#8220;every possible number&#8221; hunting for them. The customer needs to adapt their code so they keep track of the handles which they have put in their <code>fd_set<\/code> and iterate over those handles (and only those handles) when calling <code>FD_ISSET<\/code>.<\/p>\n<p>\u00b9 If you want it to support equality comparison, you&#8217;ll probably have to switch to a sorted array.<\/p>\n<p>\u00b2 And an attacker could trigger a memory exhaustion denial of service by tricking the kernel into taking a snapshot of a a ridiculously-sized bitmap.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The <CODE>fd_set<\/CODE> started out as just a bitmap.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-107343","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>The <CODE>fd_set<\/CODE> started out as just a bitmap.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=107343"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107343\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=107343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=107343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=107343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}