Coming Soon: Transport Advancements in the Windows 10 Creators Update
Windows Networking for Kubernetes
A seismic shift is happening in the way applications are developed and deployed as we move from traditional three-tier software models running in VMs to “containerized” applications and micro-services deployed across a cluster of compute resources. Networking is a critical component in any distributed system and often requires higher-level orchestration and policy management systems to control IP address management (IPAM), routing, load-balancing, network security, and other advanced network policies. The Windows networking team is swiftly adding new features (Overlay networking and Docker Swarm Mode on Windows 10) and working with the larger containers community (e.g. Kubernetes sig-windows group) by contributing to open source code and ensuring native networking support for any orchestrator, in any deployment environment, with any network topology.
Today, I will be discussing how Kubernetes networking is implemented in Windows and managed by an extensible Host Networking Service (HNS) – which is used both in Azure Container Service (ACS) Windows worker nodes and in on-premises deployments – to plumb network policy in the OS.
Note: A video recording of the 4/4 #sig-windows meetup where I describe this is posted here: https://www.youtube.com/watch?v=P-D8x2DndIA&t=6s&list=PL69nYSiGNLP2OH9InCcNkWNu2bl-gmIU4&index=1
Kubernetes Networking
Windows containers can be orchestrated using either Docker Swarm or Kubernetes to help “automate the deployment, scaling, and management of ‘containerized’ applications”. However, the networking model used by these two systems is different.
Kubernetes networking is built on the fundamental requirements listed here and is either agnostic to the network fabric underneath or assumes a flat Layer-2 networking space where all containers and nodes can communicate with all other containers and nodes across a cluster without using NAT (encapsulation is permitted). Windows can support these requirements using a few different networking modes exposed by HNS and working with external IPAM drivers and route configurations.
The other large difference between Docker and Kubernetes networking is the scope at which IP assignment and resource allocation occur. Docker assigns an IP address to every container, whereas Kubernetes assigns an IP address to a Pod, which represents a network namespace and may consist of multiple containers. Windows has an analogous network namespace concept called a network compartment, and a management surface is being built in Windows to allow multiple containers in a Pod to communicate with each other through localhost.
Connectivity between pods located on different nodes in a Kubernetes cluster can be accomplished either by using an overlay (e.g. vxlan) network or without an overlay by configuring routes and IPAM on the underlying (virtual) network fabric. Realizing this network model can be done through:
- CNI Network Plugin
- Implementing the “Routing” interface in Kubernetes code
- External configuration
The sig-windows community (led by Apprenda) did a lot of work to come up with an initial solution for getting Kubernetes networking to work on Windows. The networking teams at Microsoft are building on this work and continue to partner with the community to add support for the native Kubernetes networking model – defined by the Container Network Interface (CNI), which is different from the Container Network Model (CNM) used by Docker – and to surface policy management capabilities through HNS.
Kubernetes networking in Azure Container Service (ACS)
Azure Container Service recently announced Kubernetes general availability which uses a routable-vip approach (no overlay) to networking and configures User-Defined Routes (UDR) in the Hyper-V (virtualization) host for Pod communication between Linux and Windows cluster node VMs. A /24 CIDR IP pool (routable between container host VMs) is allocated to each container host with one IP assigned per Pod (one container).
With the recent Azure VNet for Containers announcement which includes support for a CNI network plugin used in Azure (pre-released here: https://github.com/Azure/azure-container-networking/releases/tag/v0.7), tenants can connect their ACS clusters (containers and hosts) directly to Azure Virtual Networks. This means that individual IPs from a tenant’s Azure VNet IP space will be assigned to Kubernetes nodes and pods in potentially the same subnet. The Windows networking team is also working to build a CNI plugin to support and extend container management through Kubernetes on Windows for on-premises deployments.
Kubernetes networking in Windows
Microsoft engineers across Windows and Azure product groups actively contributed code to the Kubernetes repo to enhance the kube-proxy (used for DNS and service load-balancing) and kubelet (used for Internet access) binaries that are installed on ACS Kubernetes Windows worker nodes. This overcame previously identified gaps so that both DNS and service load-balancing work correctly without the need for Routing and Remote Access Services (RRAS) or netsh port proxy. In this implementation, the Windows network uses Kubernetes' default kubenet plugin without a CNI plugin.
Using HNS, one transparent and one NAT network are created on each Windows container host for inter-Pod and external communication, respectively. Two container endpoints – connected to the Service and Pod networks – are required for each Windows container that will participate in a Kubernetes service. Static routes must be added inside the running Windows containers themselves on the container endpoint attached to the service network.
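One way to create networks of these types on a Windows container host is through Docker, which drives HNS under the covers. The network names, subnets, and gateways below are placeholders for illustration, not the values ACS uses:
docker network create -d transparent --subnet=10.244.10.0/24 --gateway=10.244.10.1 pod_net
# Many Windows builds support only a single NAT network; the default "nat" network created by the
# Docker service can be reused instead of creating a new one.
docker network create -d nat --subnet=172.16.0.0/24 --gateway=172.16.0.1 k8s_nat
docker network ls    # both networks are backed by HNS networks on the host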
In the absence of ACS-managed User-Defined Routes, Out-of-Band (OOB) configuration of these routes needs to be realized in the Cloud Service Provider network, implemented using the "routing" interface of the Kubernetes cloud provider, or handled via overlay networks. Other solutions include using the HNS overlay network driver for inter-Pod communication or using the OVS Hyper-V switch extension with the OVN Controller.
Today, with the publicly available versions of Windows Server and client, you can deploy Kubernetes with the following restrictions:
- One container per Pod
- CNI Network Plugins are not supported
- Each container requires two container endpoints (vNICs) with IP routing manually plumbed
- Service IPs can only be associated with one Container Host and will not be load-balanced
- Policy specifications (e.g. network security) are not supported
What’s Coming Next?
Windows is moving to a faster release cadence such that new platform features will be made available in a matter of months rather than years. In some circumstances, early builds can be made available to Insiders as well as to TAP customers and EEAP partners for early feature validation.
Stay tuned for new features which will be made available soon…
Summary
In this blog post, I described some of the nuances of the Kubernetes networking model and how it differs from the Docker networking model. I also talked about the code updates made by Microsoft engineering teams to the kubelet and kube-proxy binaries for Windows in open source repos to enable networking support. Finally, we ended with how Kubernetes networking is implemented in Windows today and the plans for how it will be implemented through a CNI plugin in the near future…
Windows network performance suffering from bad buffering
Daniel Havey, Praveen Balasubramanian
Windows telemetry results have indicated that a significant number of data connections are using the SO_RCVBUF and/or the SO_SNDBUF Winsock options to statically allocate TCP buffers. There are many websites that recommend setting the TCP buffers with these options in order to improve TCP performance. This is a myth. Using Winsock options (SO_RCVBUF and/or SO_SNDBUF) to statically allocate TCP buffers will not make the Windows networking stack “faster”. In fact, static allocation of the TCP buffers will degrade performance in terms of how fast the connection responds (latency) and how much data it delivers (bandwidth). The Windows transports team officially recommends not doing this.
TCP buffers need to be dynamically allocated in proportion to the Bandwidth Delay Product (BDP) of the TCP connection. There are two good reasons why we should let the Windows networking stack dynamically set the TCP buffers for us rather than setting them statically at the application layer: 1) the application does not know the BDP (TCP does), so it cannot properly set the TCP buffers, and 2) dynamic buffer management requires complex algorithmic control, which TCP already has. In summary, Windows 10 has autotuning for TCP. Let the autotuning algorithm manage the TCP buffers.
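For application developers, these Winsock options usually surface through whatever framework the app uses; in .NET, for example, Socket.ReceiveBufferSize and Socket.SendBufferSize map to SO_RCVBUF and SO_SNDBUF. A minimal PowerShell sketch of the anti-pattern (the server name is a placeholder):
$client = New-Object System.Net.Sockets.TcpClient
# Anti-pattern: setting this property issues setsockopt(SO_RCVBUF) and pins the receive buffer,
# capping throughput at roughly (buffer size / RTT) no matter how fast the path really is.
# $client.Client.ReceiveBufferSize = 262144
# Recommended: do not touch ReceiveBufferSize/SendBufferSize at all; autotuning will grow the
# receive window toward the connection's BDP (bandwidth x RTT) on its own.
$client.Connect("iperf.example.com", 5001)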
Example: I am going to use the Cygwin application as an example since they recently fixed their buffering (thank you Corinna). The experiment is conducted across the Internet to an iperf server in France (from my desk in Redmond).
Experiment 1 — Cygwin (Bad buffering):
Pinging 178.250.209.22 with 32 bytes of data:
Reply from 178.250.209.22: bytes=32 time=176ms TTL=35
Reply from 178.250.209.22: bytes=32 time=173ms TTL=35
Reply from 178.250.209.22: bytes=32 time=173ms TTL=35
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Ping statistics for 178.250.209.22:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 172ms, Maximum = 176ms, Average = 173ms
————————————————————
Client connecting to 178.250.209.22, TCP port 5001
TCP window size: 208 KByte (default)
————————————————————
[ 3] local 10.137.196.108 port 56758 connected with 178.250.209.22 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 512 KBytes 4.19 Mbits/sec
[ 3] 1.0- 2.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 2.0- 3.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 3.0- 4.0 sec 1.25 MBytes 10.5 Mbits/sec
[ 3] 4.0- 5.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 5.0- 6.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 6.0- 7.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 7.0- 8.0 sec 1.25 MBytes 10.5 Mbits/sec
[ 3] 8.0- 9.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 9.0-10.0 sec 1.50 MBytes 12.6 Mbits/sec
[ 3] 0.0-10.1 sec 13.6 MBytes 11.3 Mbits/sec
We can see that the RTT is essentially the same for Experiments 1 and 2, roughly 172–176 ms. However, in Experiment 1 Cygwin has bad buffering: the throughput averages 11.3 Mbps and tops out at 12.6 Mbps. This is because Cygwin was using SO_RCVBUF to allocate a static 278,775-byte TCP receive buffer, so the throughput is buffer limited to 12.6 Mbps.
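The 12.6 Mbps ceiling falls straight out of the buffer size and the RTT; a quick back-of-the-envelope check in PowerShell using the numbers above:
# A buffer/window-limited connection cannot exceed (buffer size / RTT)
$bufferBytes = 278775                      # the static SO_RCVBUF value used in Experiment 1
$rttSec      = 0.176                       # worst-case RTT from the ping output above
($bufferBytes * 8) / $rttSec / 1e6         # ~12.7 Mbits/sec, the observed ~12.6 Mbps ceiling
# Conversely, filling this path takes a buffer of roughly bandwidth x RTT (the BDP)
151e6 * 0.172 / 8 / 1MB                    # ~3.1 MB, which autotuning effectively provides in Experiment 2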
Experiment 2 — Cygwin (Good buffering):
Pinging 178.250.209.22 with 32 bytes of data:
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Reply from 178.250.209.22: bytes=32 time=173ms TTL=35
Ping statistics for 178.250.209.22:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 172ms, Maximum = 173ms, Average = 172ms
————————————————————
Client connecting to 178.250.209.22, TCP port 5001
TCP window size: 64.0 KByte (default)
————————————————————
[ 3] local 10.137.196.108 port 56898 connected with 178.250.209.22 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 768 KBytes 6.29 Mbits/sec
[ 3] 1.0- 2.0 sec 11.8 MBytes 98.6 Mbits/sec
[ 3] 2.0- 3.0 sec 18.0 MBytes 151 Mbits/sec
[ 3] 3.0- 4.0 sec 16.6 MBytes 139 Mbits/sec
[ 3] 4.0- 5.0 sec 16.4 MBytes 137 Mbits/sec
[ 3] 5.0- 6.0 sec 18.0 MBytes 151 Mbits/sec
[ 3] 6.0- 7.0 sec 18.0 MBytes 151 Mbits/sec
[ 3] 7.0- 8.0 sec 18.0 MBytes 151 Mbits/sec
[ 3] 8.0- 9.0 sec 15.6 MBytes 131 Mbits/sec
[ 3] 9.0-10.0 sec 17.4 MBytes 146 Mbits/sec
[ 3] 0.0-10.0 sec 151 MBytes 126 Mbits/sec
In Experiment 2 we see Cygwin perform without static application level buffering. The average throughput is 126 Mbps and the maximum is 151 Mbps which is the true unloaded line speed of this connection. By statically allocating the receive buffer using SO_RCVBUF we limited ourselves to a top speed of 12.6 Mbps. By letting Windows TCP autotuning dynamically allocate the buffers we achieved the true unloaded line rate of 151 Mbps. That is about an order of magnitude better performance. Static allocation of TCP buffers at the app level is a bad idea. Don’t do it.
Sometimes there are corner cases where, as a developer, one might think there is justifiable cause to statically allocate the TCP buffers. Let's take a look at three of the most common causes for thinking this:
1.) Setting the buffers for performance's sake. Don't. TCP autotuning is a kernel-level algorithm and can do a better job than any application-layer algorithm.
2.) Setting the buffers because you are trying to rate limit traffic. Be careful! The results may not be what you expect. In the Cygwin example the connection is buffer limited to 12.6 Mbps maximum. However, if the RTT were to change to about 40 ms then the connection would be limited to about 50 Mbps. You cannot reliably set a bandwidth cap in this manner (see the BDP arithmetic above).
3.) Setting the buffers for some other reason. Let’s have a discussion. Please comment on the post and we will respond.
Troubleshooting certificate issues in Software Defined Networking (SDN)
As you may be aware, Network Controller in Windows Server 2016 uses certificate based authentication for communicating with Hyper-V hosts and Software Load Balancer MUX virtual machines (VMs).
Some SDN customers have complained about communication issues between Network Controller and hosts, although certificates were correctly configured on both entities.
On debugging, we found that the customer had installed a non-self-signed certificate into the computer's Trusted Root Certification Authorities store. Although this certificate was not involved in communication between Network Controller and the hosts, the presence of such a certificate broke client authentication.
The following Knowledge Base article provides information about this issue: Internet Information Services (IIS) 8 may reject client certificate requests with HTTP 403.7 or 403.16 errors
To resolve this issue, you can uninstall the non-self-signed certificate from the Trusted Root Certification Authorities certificate store for the Local Computer, or move the certificate to the Intermediate Certification Authorities store.
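If you suspect this condition, a quick way to spot such certificates is to list everything in the Local Computer Trusted Root store whose Subject does not match its Issuer (a self-signed certificate has identical Subject and Issuer); a simple PowerShell check:
# Lists certificates in the Trusted Root store that are NOT self-signed
Get-ChildItem Cert:\LocalMachine\Root |
    Where-Object { $_.Subject -ne $_.Issuer } |
    Format-List Subject, Issuer, Thumbprint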
One more thing to note is that the Personal (My – cert:\localmachine\my) certificate store on the Hyper-V host must have exactly one X.509 certificate with Subject Name (CN) as the host FQDN. This certificate is used for communication with the Network Controller.
This behavior is due to a bug in the system and will be fixed shortly. For now, please ensure that you have only one certificate with the Subject Name (CN) as the host FQDN.
For more information, see the following topics in the Windows Server 2016 Technical Library.
Anirban Paul, Senior Program Manager
HTTPS Client Certificate Request freezes when the Server is handling a large PUT/POST Request
There is a class of problems that may occur when using client-side certificates in HTTPS.
Sometimes, the server’s request for a client certificate will freeze (until the timeout of two minutes or so) when processing PUT/POST request with a large payload (e.g., >40KB).
Ideally, the server should request the client certificate before any large request exchange.
Otherwise, the server should request the client certificate immediately after either:
- a request has been completely received, or
- a request has been responded to.
If neither is done, the large payload fills the network buffers, which cannot be emptied until the certificate is received and processed. This leads to a deadlock if the server issues a synchronous call for the client certificate. Although requesting the certificate at this point is not illegal, it is what causes the problem. Furthermore, this represents a trivial DoS vector against any such server.
This may depend on the component sitting directly above http.sys. IIS for example, tries to read as much entity body as possible before requesting the client certificate.
These are some alternatives to fix this issue, but only the first one listed below is deterministic.
By Modifying Only the Server side (when the client cannot be modified):
- (recommended) Set "client certificate required" on the SSL binding so that the client certificate is requested at SSL/TLS connection time, before any HTTP request exchange. This forces the client certificate to be requested for every connection on that binding. Depending on your configuration, you might need a dedicated VIP and/or SSL SNI name for this communication. This requires no server code changes, only a configuration change via "netsh http" on the SSL binding: clientcertnegotiation=enable (a sample binding command follows this list).
Note: If the server is IIS-based the change needs to be done through IIS. Otherwise, since IIS has a different config, it may overwrite any changes made directly to Http.sys.
- If the server sees that this is a PUT/POST request, you need to ensure that the server's TCP buffers have enough space for the client certificate when it arrives. This leads to strategies such as:
- reading as much of the entity body as possible before requesting the client certificate using an asynchronous call, or,
- modifying your web server app so that it asynchronously pulls the request body while it waits for client certificate retrieval to finish; if too much entity body is pulled (e.g., several MB) and retrieval has still not finished, cancel the request/connection (requires server code changes), or,
- even better, issuing the asynchronous call for the client certificate as early as possible and draining as much of the entity body as possible while you wait for the client certificate to arrive.
This requires server code changes for sure. To increase the chances of this working, *all* relevant buffers on the server as well as the client and in between, need to have enough space for the client certificate to not be stuck behind large payloads. So modifying the client to drain buffers (in addition to the server) helps, but is not sufficient, as intermediate buffers along the way may also pose a problem (e.g., bufferbloat). This is not a deterministic method.
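For the first (recommended) server-side option, the binding change looks roughly like the following; the port, thumbprint, and application GUID are placeholders, and, as noted above, IIS-managed bindings should be changed through IIS configuration rather than directly against Http.sys:
netsh http show sslcert ipport=0.0.0.0:443
netsh http delete sslcert ipport=0.0.0.0:443
netsh http add sslcert ipport=0.0.0.0:443 certhash=<server certificate thumbprint> appid={<application GUID>} clientcertnegotiation=enable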
By Modifying the Client side in addition to the Server side:
- (recommended) Use requests such as GET or HEAD to prime the connection so that the server can request the certificate without being blocked waiting to receive the entity body. This implies an extra round trip for the priming request, but if client certificates are involved the application is already making some latency tradeoffs. This is not deterministic, as the immediately following "real" request may use a different connection, but it usually reuses the "primed" connection from the connection pool. This requires client-side changes and requires that the server expose such an endpoint as well.
- Use the status code 100 Continue (requires client to send “Expect: 100-continue” header). This may require both client and server changes to be supported. Furthermore, it is not a deterministic mechanism.
Core Network Stack Features in the Creators Update for Windows 10
By: Praveen Balasubramanian and Daniel Havey
This blog is the sequel to our first Windows Core Networking features announcements post. It describes the second wave of core networking features in the Windows Redstone series. The first wave of features is described here: Announcing: New Transport Advancements in the Anniversary Update for Windows 10 and Windows Server 2016. We encourage the Windows networking enthusiast community to experiment and provide feedback. If you are interested in Windows Transport please follow our Facebook feedback and discussion page: @Windows.10.Data.Transport.
TCP Improvements:
TCP Fast Open (TFO) updates and server side support
In the modern age of popular Web services and e-commerce, latency is a killer when it comes to page responsiveness. We're adding support in TCP for TCP Fast Open (TFO) to cut down on round trips that can severely impact how long it takes for a page to load. Here's how it works: TFO establishes a secure TFO cookie in the first connection using a standard 3-way handshake. Subsequent connections to the same server use the TFO cookie to connect without the 3-way handshake (zero RTT). This means TCP can carry data in the SYN and SYN-ACK.
What we found together with others in the industry is that middleboxes are interfering with such traffic and dropping connections. Together with our large population of Windows enthusiasts (that’s you!), we conducted experiments over the past few months, and tuned our algorithms to avoid usage of this option on networks where improper middlebox behavior is observed. Specifically, we enabled TFO in Edge using a checkbox in about:flags.
To harden against such challenges, Windows automatically detects and disables TFO on connections that traverse through these problematic middleboxes. For our Windows Insider Program community, we enabled TFO in Edge (About:flags) by default for all insider flights in order to get a better understanding of middlebox interference issues as well as find more problems with anti-virus and firewall software. The data helped us improve our fallback algorithm which detects typical middlebox issues. We intend to continue our partnership with our Windows Insider Program (WIP) professionals to improve our fallback algorithm and identify unwanted anti-virus, firewall and middlebox behavior. Retail and non WIP releases will not participate in the experiments. If you operate infrastructure or software components such as middleboxes or packet processing engines that make use of a TCP state machine, please incorporate support for TFO. In the future, the combination of TLS 1.3 and TFO is expected to be more widespread.
The Creators Update also includes a fully functional server side implementation of TFO. The server side implementation also supports a pre-shared key for cases where a server farm is behind a load balancer. The shared key can be set by the following commands (requires elevation):
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpFastopenKey /t REG_BINARY /f /d 0123456789abcdef0123456789abcdef
netsh int tcp reload
We encourage the community to test both client and server side functionality for interop with other operating system network stacks. The subsequent releases of Windows Server will include TFO functionality allowing deployment of IIS and other web servers which can take advantage of reduced connection setup times.
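For reference, the global TFO setting can be inspected and toggled with netsh (requires elevation); treat the exact parameter name as something to verify on your build:
netsh int tcp show global                  # look for the "Fast Open" row
netsh int tcp set global fastopen=enabled
netsh int tcp set global fastopen=disabled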
Experimental Support for the High Speed CUBIC Congestion Control Algorithm
CUBIC is a TCP Congestion Control (CC) algorithm featuring a cubic congestion window (Cwnd) growth function. The Cubic CC is a high-speed TCP variant and uses the amount of time since the last congestion event instead of ACK clocking to advance the Cwnd. In large BDP networks the Cubic algorithm takes advantage of throughput much faster than ACK clocked CC algorithms such as New Reno TCP. There have been reports that CUBIC can cause bufferbloat in networks with unmanaged queues (LTE and ADSL). In the Creators Update, we are introducing a Windows native implementation of CUBIC. We encourage the community to experiment with CUBIC and send us feedback.
The following commands can be used to enable CUBIC globally and to return to the default Compound TCP (requires elevation):
netsh int tcp set supplemental template=internet congestionprovider=cubic
netsh int tcp set supplemental template=internet congestionprovider=compound
*** The Windows implementation of Cubic does not have the “Quiescence bug” that was recently uncovered in the Linux implementation.
Improved Receive Window Autotuning
TCP autotuning logic computes the “receive window” parameter of a TCP connection as described in TCP autotuning logic. High speed and/or long delay connections need this algorithm to achieve good performance characteristics. The takeaway from all this is that using the SO_RCVBUF socket option to specify a static value for the receive buffer is almost universally a bad idea. For those of you who choose to do so anyways please remember that calculating the correct size for TCP send/receive buffers is complex and requires information that applications do not have access to. It is far better to allow the Windows autotuning algorithm to size the buffer for you. We are working to identify such suboptimal usage of SO_RCVBUF/SO_SENDBUF socket options and to convince developers to move away from fixed window values. If you are an app developer and you are using either of these socket options please contact us.
In parallel with our developer education effort, we are improving the autotuning algorithm. Before the Creators Update, the TCP receive window autotuning algorithm depended on correct estimates of the connection's bandwidth and RTT. There are two problems with this method. First, TCP only measures RTT on the sending side, as described in RFC 793; for receive-heavy workloads such as OS updates, the RTT estimate taken at the receive-heavy side could be inaccurate. Second, there could be a feedback loop between altering the receive window (which can change the estimated bandwidth) and then measuring the bandwidth to determine how to alter the receive window.
These two problems caused the receive window to vary constantly over time. We eliminated the unwanted behavior by modifying the algorithm to use a step function to converge on the maximum receive window value for a given connection. The step function algorithm results in a larger receive buffer size; however, the advertised receive window is not backed by non-paged pool memory, and system resources are not used unless data is actually received and queued, so the larger size is fine. Based on experimental results, the new algorithm adapts to the BDP much more quickly than the old algorithm. We encourage users and system administrators to also take note of our earlier post: An Update on Windows TCP AutoTuningLevel. This should clear up misconceptions that autotuning and receive window scaling are bad for performance.
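For reference, the autotuning level can be checked (and, if it was overridden in the past, restored to the default of "normal") with either netsh or the NetTCPIP PowerShell module:
netsh interface tcp show global            # "Receive Window Auto-Tuning Level" should read "normal"
netsh interface tcp set global autotuninglevel=normal
Get-NetTCPSetting | Select-Object SettingName, AutoTuningLevelLocal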
TCP stats API
The Estats API requires elevation and enumerates statistics for all connections. This can be inefficient especially on busy servers with lots of connections. In the Creators Update we are introducing a new API called SIO_TCP_INFO. SIO_TCP_INFO allows developers to query rich information on individual TCP connections using a socket option. The SIO_TCP_INFO API is versioned and we plan to add more statistics over time. In addition, we plan to add SIO_TCP_INFO to .Net NCL and HTTP APIs in subsequent releases.
The MSDN documentation for this API will be up soon and we will add a link here as soon as it is available.
IPv6 improvements
The Windows networking stack has been dual stack, supporting both IPv4 and IPv6 by default, since Windows Vista. Over the Windows 10 releases, we are actively working on improving IPv6 support. The following are some of the advancements in the Creators Update.
RFC 6106 support
The Creators Update includes support for RFC 6106 which allows for DNS configuration through router advertisements (RAs). RDNSS and DNSSL ND options contained in router advertisements are validated and processed as described in the RFC. The implementation supports a max of 3 RDNSS and DNSSL entries each per interface. If there are more than 3 entries available from one or more routers on an interface, then entries with greater lifetime are preferred. In the presence of both DHCPv6 and RA DNS information, Windows gives precedence to DHCPv6 DNS information, in accordance with the RFC.
In Windows, the lifetime processing of RA DNS entries deviates slightly from the RFC. In order to avoid implementing timers to expire DNS entries when their lifetime ends, we rely on the periodic Windows DNS service query interval (15 minutes) to remove expired entries, unless a new RA DNS message is received, in which case the entry is updated immediately. This enhancement eliminates the complexity and overhead of kernel timers while keeping the DNS entries fresh.
The following command can be used to control this feature (requires elevation):
netsh int ipv6 set interface <ifindex> rabaseddnsconfig=<enabled | disabled>
Flow Labels
Before the Creators Update, the FlowLabel field in the IPv6 header was set to 0. Beginning with the Creators Update, outbound TCP and UDP packets over IPv6 have this field set to a hash of the 5-tuple (Src IP, Dst IP, Src Port, Dst Port). Middleboxes can use the FlowLabel field to perform ECMP for native IPv6 traffic without having to parse the transport headers. This makes load balancing and flow classification in IPv6-only datacenters more efficient.
The following command can be used to enable or disable IPv6 flow labels (requires elevation; enabled by default):
netsh int ipv6 set global flowlabel=<enabled | disabled>
ISATAP and 6to4 disabled by default
IPv6 continues to see uptake, and IPv6-only networks are no longer a rarity. ISATAP and 6to4 are IPv6 transition technologies that have been enabled by default in Windows since Vista/Server 2008. As a step towards future deprecation, the Creators Update will have these technologies disabled by default. There are administrator and group policy knobs to re-enable them for specific enterprise deployments, and an upgrade to the Creators Update will honor any administrator- or group policy-configured settings. By disabling these technologies, we aim to increase native IPv6 traffic on the Internet. Teredo is the last transition technology that is expected to remain in active use, because of its ability to perform NAT traversal to enable peer-to-peer communication.
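For deployments that still depend on these transition technologies, the administrative knobs to turn them back on look like the following (group policy equivalents exist as well); the PowerShell cmdlets come from the NetworkTransition module:
netsh interface isatap set state enabled
netsh interface 6to4 set state enabled
# PowerShell equivalents:
Set-NetIsatapConfiguration -State Enabled
Set-Net6to4Configuration -State Enabled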
Improved 464XLAT support
464XLAT was originally designed for cellular only scenarios since mobile operators are some of the first ISPs with IPv6 only networks. However, some apps are not IP-agnostic and still require IPv4 support. Since a major use case for mobile is tethering, 464XLAT should provide IPv4 connectivity to tethered clients as well as to apps running on the mobile device itself. Creators Update adds support for 464XLAT on desktops and tablets too. We also enabled support for TCP Large Send Offload (LSO) over 464XLAT improving throughput and reducing CPU usage.
Multi-homing improvements
Devices with multiple network interfaces are becoming ubiquitous. The trend is especially prevalent on mobile devices, but 3G and LTE connectivity is also becoming common on laptops, hybrids and many other form factors. For the Creators Update we collaborated with the Windows Connection Manager (WCM) team to make the WiFi-to-cellular handover faster and to improve performance when a mobile device is docked with wired Ethernet connectivity and then undocked, causing a failover to WiFi.
Dead Gateway Detection (DGD)
Windows has always had a DGD algorithm that automatically transitions connections over to another gateway when the current gateway is unreachable, but that algorithm was designed for server scenarios. For the Creators Update we improved the DGD algorithm to respond to client scenarios such as switching back and forth between WiFi and 3G or LTE connectivity. DGD signals WCM whenever transport timeouts suggest that the gateway has gone dead. WCM uses this data to decide when to migrate connections over to the cellular interface. DGD also periodically re-probes the network so that WCM can migrate connections back to WiFi. This behavior only occurs if the user has opted in to automatic failover to cellular.
Fast connection teardown
In Windows, TCP connections are preserved for about 20 seconds to allow for fast reconnection in the case of a temporary loss of wired or wireless connectivity. However, in the case of a true disconnection such as docking and undocking this is an unacceptably long delay. Using the Fast Connection Teardown feature WCM can signal the Windows transport layer to instantly tear down TCP connections for a fast transition.
Improved diagnostics using Test-NetConnection
Test-NetConnection (alias tnc) is a built-in PowerShell cmdlet that performs a variety of network diagnostics. In the Creators Update we have enhanced this cmdlet to provide detailed information about both route selection and source address selection.
The following command, when run elevated, will describe the steps used to select a particular route per RFC 6724. This can be particularly useful on multi-homed systems or when there are multiple IP addresses on the system.
Test-NetConnection -ComputerName "www.contoso.com" -ConstrainInterface 5 -DiagnoseRouting -InformationLevel "Detailed"
SDN Troubleshooting: UDP Communication failures and changing the Network Controller Certificate
With this blog post, I wanted to highlight a couple of issues that we have encountered recently with Software Defined Networking (SDN) customer deployments in Windows Server 2016.
Issue #1: UDP communication isn’t working when outbound NAT is configured
A customer had configured outbound NAT access for his virtual network through SCVMM (this internally uses the SDN Software Load Balancer), so that machines in the virtual network could access the Internet. The customer noticed that TCP traffic to the Internet was working fine, but all User Datagram Protocol (UDP) traffic was getting dropped. Moreover, this only happened when the Software Load Balancer MUX was on a different Hyper-V host than the tenant VM.
On deeper analysis, it was revealed that the destination VM was rejecting the packet because the UDP checksum was incorrect. Further investigations revealed a physical NIC issue. The customer was using a physical NIC which was not certified for SDN with Windows Server 2016. The NIC was incorrectly marking the UdpChecksumFailed flag when the inner packet had a valid checksum.
If you are planning to use SDN with Windows Server 2016, please ensure that you use certified NICs. You can verify whether a network adapter is or is not certified by checking the Windows Server Catalog.
Click Software-Defined Data Center (SDDC) Premium to filter the Windows Server Catalog LAN card list.
Issue #2: Changing the Network Controller Server certificate
A customer wanted to change the Network Controller server certificate used for communication with the Northbound clients. He was using self-signed certificates and wanted to move to Certificate Authority-issued certificates. After installing the new certificate on all the Network Controller nodes, he used the Set-NetworkController PowerShell command to point Network Controller to the new certificate.
Although the command succeeded, Network Controller communication with SCVMM stopped working.
This is due to a bug in the product where the certificate binding is only changed on one Network Controller node (where the command was run) and is not updated on the other nodes. We are planning to release a fix soon.
As a workaround, you need to manually change the certificate binding on the other Network Controller nodes. The process is as follows:
- Install the new certificate in the Personal store of the LocalMachine account
- Execute the PowerShell command: Set-NetworkController -ServerCertificate <new cert>
- Retrieve the thumbprint of the new SSL certificate that you want to use with Network Controller
- Double-click the certificate, click Details, and note the value of the Thumbprint field, removing any spaces in between (a PowerShell alternative is shown below)
The following illustration depicts the Thumbprint property of the certificate.
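Alternatively, the thumbprint can be pulled with PowerShell; the Subject filter below is only illustrative and should match the Subject Name of your new Network Controller certificate:
Get-ChildItem Cert:\LocalMachine\My |
    Where-Object { $_.Subject -like "*<NC REST name>*" } |
    Select-Object Subject, Thumbprint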
On each Network Controller node, check the SSL binding by executing the following command from a command prompt:
netsh http show sslcert
If the result shows the thumbprint of the old certificate, change the binding by executing the following commands:
netsh http delete sslcert ipport=0.0.0.0:443
netsh http add sslcert certhash=<thumbprint of the new certificate> appid=<application ID> ipport=0.0.0.0:443 certstorename=MY
You can retrieve the appid from the output parameter Application ID of the netsh http show sslcert command.
If the binding shows the thumbprint of the new certificate, no further action is needed on that node.
Additional Information
Here are a few links to SDN topics to assist with your planning and deployment:
If you plan to assess your needs and environment for deploying SDN, see the topic Plan a Software Defined Network Infrastructure.
If you want to deploy SDN using System Center Virtual Machine Manager, see the topic Set up a Software Defined Network (SDN) infrastructure in the VMM fabric.
If you have any questions/feedback about SDN, send an email to sdn_feedback@microsoft.com.
Available to Windows 10 Insiders Today: Access to published container ports via “localhost”/127.0.0.1
Until now, a lingering limitation on Windows 10 has prevented access to published ports for containers via "localhost" or 127.0.0.1 (a.k.a. loopback). What this meant was that if you had, say, a container running as an IIS Web server and exposing content through port 80 of your host machine, you wouldn't be able to access that content locally using "http://localhost/" or even "http://127.0.0.1/".
But at last, this limitation has been removed! Beginning with Build 17025, available today to Windows Insiders, it’s now possible on Windows 10 to access your containers via their locally published ports using “localhost” or 127.0.0.1.
Although simple, this is a little tedious to visualize, so let's lay it out with an example. The image below shows a container running on its host, with content published to host port 8080. In the past, to access this content, developers on Windows 10 have had to use either their host's external IP address or the internal IP address of their container – so, in this example, "http://10.137.196.122:8080/" or "http://172.18.23.136:8080/". Now, however, we have added the plumbing on Windows 10 to enable "http://localhost:8080/" and "http://127.0.0.1:8080/" as additional ways to access your published containers locally.
Ready to try this out? This functionality is included in the latest Windows 10 Insider Preview Build 17025. If you’re already a Windows Insider running Build 17025, you have this capability now! If not, click here to learn more about the Windows Insider program and sign up to start receiving Windows 10 Preview Builds.
The Evolution of RDMA in Windows: now extended to Hyper-V Guests
This post written by Don Stanwyck, Senior Program Manager, Windows Core Networking
Remote DMA (RDMA) is an incredible technology that allows networked hosts to exchange information with virtually no CPU overhead and with extremely little latency in the end-system. Microsoft has been shipping support for RDMA in Windows Server since Windows Server 2012 and in Windows 10 (some SKUs) since its first release. With the release of Windows Server, version 1709, Windows Server supports RDMA in the guest. RDMA is presented over the SR-IOV path, i.e., with direct hardware access from the guest to the RDMA engine in the NIC hardware, and with essentially the same latency (and low CPU utilization) as seen in the host.
This week we published a how-to guide (https://gallery.technet.microsoft.com/RDMA-configuration-425bcdf2) on deploying RDMA on native hosts, on virtual NICs in the host partition (Converged NIC), and in Hyper-V guests. This guide is intended to help reduce the amount of time our customers spend trying to get their RDMA networks deployed and working.
As many of my readers are aware, in Windows Server 2012 we shipped the first version of RDMA on Windows. It supported only native interfaces, i.e., direct binding of the SMB protocol to the RDMA capabilities offered by the physical NIC. Today we refer to that mode of operation as Network Direct Kernel Provider Interface (NDKPI) Mode 1, or more simply, Native RDMA.
SMB-Direct (SMB over RDMA) was popular, but if a customer wanted RDMA on a Hyper-V host they had to set up separate NICs for RDMA and for Hyper-V. That got expensive.
With Windows Server 2016 came the solution: Converged NIC operation. Now a customer who wanted to use RDMA and Hyper-V at the same time could do so on the same NICs – and even have them in a team for bandwidth aggregation and failover protection. The ability to use a host vNIC for both host TCP traffic and RDMA traffic and share the physical NIC with Guest traffic is called NDKPI Mode 2.
New technologies were built on the Converged NIC. Storage Spaces Direct (S2D), for example, delivered the ability to use RDMA for low latency storage across all the hosts in a rack.
That wasn’t enough. Customers told us they wanted RDMA access from within VMs. They wanted the same low latency, low CPU utilization path that the host gets from using RDMA to be available from inside the guest. We heard them.
Windows Server 1709 supports RDMA in the guest. RDMA is presented over the SR-IOV path, i.e., with direct hardware access from the guest to the RDMA engine in the NIC hardware. (This is NDKPI Mode 3.) This means that the latency between a guest and the network is essentially the same as between a native host and the network. Today this is only available on Windows Server 1709 with guests that are also Windows Server 1709. Watch for support in other guests to be announced in upcoming releases.
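At a high level, enabling guest RDMA amounts to giving the VM's network adapter an SR-IOV virtual function and an RDMA weight. A hedged sketch of the Hyper-V cmdlets involved is below (the VM name is a placeholder, the virtual switch must be SR-IOV capable, and the how-to guide linked above is the authoritative reference):
# On a Windows Server, version 1709 host with an SR-IOV capable, RDMA capable NIC:
Set-VMNetworkAdapter     -VMName "VM01" -IovWeight 100    # assign an SR-IOV virtual function to the vNIC
Set-VMNetworkAdapterRdma -VMName "VM01" -RdmaWeight 100   # expose RDMA over that path to the guest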
This means that trusted applications in guests can now use any RDMA application, e.g., SMB Direct, S2D, or even 3rd party technologies that are written to our kernel RDMA interface, to communicate using RDMA to any other network entity.
Yes, there is that word “trusted” in the previous statement. What does that mean? It means that for today, just like with any other SR-IOV connected VM, the Hyper-V switch can’t apply ACLs, QoS policies, etc., so the VM may do some things that could cause some level of discomfort for other guests or even the host. For example, the VM may attempt to transmit a large quantity of data that would compete with the other traffic from the host (including TCP/IP traffic from non-SR-IOV guests).
So how can that be managed? There are two answers to that question, one present, and one future. In the present Windows allows the system administrator to affinitize VMs to specific physical NICs, so a concerned administrator could affinitize the VM with RDMA to a separate physical NIC from the other guests in the system (the Switch Embedded Team can support up to 8 physical NICs). In the future, at a time yet to be announced, Windows Server expects to provide bandwidth management (reservations and limits) of SR-IOV-connected VMs for both their RDMA and non-RDMA traffic, and enforcement of ACLs programmed by the host administrator and applying to SR-IOV traffic (IP-based and RDMA). Our hardware partners are busy implementing the new interfaces that support these capabilities.
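The affinity mentioned above is set per VM network adapter; a brief example mapping a guest's vNIC to one specific member of a Switch Embedded Team (all names are placeholders):
Set-VMNetworkAdapterTeamMapping -VMName "VM01" -VMNetworkAdapterName "Network Adapter" -PhysicalNetAdapterName "NIC1"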
What scenarios might want to use Guest RDMA today? There are several that come to mind, and they all share the following characteristics:
- They want low-latency access to network storage;
- They don’t want to waste CPU overhead on storage networking; and
- They are using SMB or one of the 3rd party solutions that runs on Windows Kernel RDMA.
So whether you are using SMB storage directly from the guest, or you are running an application that uses SMB (e.g., SQL) in a guest and want faster storage access, or you are using a 3rd party NVMe or other RDMA-based technology, you can use them with our Guest RDMA capability.
Finally, while High Performance Computing (HPC) applications rarely run in Guest OSs, some of our hardware partners are exposing the Network Direct Service Provider Interface (NDSPI), Microsoft’s user-space RDMA interface, in guests as well. So if your hardware vendor supports NDSPI (MPI), you can use that from a guest as well.
RDMA and DCB
RDMA is a great technology that uses very little CPU and has very low latency. Some RDMA technologies rely heavily on Data Center Bridging (DCB). DCB has proven to be difficult for many customers to deploy successfully. As a result, the view of RDMA as a technology has been affected by the experiences customers have had with DCB – and that's sad. The product teams at Microsoft are starting to say more clearly what we've said in quieter terms in the past:
Microsoft Recommendation: While the Microsoft RDMA interface is RDMA-technology agnostic, in our experience with customers and partners we find that RoCE/RoCEv2 installations are difficult to get configured correctly and are problematic at any scale above a single rack. If you intend to deploy RoCE/RoCEv2, you should a) have a small scale (single rack) installation, and b) have an expert network administrator who is intimately familiar with Data Center Bridging (DCB), especially the Enhanced Transmission Service (ETS) and Priority Flow Control (PFC) components of DCB. If you are deploying in any other context iWarp is the safer alternative. iWarp does not require any configuration of DCB on network hosts or network switches and can operate over the same distances as any other TCP connection. RoCE, even when enhanced with Explicit Congestion Notification (ECN) detection, requires network configuration to configure DCB/ETS/PFC and/or ECN especially if the scale of deployment exceeds a single rack. Tuning of these settings, i.e., the settings required to make DCB and/or ECN work, is an art not mastered by every network engineer.
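For those who do choose RoCE, the host-side DCB pieces are configured with the NetQos cmdlets. The sketch below tags SMB Direct (port 445) with priority 3, enables PFC only for that priority, and reserves bandwidth with ETS; treat the priority value, bandwidth percentage, and NIC name as placeholders that must match your switch configuration:
# Tag SMB Direct traffic with 802.1p priority 3
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
# Enable Priority Flow Control for priority 3 only
Enable-NetQosFlowControl  -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7
# Reserve bandwidth for the SMB traffic class via ETS
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
# Apply DCB on the physical RDMA NIC and don't let the switch overwrite local settings via DCBX
Enable-NetAdapterQos -Name "NIC1"
Set-NetQosDcbxSetting -Willing $false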
RoCE vendors have been very actively working to reduce the complexity associated with RoCE deployments. See the list of resources (below) for more information about vendor specific solutions. Check with your NIC vendor for their recommended tools and deployment guidance.
Additional resources:
- Jose Barreto’s 100Gb/s RDMA demo
- Claus Jorgensen’s “S2D on Cavium 41000” Blog (iWarp – RoCE comparison)
- Microsoft’s sample switch DCB configurations for RoCE
- Mellanox’s RDMA/RoCE Community page
- Your vendor’s User Guides and Release Notes for your specific network adapter
Windows Container and Virtual Network Deep Dive Mini-Blog Series coming…
Just wanted to give you a quick heads up that we are going to begin a Mini-Blog Series this Monday (12/11). We will be covering Container networking and Windows network virtualization (WNV). While there are some differences between Container networking and other technologies that use WNV, like Hyper-V and LBFO (NIC teaming), the underlying method employed to move network data inside of Windows is the same. This led to an idea: make this a multipurpose series of documentation.
We've finalized the blogs; here is the list as of this post. I know you will enjoy reading:
- WNV Deep Dive Part 1 – Introduction to Containers and Windows Network Virtualization
- WNV Deep Dive Part 2 – How WNV works
- WNV Deep Dive Part 3 – Capturing and Reading Virtualized Network Traffic
- WNV Deep Dive Part 4 – Looking at LBFO and Hyper-V traffic
- WNV Deep Dive Part 5 – Container Networking: The default NAT network
- WNV Deep Dive Part 6 – Container Networking: Transparent and L2bridge Networks
WNV Deep Dive Part 1 – Introduction to Containers and Windows Network Virtualization
By James Kehr, Networking Support Escalation Engineer
When I started writing this article it was going to be about Container networking, and nothing but Container networking. As the article progressed I realized there was a lot of useful information that applies to all of Windows network virtualization (WNV). Those who program shouldn't be surprised. There's no need to reinvent code when there's already perfectly good code that can do it.
This article goes deeper into how network virtualization works than many articles on the subject. Those who are idly curious will likely fall asleep somewhere in the middle of this introductory piece. This series of articles is designed more for Windows and network admins who use, or are planning to use, Windows network virtualization technologies and want a better understanding of how it works, how to capture and read the virtualization network data, and how to apply the knowledge to make life easier.
Don’t be afraid of big, bad PowerShell.
There are two new technologies in Windows Server 2016 that are a huge departure for Windows administrators: Containers and Nano server. Why? Because there is no UI. But wait, some may say, there’s no UI in Server Core! Yes, but you are only partly correct. There is some UI in Core, just not a lot. There is zero UI in Nano Server and Containers. As in, not even a dialog box, pop-up, manager, or menu in sight.
While Linux and Unix admins will likely yawn at the lack of UI, a lot of Windows administrators are less than thrilled. We like our menus and our managers, thank you very much. People paying attention may have already noticed this shift to a more command line driven Windows than in the past; or more PowerShell driven, to be accurate.
Working in Containers, Core, and Nano server is going to require a paradigm shift for Windows admins. There’s no easy way to say this: it’s time to embrace the command line…er… PowerShell.
What is a Container?
There are a lot of articles out there explaining what a Container is. The link below has the Windows Containers documentation, for those who are curious about the official stuff.
https://docs.microsoft.com/en-us/virtualization/windowscontainers/index
The word I use to describe Containers is pseudo-virtualization. A virtual machine is an operating system running on synthetic, or virtualized, hardware. It has its own file system, its own resource pool, and its own synthetic hardware. The guest knows it's a virtual machine, but acts like it's physical hardware separate from the physical host.
Containers are kind of like that… except Containers think they are the host. Logon to a Container via PowerShell remoting, pull up the hardware information, and you will see a virtual copy of the host system. The biggest difference deals with the OS kernel. A virtual machine has a completely independent kernel from the host operating system. A Container shares the host’s kernel. The applications under the Container think they are running on the host, and act like they are running on the host, and they technically are, but container applications are isolated from the host.
Think of Containers as living in a world between application virtualization and virtual machines. The purpose of a Container is to make it easy to develop and deploy applications, without truly virtualizing them in the traditional App-V or Hyper-V style.
The exception to this rule is the Hyper-V Container. This Container style lives on a highly optimized virtual machine, and shares the kernel with the VM instead of the host. This adds more isolation between application and host, providing more security, while not compromising the ease of deployment and development that Containers offer.
Making your head hurt
All this lengthy explanation serves a purpose. I’m warming up your brain before I make it hurt.
The primary goal of my investigation into Containers was to see if there was anything new with Container networking. So, I built a Container, remoted in, and tried a network trace. Except the network trace failed. Turns out network capture tools don’t work inside a Container. Not even the tools built into Windows.
Then I took a packet capture on the host to see where the Container traffic interfaced with the host and… I saw the Container traffic. Not the hand off, not the interface, the actual Container traffic including the network stack events.
Remember that part where I said the Container and the host share a kernel? Part of that kernel sharing means they share the same network stack (kind of), the same hardware (sort of), and the same firewall (for real).
Take a minute to think about that.
To see the network traffic in a Container you run the packet capture on the host.
Does your head hurt yet? Not yet? Let me help you out.
Not only do you capture the network data on the host but you can’t use traditional tools like Wireshark or Network Monitor. Those will only show the traffic on the host NIC, and the NAT NIC, but nothing beyond. In order to capture Container network traffic you need to use an NDIS-based packet capture like Message Analyzer or NetEventPacketCapture in PowerShell.
From there you need to understand how to process, read, and follow the traffic through Windows and to the Container network stack – which is actually the host network stack, kind of.
And that should complete the brain bleed.
Part 2 covers how network virtualization works inside of Windows. This is possibly the most critical part of the series, as the data will likely not make sense without understanding part 2.
-James
WNV Deep Dive Part 2 – How WNV works
By James Kehr, Networking Support Escalation Engineer
To understand Windows Network Virtualization (WNV) capture data you first need to understand what you’re looking at. Which is hard when you may not understand what goes on inside of Windows. This article will cover the basics of what goes on when WNV is in use, in a manner that will, hopefully, be easy to understand.
ETW
Starting with Windows 7 and Windows Server 2008 R2, an ETW (Event Tracing for Windows) provider was added that allows the network frames, a.k.a. packets, to be captured within Windows without installing a packet capture tool like Message Analyzer, Wireshark, or Network Monitor. This provider is called Microsoft-Windows-NDIS-PacketCapture. The ETWs for this provider can be written to an ETL file (Event Trace Log) and analyzed using a tool that can parse ETL files. Think of ETW logging as a low-level Event Viewer on steroids.
There are several ETW providers that are used for troubleshooting Windows networking. Some ETWs can only be viewed internally at Microsoft, while others can be viewed by the public. All the basic networking ETWs can be publicly parsed and read. You just need to learn how.
What is an NBL?
An NBL is a Net Buffer List. Think of an NBL as a packet inside of Windows. Because that’s what it is. NBLs are used for every virtualized network data transaction inside of Windows. Every. Single. One.
Container networking uses NBLs to route and deliver data from the wire to a Container. And from the Container back to the wire. Hyper-V and LBFO (NIC teaming) use NBLs to move data around inside of Windows, too.
These NBL “packets” virtually move across virtual switches and virtual NICs inside of Windows like an Ethernet frame/packet moves around a physical network. Except, instead of moving the actual NBL it’s the reference to the NBL that is passed along the virtual wire.
Sending data between two computers on a switch involves a packet being sent by a NIC on Server A to a switch, then to Server B, where the NIC passes the packet into the operating system. The exact same concept applies to WNV. A packet sent from VM A is passed to the vmSwitch (virtual machine switch) via an NBL. The NBL is passed across the vmSwitch and delivered to the vmNIC on VM B, where the packet is received by the local network stack.
Those who want to see the network data move around inside of Windows just need to look at the NBLs. This is handy when troubleshooting problems with virtualized networking. It allows admins to see where potential problems are and, unlike packets, look for errors returned from within Windows by the ETW providers used by virtual networking.
Exceptions time! SR-IOV bypasses the entire Hyper-V network virtualization stack, or the parent partition for those who love technical terms. There are a few other technologies, like RDMA and Fibre Channel, which also bypass the Windows and/or Hyper-V network stacks. You will not see packets or NBLs in Windows when network stack offloading or bypass technologies are in use.
Compartments
There is a special type of virtual networking I have not mentioned yet: Hyper-V Network Virtualization (HNV), also known as Software Defined Networking (SDN). HNV/SDN adds something called a Compartment to virtual networking. The best way to think of a compartment is a VLAN, but, again, inside of Windows.
Compartments are the magic that allow HNV to have two sets of VMs, from two different companies, occupy the same Hyper-V host, yet use the same subnet. When HNV is in use Contoso can use the 10.0.0.0/8 subnet, Fabrikam can use 10.0.0.0/8, and the subnets will never overlap, collide, or cause routing issues. When configured properly.
I won’t go into too much detail about how all this works, HNV and SDN networking is a topic all its own, but it’s important to know because Containers on Windows can use Compartments, too. Depending on the configuration. Each Docker, or Container, network that is created gets a Compartment ID. The Compartment ID is what Windows uses to securely get the correct network data to the correct resource.
The compartment details are exposed on the host by using one of two commands. The trusty old “ipconfig” command can be used with a newer parameter, /allcompartments. Get-NetIpAddress with the IncludeAllCompartments parameter can be used on the PowerShell front.
ipconfig /allcompartments
=============================================================
Network Information for Compartment 2
=============================================================
Ethernet adapter vEthernet (Container NIC c337484d):
Connection-specific DNS Suffix . : Contoso.com
Link-local IPv6 Address . . . . . : fe80::297c:2597:40d2:59ab%16
IPv4 Address. . . . . . . . . . . : 172.23.181.37
Subnet Mask . . . . . . . . . . . : 255.255.240.0
Default Gateway . . . . . . . . . : 172.23.176.1
Get-NetIPAddress -IncludeAllCompartments | where InterfaceAlias -match "container"
IPAddress : 172.23.181.37
InterfaceIndex : 16
InterfaceAlias : vEthernet (Container NIC c337484d)
AddressFamily : IPv4
Type : Unicast
PrefixLength : 20
PrefixOrigin : Manual
SuffixOrigin : Manual
AddressState : Preferred
ValidLifetime : Infinite ([TimeSpan]::MaxValue)
PreferredLifetime : Infinite ([TimeSpan]::MaxValue)
SkipAsSource : False
PolicyStore : ActiveStore
A Summary
So far, we’ve learned that packets on the wire are converted to NBLs inside of Windows. The NBLs are then passed between virtual components inside of Windows, behaving like packets on the wire. The NBLs are then consumed by the endpoint’s network stack, such as the guest VM, the host, or a Container (also the host, kind of). Compartments add isolation, security, and, potentially, the ability for multiple virtual networks to exist within the same host system.
ETW providers can be used to capture packets and events inside of Windows. These ETW events are stored in an ETL (Event Trace Log) file that can be used to analyze network virtualization within Windows.
The next part covers capturing and reading WNV data, which includes Container networking. This is also handy if you want to learn how to read Hyper-V and LBFO networking traffic. That makes part 3 a triple win!
-James
WNV Deep Dive Part 3 – Capturing and Reading Virtualized Network Traffic
By James Kehr, Networking Support Escalation Engineer
There are three primary tools used to capture virtual network traffic in Windows: netsh trace, the PowerShell NetEventPacketCapture module, and Message Analyzer. I won’t focus much on Message Analyzer captures here. Most server admins don’t like installing tools, so I will focus on the built-in tools for capturing. MA will be discussed as an analysis tool, though it is capable of capturing NBLs and ETW events. There are a number of online tutorials for MA should you prefer a graphical capture tool.
netsh trace
“netsh trace” was added to Windows in the 7/2008 R2 generation. This is a single command start, single command stop option. It’s great for simple captures and scenario-based captures, but becomes cumbersome with complex captures. It is also being deprecated in favor of the PowerShell NetEventPacketCapture module in new versions of Windows. There is a rare chance that the packets will not be captured with a netsh trace in Windows Server 2012 R2+/8.1+. When this happens, you must use an alternate method.
A netsh trace scenario is a pre-packaged set of ETWs. The most common scenario used by Microsoft support is the NetConnection scenario. It is the jack-of-all-trades scenario, using 45 ETW providers and covering everything from the network stack to the various network subsystems, such as wireless, wired, WWAN, 802.1x authentication, firewall, and much more. A sample NetConnection command looks like this:
netsh trace start capture=yes scenario=NetConnection
There are some problems with netsh trace, outside of those previously discussed. Scenarios do not work on Server Core, which means each provider must be manually added to the netsh trace command. This makes the command large and unwieldy to adjust or troubleshoot should something go wrong. Nano Server doesn’t have netsh trace at all. And then there is stop time.
Every time netsh trace is stopped with “netsh trace stop” it will generate a report and compress the data using the CAB format. This is time consuming (several minutes) and can hammer system resources.
I’m not saying netsh trace is bad, because it’s not. Netsh trace is the bread and butter of Microsoft networking support. There are just some caveats to be aware of when using it. I could write an entire article on the intricacies of capturing network data on Windows…some other day perhaps.
PowerShell NetEventPacketCapture
PowerShell is a bit more complex to learn, but is more flexible, stops immediately, and can be better integrated into scripts. The PowerShell method can be used on Server Core and is the only packet capture tool supported on Nano Server, one of the primary Windows Container operating systems. This option is only available on Windows Server 2012 R2+/8.1+. Installing PowerShell 4 or 5 on older versions of Windows will not add the NetEventPacketCapture module, as that is an OS-specific module, not a PowerShell-specific module.
The NetEventPacketCapture (NEPC) module is built to do packet captures “The PowerShell Way”. Rather than grouping up everything into one large command, like netsh trace, it’s spread out and can be controlled by variables. There are no scenarios, nor is there an automatically generated report, like netsh trace has. NEPC is built for creating a script file that collects what you need. The script can then be stored in a repository or share for use across your environment, rather than memorizing a command.
Having used both for several years in a support capacity, I can honestly say both have their benefits, but I prefer “The PowerShell Way” when I am troubleshooting a version of Windows that has NEPC. It is, in my opinion, easier to send a script and six basic steps to execute it, knowing I will get all the data I need every time, than to send a list of commands and more complex instructions. Even then, I still use netsh trace frequently because it can be so simple to use.
Capturing data
The steps below cover the very basics of collecting data for Windows Network Virtualization (WNV), using both netsh trace and PowerShell NEPC. Both tools are far more flexible and complex, when needed, than what I will show here.
Basic instructions for capturing virtual network data with netsh trace
- Open an elevated Command Prompt (Run as administrator) console.
- Run this command to start the trace.
netsh trace start capture=yes overwrite=yes maxsize=1024 tracefile=c:\%computername%_vNetTrace.etl provider="Microsoft-Windows-Hyper-V-VmSwitch" keywords=0xffffffffffffffff level=0xff capturetype=both
- Reproduce the issue, or perform the operation you wish to investigate.
- Stop the trace with this command.
netsh trace stop
- The ETL file and CAB report will be stored on the root of C:.
Basic instructions for capturing virtual network data with PowerShell
Please note that this code contains my own special flair. The most basic capture code can be much shorter than this, and it can be much larger when applying more stringent coding practices.
- Open an elevated PowerShell (Run as administrator) console.
- Execute these commands to start tracing.
# Basic Windows Virtual Network host capture

# the primary WNV ETW provider.
[array]$providerList = 'Microsoft-Windows-Hyper-V-VmSwitch'

# create the capture session
New-NetEventSession -Name WNV_Trace -LocalFilePath "C:\$env:computername`_vNetTrace.etl" -MaxFileSize 1024

# add the packet capture provider
Add-NetEventPacketCaptureProvider -SessionName WNV_Trace -TruncationLength 1500

# add providers to the trace
foreach ($provider in $providerList) {
    Write-Host "Adding provider $provider"
    try {
        Add-NetEventProvider -SessionName WNV_Trace -Name $provider -Level $([byte]0x5) -EA Stop
    } catch {
        Write-Host "Could not add provider $provider"
    }
}

# start the trace
Start-NetEventSession WNV_Trace
- Reproduce the issue, or perform the operation you wish to investigate.
- Stop the trace with these commands.
# stop the trace
Stop-NetEventSession WNV_Trace

# remove the session
Remove-NetEventSession WNV_Trace
Cautionary Side Note:
ETL files are a bit finicky to work with, compared to traditional packet capture file types. Here are some caveats and information to consider when working with ETL files. Yes, there are a lot.
- ETLs can only be collected by a member of the Administrators group.
- Command Prompt, PowerShell, and Message Analyzer must be “Run as administrator” to collect data.
- There can only be one ETL packet capture at a time, per instance of Windows, regardless of tool used.
- You can collect an NDIS packet capture on a Hyper-V host and any VM guest at the same time, as those are separate instances of Windows.
- The capture must be stopped in the same user context that started the capture. If Bob starts the capture, Fred can’t stop it.
- Writing ETLs requires faster storage as the network speed increases.
- The buffering levels are not as large as other tools, and thus ETL files are prone to losing data when written to slow storage.
- This is not an ETL only issue, but it can be more prominent with ETL collections.
- Writing to an SSD or enterprise HDD [array] is needed for multiple-Gb packet captures.
- The only sure-fire way to accurately capture on a saturated 10Gb+ connection is by using a RAM disk or NVMe-grade solid state storage. No exaggeration.
- Never capture to a mapped network drive or other network storage. Though that’s always good practice regardless of packet capture tool.
- Packets are collected from all network interfaces, by default.
- This will cause packet duplication on systems using virtual networking, as each packet is collected on each physical and virtual NIC, and sometimes the vmSwitch, as it moves through Windows. Capture on a single interface to see only a single set of packets.
- netsh trace: CaptureInterface=<interface name or GUID> Enables packet capture for the specified interface name or GUID. Use ‘netsh trace show interfaces’ to list available interfaces. Must be placed after capture=yes.
- PowerShell: https://technet.microsoft.com/en-us/library/dn283343.aspx
- Message Analyzer: https://technet.microsoft.com/en-us/library/dn386834.aspx
- ETLs can only be parsed by Microsoft tools: Message Analyzer, Network Monitor (limited support), and parsed to text file by “netsh trace convert”.
- Wireshark and other third-party tools will not open an ETL (as of this writing).
- A reasonably fast computer can parse an ETL file up to ~2GB in size.
- This seems like a lot of data, but it takes a saturated 10Gb network connection about 2-3 seconds to fill a 2GB file. ~20-30 seconds on a single, full 1Gb connection.
- Files larger than 2GB can be parsed with Message Analyzer and “netsh trace convert”, but parsing and analyzing files this size can be extremely slow. Be patient.
- Truncate (snaplen in tcpdump terms) packets to fit more packets into a smaller file. This works if you don’t care about the payload, and just need packet headers. A quick example of this, and of converting an ETL to text, follows this list.
- ETL tracing has limited filtering capabilities. Use Message Analyzer if more complex filtering is needed.
- netsh trace show CaptureFilterHelp
- https://technet.microsoft.com/en-us/library/dn268510.aspx
- Message Analyzer can usually export/convert an ETL to a CAP file that Wireshark can read.
- Open and apply a filter to the trace first.
- Any non-packet ETLs in the output will prevent Wireshark from opening the file.
- Use “Ethernet” as the filter if you want all the packets and nothing else.
- Save [As]
- Select the “Filtered Messages for Analysis Grid view” option
- Export
- Pick the location and filename
- Save
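For reference, here is a minimal sketch of the two techniques mentioned in the list above: truncating packets during an NEPC capture, and converting an ETL to a plain text dump with netsh. The session name matches the Part 2 script, and the file paths are illustrative.
# truncate each packet to 128 bytes so only the headers are kept, shrinking the ETL
Add-NetEventPacketCaptureProvider -SessionName WNV_Trace -TruncationLength 128

# convert an existing ETL to a plain text dump (run from an elevated prompt)
netsh trace convert input=C:\HOST_vNetTrace.etl output=C:\HOST_vNetTrace.txt dump=txt overwrite=yes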
Looking at a Container trace
Here’s an example captured from a Container host, with Docker networking set to Transparent mode. This is how the TCP SYN, the first frame in all TCP connections, passes through the Windows host to the Container. Message Analyzer 1.4 was used to process the ETL file.
Notes:
- The traffic below is slightly modified. The vmNIC and vmSwitch GUIDs are included in the actual output, but these make it difficult to read in an article format.
- You need Hyper-V installed to see the vmSwitch events. The ETW parsing details are installed with the feature or role. Windows 10 or Server 2016 is highly recommended for the best parsing experience.
Flags: ……S., SrcPort: 52891, DstPort: HTTP(80), Length: 0, Seq Range: 1717747265 – 1717747266, Ack: 0, Win: 8192(negotiating scale factor: 8)
–> The first line is the TCP SYN as it arrives on the physical network adapter of the Container host. The destination port is 80, which is the HTTP port.
NBL received from Nic /DEVICE/ (Friendly Name: Microsoft Hyper-V Network Adapter) in switch (Friendly Name: Layered Ethernet)
–-> The second line shows the Microsoft_Windows_Hyper_V_VmSwitch module. This is how Message Analyzer parses the Microsoft-Windows-Hyper-V-VmSwitch ETW provider. This provider is what captures the NBL reference as it passes through the Hyper-V vmSwitch.
–> The NBL event text shows that the NBL was received by the Microsoft Hyper-V Network Adapter, which is the host’s management network adapter in this case. The network adapter names are sometimes the same in parser output. When this happens use the outputs of Get-VmNetworkAdapter and Get-NetAdapter on the host, in PowerShell, to see which NIC is used by comparing the vmNIC’s GUID of the PowerShell output to the adapter GUID in the trace.
Flags: ……S., SrcPort: 52891, DstPort: HTTP(80), Length: 0, Seq Range: 1717747265 – 1717747266, Ack: 0, Win: 8192(negotiating scale factor: 8)
–-> This is the packet on the vmSwitch. The packets will not always show up on the vmSwitch in a trace. It depends on the version of Windows and technology used. The NBLs may be the only thing you see, so don’t worry if your results don’t directly match this.
NBL routed from Nic /DEVICE/ (Friendly Name: Microsoft Hyper-V Network Adapter) to Nic (Friendly Name: Container NIC fb5c285c) on switch (Friendly Name: Layered Ethernet)
–> The next NBL shows the packet being routed across the vmSwitch to the Container vNIC.
Flags: ……S., SrcPort: 52891, DstPort: HTTP(80), Length: 0, Seq Range: 1717747265 – 1717747266, Ack: 0, Win: 8192(negotiating scale factor: 8)
–> The packet captured on the vmSwitch again.
NBL delivered to Nic (Friendly Name: Container NIC fb5c285c) in switch (Friendly Name: Layered Ethernet)
–-> The NBL is successfully delivered to the Container NIC.
Flags: ……S., SrcPort: 52891, DstPort: HTTP(80), Length: 0, Seq Range: 1717747265 – 1717747266, Ack: 0, Win: 8192(negotiating scale factor: 8)
–-> This is where the packet finally arrives at the Container.
The interesting bit here is that the TCP connection shows up after the packet arrives on the Container. Which shouldn’t seem odd. Except that the capture was taken on the host. Maybe it’s just me.
Next up is a smashing bit about LBFO and Hyper-V traffic. It’s filled with fabulous pictures, hastily drawn in Paint, to provide a little visual context to the flow of data. This section also covers the basics of troubleshooting WNV traffic.
-James
WNV Deep Dive Part 4 – Looking at LBFO and Hyper-V traffic
By James Kehr, Networking Support Escalation Engineer
We’re going to look at the two other basic types of WNV traffic in part 4: LBFO (NIC teaming) and Hyper-V. I’ll be skipping over Hyper-V Network Virtualization and Software Defined Networks. The new Switch Embedded Teaming technology will also not be covered. Those are topics for a different day.
NBLs are still in use as “packets inside of Windows”, but there are a couple of small variations in capturing and looking at the traffic.
Network LBFO
Network LBFO (Load Balancing and Fail Over), more commonly known as NIC teaming, is currently the only network virtualization technique in Windows that does not use a vmSwitch. The basic traffic flow for LBFO looks something like this.
There are some notes I want to point out about LBFO. These are some common answers, misconceptions, and issues I see people running into when working with LBFO.
- You can create an LBFO team with a single physical NIC. There will be no load balancing, nor fail over with a “single NIC LBFO adapter,” but you can do it.
- Why would you ever use a single NIC LBFO adapter? Testing and VLANing without installing additional software. More on VLANing coming up.
- There are two parts to an LBFO team setup: switch type and load balancing mode. This is a topic unto itself, so I will refer you to a couple of official guides if you want to learn more; there is also a short example after this list.
- You can create multiple sub-interfaces, one for each VLAN, off a single tNIC. This is similar to VLAN tagging in the Linux world.
- The primary tNIC must always use the trunk or default VLAN. Use sub-interfaces for other VLANs.
- The actual VLAN header is always added to the packet by the physical NIC in Windows.
- The VLAN information is passed to the pNIC driver via the NBL OOB (Out Of Band) blob.
- Never attach a VLAN to a pNIC or tNIC, including LBFO sub-interfaces, when it is attached to a Hyper-V vmSwitch.
- Use multiple host management adapters (vmNICs) if the host system needs to use VLANs, and the tNIC is attached to a vmSwitch.
- LBFO is only available in Windows Server SKUs. The LBFO PowerShell cmdlets would not throw errors in older versions of Windows 10, but LBFO never actually worked. There are a couple of third-party NIC OEMs who make and support teaming software that works in Windows client SKUs. There is no Microsoft supported Windows client NIC teaming.
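To put the first few bullets into commands, here is a rough sketch. The adapter and team names are made up, and the settings shown are just one common combination.
# create a team from two physical NICs with switch independent teaming and dynamic load balancing
New-NetLbfoTeam -Name "HostTeam" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic

# add a sub-interface (secondary tNIC) for VLAN 20; the primary tNIC stays on the default/trunk VLAN
Add-NetLbfoTeamNic -Team "HostTeam" -VlanID 20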
LBFO does not use a vmSwitch to pass NBLs within Windows. The LBFO system driver handles the movement of LBFO NBLs within Windows, and these events are currently not publicly viewable. So how do you know whether the traffic works?
Packet capture tools can collect data from all network adapters on the system. In the case of LBFO that means the packet can be captured at the physical NIC (pNIC) and then at the LBFO teamed adapter (tNIC). Seeing packets at the pNIC shows that the data arrived from the wire, and was not discarded somewhere on the network. The packet arriving at the tNIC shows that the NBL made it to the LBFO virtual adapter.
This is what a psping on TCP port 445 looks like with LBFO configured.
The TCP ports and IP addresses are the same, showing it’s the same TCP/IP connection, but there are two of every packet. Two SYNs, two ACK-SYNs, two ACKs, two ACK-FINs, two RSTs, and so on. This is not a glitch; the data was simply captured on both the pNIC and the tNIC. This can throw off a packet analysis if you are not expecting it, as most packet analysis tools make it look like there is a massive packet retransmission problem when there is actually none.
Hyper-V Traffic
Hyper-V traffic is a lot like Container traffic. Or, more accurately, Container traffic is a lot like Hyper-V traffic, since Hyper-V came first. Hyper-V traffic is just NBLs being passed around between NICs, both physical and virtual, by vmSwitches.
When LBFO is involved you can add the LBFO adapter between the physical NIC and the vmSwitch. Unless, of course, you are using Windows Server 2016 and use Switch Embedded Teaming (SET). But that’s a confusion for a different blog.
The immediate security concern is that when capturing data on the Hyper-V host you can see all the traffic going to every VM. There may also be duplication in the trace when LBFO is used, and when packets are picked up on the vmSwitch. Duplication can be removed if you capture on a single adapter on Server 2012 R2. Windows 10 and Server 2016 seem to have a mechanism whereby duplication is sometimes not captured using a basic capture. I haven’t quite worked out how the anti-duplication bits are triggered, I just know that sometimes there is packet duplication and sometimes there isn’t.
The following is an example of how the traffic looks on Windows 10 Anniversary Edition (1607) with Hyper-V installed. Bing.com was pinged from the host system, while ARIN.net was pinged from a guest VM. The PowerShell code from part 2 was used to capture the data.
Here is what the Bing.com ping looks like from the host, filtered to keep the size down to blog friendly size.
Source | Destination | Module | Summary
Contoso01 | Bing.com | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL received from Nic (Friendly Name: wireless) in switch (Friendly Name: wireless)
Contoso01 | Bing.com | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL routed from Nic (Friendly Name: wireless) to Nic /DEVICE/ (Friendly Name: Microsoft Network Adapter Multiplexor Driver) on switch (Friendly Name: wireless)
Contoso01 | Bing.com | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL delivered to Nic /DEVICE/ (Friendly Name: Microsoft Network Adapter Multiplexor Driver) in switch (Friendly Name: wireless)
Contoso01 | Bing.com | ICMP | Echo Request
The final ping (ICMP Echo Request) is the packet as it is sent to the physical NIC before it hits the wire.
This data is different from what you’d see on Windows Server or via a wired connection, because my Windows 10 host was connected to a wireless network at the time of the trace. This adds an extra hop for the data to traverse compared to a traditional wired LAN connection.
Long story short, the Multiplexor device is a bridge that links the host vmNIC to the physical wireless NIC. This is a special step needed for Hyper-V to work with wireless networks that would not be present with a wired NIC. Though the data flow is similar to the traffic when a vmSwitch is attached to an LBFO NIC – LBFO NICs are also called Multiplexor adapters, but there is no LBFO native to Windows 10.
The ping to ARIN.net takes a shorter path to the physical network than the host’s ping does.
Source | Destination | Module | Summary
Contoso-Gst | ARIN.net | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL routed from Nic (Friendly Name: Network Adapter) to Nic /DEVICE/ (Friendly Name: Microsoft Network Adapter Multiplexor Driver) on switch (Friendly Name: wireless)
Contoso-Gst | ARIN.net | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL delivered to Nic /DEVICE/ (Friendly Name: Microsoft Network Adapter Multiplexor Driver) in switch (Friendly Name: wireless)
Contoso-Gst | ARIN.net | ICMP | Echo Request
The shorter path is because the guest operating system doesn’t have to traverse a network bridge before hitting the physical network. Instead it goes straight to the multiplexor adapter attached to the vmSwitch and out to the physical network.
In case you are curious about how long all this intra-OS routing takes, it took 0.0355 ms (milliseconds), or 35.5 µs (microseconds), for the ICMP frame to go from the guest vmNIC to the physical adapter. The Bing ping took slightly longer at 0.054 ms, or 54 µs. The network latency through Windows Hyper-V will vary, depending on hardware speed and system resource constraints.
Under 50 µs, each way, is common for guest to physical NIC traversal. If an application absolutely, positively needs every microsecond shaved off the network latency, look at SR-IOV and SET with RDMA as lower latency Hyper-V networking technologies.
Common things that go wrong
Most of the time you won’t experience any issues once Windows gets the packet. The vmNICs and vmSwitches in Windows are, by default, simple virtual devices. Think of the vmSwitches as simple L2 switches. vmNICs are basic synthetic NICs with standard feature sets. There are a few things that can cause headaches though. The three most common causes will be discussed.
VLANs
When using VLANs with Windows Network Virtualization it is best practice to leave the physical switchport, physical NIC, and Hyper-V attached LBFO adapters in trunk mode. VLANs should be attached to vmNICs only. Whether that’s a host management vmNIC, or a vmNIC attached to a guest VM.
Use multiple vmNICs when multiple VLANs are needed. One vmNIC per VLAN. The VLAN can be set via the VM settings for guest vmNICs, and via PowerShell’s Set-VMNetworkAdapterVlan cmdlet.
https://technet.microsoft.com/en-us/itpro/powershell/windows/hyper-v/set-vmnetworkadaptervlan
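As a quick sketch of the one-vmNIC-per-VLAN approach (the VM, adapter, switch, and VLAN values here are placeholders):
# tag a guest VM's vmNIC with VLAN 10 (access mode)
Set-VMNetworkAdapterVlan -VMName "Web01" -Access -VlanId 10

# add a second host management vmNIC and tag it with VLAN 20
Add-VMNetworkAdapter -ManagementOS -Name "Mgmt-VLAN20" -SwitchName "vmSwitch"
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Mgmt-VLAN20" -Access -VlanId 20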
Secondary LBFO adapters can be added to provide VLAN support, but *never* add secondary LBFO adapters to an LBFO adapter attached to a Hyper-V vmSwitch. This will cause all kinds of routing problems in Hyper-V. Use additional host management adapters (vmNICs) if multiple host VLANs are needed.
VLAN errors will appear with the following message in Message Analyzer (filtered for readability).
NBL destined to Nic (Friendly Name: wireless) was dropped in switch (Friendly Name: wireless), Reason Packet does not match the packet filter set on the destination NIC.
The default native VLAN ID is 0. Windows will assume that all untagged traffic belongs to that VLAN.
The default VLAN state for a vmNIC is untagged. Meaning any tagged traffic will automatically be dropped by the vmSwitch.
To change the native VLAN the vmNIC must be in trunk mode, a list of allowed VLANs should be set, and the new native VLAN ID needs to be set. Changing the native VLAN will cause Hyper-V to assume that all untagged traffic belongs to that VLAN, and VLAN 0 traffic must now be tagged.
See Example 2: https://docs.microsoft.com/en-us/powershell/module/hyper-v/set-vmnetworkadaptervlan?view=win10-ps#examples
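A minimal sketch of that change, with illustrative VM name and VLAN numbers, looks like this:
# put the vmNIC in trunk mode, allow VLANs 1-100, and make VLAN 10 the new native VLAN
Set-VMNetworkAdapterVlan -VMName "Web01" -Trunk -AllowedVlanIdList "1-100" -NativeVlanId 10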
ACLs
While there isn’t a lot of configuration on a vmSwitch, there are some ACL capabilities. ACLs and Extended ACLs are set via the VM Network Adapter PowerShell cmdlet, but… the ACL is actually applied on the switchport of the vmSwitch.
These are regular old ACLs, processed in an unordered fashion. This means Hyper-V ACLs don’t have numbered priorities like many other ACL systems do. Instead, the winning rule is picked based on the criteria below.
https://technet.microsoft.com/en-us/library/jj679878.aspx#bkmk_portacls
- ACLs are evaluated on the longest prefix that is matched
- Specific matches, such as MAC address or subnet, win over a wildcard (ANY) match
- MAC beats Subnet
- An IP address is just a subnet with a /32 prefix
- The appropriate action, Allow or Deny, is applied based on the winning match
- Port ACLs are bi-directional, meaning rules must be created for both inbound and outbound traffic
ACLs are the second most common cause of issues with WNV. Usually because not everyone knows about them. One admin applies an ACL, then a different admin wonders why they can’t reach a VM and, because they aren’t aware of ACLs in Hyper-V, they don’t know to look for them.
The best way to look for ACLs is via System Center Virtual Machine Manager, if that is used to manage Hyper-V or HNV/SDN, or with our good friend, PowerShell. The first link below is an article on how to manage port ACLs with SCVMM 2016. The other links point to the help pages for the PowerShell get port ACL cmdlets.
https://technet.microsoft.com/en-us/system-center-docs/vmm/manage/manage-compute-extended-port-acls
https://technet.microsoft.com/en-us/itpro/powershell/windows/hyper-v/get-vmnetworkadapteracl
https://technet.microsoft.com/en-us/itpro/powershell/windows/hyper-v/get-vmnetworkadapterextendedacl
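As a hedged example of checking for and creating port ACLs with PowerShell (the VM name and subnet are placeholders):
# list any standard and extended port ACLs applied to a VM's network adapters
Get-VMNetworkAdapterAcl -VMName "Web01"
Get-VMNetworkAdapterExtendedAcl -VMName "Web01"

# block traffic to and from 10.10.0.0/16 on that VM's switchport, in both directions
Add-VMNetworkAdapterAcl -VMName "Web01" -RemoteIPAddress 10.10.0.0/16 -Direction Both -Action Deny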
From an ETW trace perspective, using the code from Part 2 of this article series, the following error is displayed when traffic is blocked because of a port ACL.
NBL destined to Nic (Friendly Name: Network Adapter) was dropped in switch (Friendly Name: docked), Reason Failed Security Policy
Anti-virus
Ah, anti-virus. The applications everyone both loves and hates. The security teams love the peace of mind, but the server admins hate the resource usage. The official Microsoft recommendation is to not run anti-virus on Hyper-V hosts, run it on the VM guests instead. Problem solved, right? Not so fast, says the security team! All systems must run anti-virus.
This is why Microsoft posted a fancy article detailing the recommended anti-virus configuration for Hyper-V hosts. If you must run anti-virus on a Hyper-V host, please make sure the appropriate exceptions are in place.
https://support.microsoft.com/en-us/help/3105657/recommended-antivirus-exclusions-for-hyper-v-hosts
Part 5 of this series gets started on Container networks types. The default NAT network type gets to go first. I’ll discuss why using NAT can be fun for testing but should not be used on a production system.
-James
WNV Deep Dive Part 5 – Container Networking: The default NAT network
By James Kehr, Networking Support Escalation Engineer
There are, as of this writing, five Container network types in Windows: NAT, Transparent, L2bridge, Overlay, and L2tunnel.
This part of the article series will cover the NAT network type. Part 6, the conclusion, will cover Transparent and L2bridge, plus Hyper-V isolation. Overlay and L2tunnel will not be discussed: Overlay because Docker needs to be in swarm mode for that to work, and I’m not Docker-savvy enough to set up swarm mode; L2tunnel because it is exclusive to the Microsoft Cloud Stack, so I’ll let the Cloud Stack folks write about that.
Which brings us back to the Container NAT network.
NAT Pros
- Auto-magically setup when Containers are installed.
- Works well out-of-the-box.
- Doesn’t require any configuration for basic dev workload.
NAT Kaaaaaaaaaahhhhhhhhns:
- Can require an insane amount of management for production needs, especially when running multiple Containers on a single host for the same service (like multiple web applications).
- Not good for latency sensitive applications, as the NBLs (see Part 2) have a longer travel distance inside of Windows, plus a trip through WinNAT.
- Higher resource needs on the host.
- Did I mention the management nightmare?
What is a NAT?
NAT stands for Network Address Translation. It’s the technology that allows two desktops, a laptop, four tablets, and three smartphones to get Internet access through the single public Internet address provided by your ISP. NAT works by taking your inside IP address, usually something in the 192.168.1.xxx range, and changing it to the single outside IP address when you need to reach an Internet resource.
NAT does this by creating a session table. The session table is a collection of inside to outside IP address combinations. For example, when going to your preferred search engine – which is, of course, Bing.com – the inside IP address and port of your computer gets matched to an available outside IP address and port combination, with the destination address (a Bing.com address like 204.79.197.200) and port (TCP 443) added to complete the uniqueness of the session. This is what allows NAT to transfer data back and forth between multiple systems using only a single public IP address.
Time for a pretty diagram.
The same basic process happens inside of Windows, except the NAT portion is handled by a subsystem called WinNAT. Think of it like a virtual home router sitting between the Windows host and the Containers. The inside Container address gets turned into an outside address, the pairing is written to a session table inside of Windows, and that allows Containers to reach the world through WinNAT-based networking. This is needed because addresses on the inside of the NAT are isolated, meaning nothing from the outside can reach the inside, and nothing from the inside can reach the outside, without WinNAT’s permission.
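If you want to peek at WinNAT’s configuration and session table directly, the NetNat PowerShell module on the Container host exposes it. A quick look (output will vary by host):
# show the NAT objects, their external addresses, and the live session table
Get-NetNat
Get-NetNatExternalAddress
Get-NetNatSession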
Because WinNAT is a newer Windows subsystem it has an ETW provider, Microsoft-Windows-WinNAT. Modify the provider list from the PowerShell steps in Part 2, run a capture on the Container host, and you can see all the address translation magic happening.
# the primary WNV ETW providers.
[array]$providerList = 'Microsoft-Windows-Hyper-V-VmSwitch', 'Microsoft-Windows-WinNAT'
Here’s an example of some WinNAT traffic. Look closely at the events and you’ll spot the first sign of Compartment IDs.
Module | Summary |
Microsoft_Windows_WinNat | TCP binding created. Internal transport addr: 172.23.183.39:49161 (CompartmentId 1), External transport addr 192.168.1.70:1227, SessionCount: 0, Configured: False |
Microsoft_Windows_WinNat | TCP binding session count updated. Internal transport addr: 172.23.183.39:49161 (CompartmentId 1), External transport addr 192.168.1.70:1227, SessionCount: 1, Configured: False |
Microsoft_Windows_WinNat | TCP session created. Internal source transport addr: 172.23.183.39:49161 (CompartmentId 1), Internal dest transport addr: 65.55.44.109:443, External source transport addr 192.168.1.70:1227, External dest transport addr 65.55.44.109:443, Lifetime: 6 seconds, TcpState:Closed/NA |
Microsoft_Windows_WinNat | TCP session state updated. Internal source transport addr: 172.23.183.39:49161 (CompartmentId 1), Internal dest transport addr: 65.55.44.109:443, External source transport addr 192.168.1.70:1227, External dest transport addr 65.55.44.109:443, Lifetime: 120 seconds, TcpState: Internal SYN received |
Microsoft_Windows_WinNat | TCP session lifetime updated. Internal source transport addr: 172.23.183.39:49161 (CompartmentId 1), Internal dest transport addr: 65.55.44.109:443, External source transport addr 192.168.1.70:1227, External dest transport addr 65.55.44.109:443, Lifetime: 120 seconds, TcpState: Internal SYN received |
Microsoft_Windows_WinNat | NAT translated and forwarded IPv4 TCP packet which arrived over INTERNAL interface 8 in compartment 1 to interface 2 in compartment 1. |
The Binding Issue
One of the benefits of Containers is that you can run multiple Containers per host. Dozens to hundreds of Containers on a single host, depending on the workload and host hardware. The main problem with the NAT network is binding. Network-based services normally have a common network port. Web sites and services use TCP ports 80 and 443. SQL Server uses TCP port 1433 as the default. And so on.
The problem: Each port to IP address binding must be unique.
Let’s say our example host is running 100 containers: 75 web sites, 10 Minecraft servers, and 15 SQL Server Containers. All the web-based Containers are going to want to use TCP ports 80 and 443 because browsers are built to use those ports. The SQL Servers will use TCP 1433 by default. Minecraft, TCP port 25565.
NAT’ing means that the host only needs a single IP address for outgoing Container traffic. All 100 Containers could connect to anything using just that one host address. For clients to reach a Container service, however, something called port forwarding is needed. Port forwarding allows devices on the outside of the NAT to reach a service on the inside of the NAT, and works like a NAT in reverse.
To host the 10 Minecraft servers on port 25565 the Container host needs 10 IP addresses. 75 IP addresses for the web site Containers. 15 for the SQL Servers for public access. While there can be some overlap, each time you need to bind port 80, or 1433, or 25565 to a new Container, the host needs another IP address. This goes back to the uniqueness requirement for a binding. Port 80 cannot be bound twice to the same IP address.
Adding a bunch of IP addresses to a server isn’t the complex part. Getting those unique IP addresses to work with NAT requires an additional feature called a PortProxy. The PortProxy feature is a type of port forwarding. This feature can forward network traffic from one IP:port combination to a different IP:port combination inside of Windows. This allows an administrator to use PortProxy to route traffic to an individual Container, giving the NAT type network the ability to host many Containers publicly.
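As a hedged sketch, a single PortProxy rule created with netsh looks something like this (the addresses and ports are illustrative):
# forward traffic arriving at the host on 10.0.0.21:80 to a Container at 172.16.5.10:80
netsh interface portproxy add v4tov4 listenaddress=10.0.0.21 listenport=80 connectaddress=172.16.5.10 connectport=80

# list the configured PortProxy rules
netsh interface portproxy show all
Now multiply that by every Container that needs to be reachable from the outside, and the management overhead becomes clear.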
This practice is highly discouraged in a production environment.
Not Recommended for Production
The first problem with NAT in production is the double network stack connection. The PortProxy creates one network connection on the host. The second connection is created on the Container itself (which is also the host, kind of). Ultimately, there is one connection between the client and host, and a second between the host and the Container. While this technically works, it’s an unnecessary mess.
Then there’s the administrative nightmare trying to keep all the PortProxy settings straight, which can be a chore since the only mechanism that exposes PortProxy is the legacy netsh command. Perhaps the biggest issue is the latency. All that bouncing around inside of Windows adds precious microseconds to each packet. That doesn’t seem like a lot, but it adds up fast.
This is an example of a client connecting to a host with a PortProxy and a NAT Container network. The TCP SYN arrives from the client and a TCP/IP connection is created on the host.
Followed by the second connection from the host to the Container. The traffic goes from the host, across the vmSwitch, bypassing WinNAT, and directly to the Container. Or as directly as the NBL can travel through the vmSwitch. A second network stack connection is then established on the Container.
From packet arrival on the host to the packet arriving at the Container’s network stack is about 0.6783 milliseconds, or 678 microseconds, in this example. This makes NAT + PortProxy about 15-20 times slower than standard vmSwitch traversal. Which is another reason why NAT is good for testing, less so for production purposes.
The final article in this series will cover the preferred production Container network type, transparent mode. I’ll talk briefly about L2Bridge mode, because it’s just transparent with a catch, and end the series with a brief explanation about what happens when you throw Hyper-V Container isolation into the mix.
-James
WNV Deep Dive Part 6 – Container Networking: Transparent and L2bridge Networks
By James Kehr, Networking Support Escalation Engineer
The next Container network type on the list is called transparent. Production Container workloads, outside of swarms and special Azure circumstances, should be using a transparent network. The exception is when you need L2 bridging in an SDN environment, but most production deployments will use the transparent network type.
Transparent Networks
The transparent network essentially removes WinNAT from the NBL path. This gives the Container a more direct path through Windows. Management of transparent network attached Containers is as easy as managing any other system. It’s like the virtual network is transparent to the system admins…hmmm.
The only configuration needed with a Container on a transparent network is the IP address information. Since the Container is directly attached to the network infrastructure, this can be done through static assignment or regular old DHCP. The Container connects to a normal vmSwitch on the default Compartment 1 (see part 2). On top of easy management, the NBL only travels through three or four hops within Windows, depending on whether LBFO NIC teaming is enabled. This decreases latency to numbers similar to a normal Hyper-V guest.
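A minimal sketch of creating a transparent network and attaching a Container with a static address follows. The subnet, gateway, IP, and image are placeholders; drop --subnet and --gateway to let DHCP handle addressing.
# create the transparent network on the Container host
docker network create -d transparent --subnet=192.168.1.0/24 --gateway=192.168.1.1 TransparentNet

# run a Container on it with a static IP from that subnet
docker run -it --network=TransparentNet --ip=192.168.1.50 microsoft/windowsservercore cmd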
There is at least one circumstance where your Containers will end up on a separate network compartment. This primarily happens when the Container is attached to a Hyper-V created vmSwitch. Hyper-V traffic will use Compartment ID 1, and Container traffic will use something else, like Compartment ID 2. Each Container network attached to the Hyper-V vmSwitch will use a unique Compartment ID. This is done to keep traffic secure and isolated between all the various virtual networks. See part 2 for more details about Compartments.
The traffic looks identical to the Hyper-V traffic, except you see the Container traffic on the host system, where the endpoint is not visible in Hyper-V. The Container vmNIC sends an NBL to the vmSwitch, which routes the traffic to the host NIC, where the NBL is converted to a packet, and off to the wire it goes.
Source | Destination | Module | Summary
cntr.contoso.com | bing.com | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL received from Nic (Friendly Name: Container NIC 066a9b55) in switch (Friendly Name: vmSwitch)
cntr.contoso.com | bing.com | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL routed from Nic (Friendly Name: Container NIC 066a9b55) to Nic /DEVICE/ (Friendly Name: Host NIC) on switch (Friendly Name: vmSwitch)
cntr.contoso.com | bing.com | ICMP | Echo Request
| | Microsoft_Windows_Hyper_V_VmSwitch | NBL delivered to Nic /DEVICE/ (Friendly Name: Host NIC) in switch (Friendly Name: vmSwitch)
cntr.contoso.com | bing.com | ICMP | Echo Request
The total time for the ping (Echo Request) to go from the Container to the wire in this example was 0.0144 milliseconds, or 14.4 microseconds. About 47 times faster than the NAT network with PortProxy example. That is quite the significant improvement.
Transparent Container networks offer easier management, lower resource consumption, and significantly improved throughput compared to NAT networks, especially NAT with PortProxies. Which is why production Container workloads should, in basic circumstances, use the Transparent network.
L2Bridge networks
For the purposes of this article the L2Bridge network is identical to a Transparent network. The NBL traverses Windows in the exact same way. The differences between Transparent and L2Bridge are spelled out in the Windows Container Networking doc.
l2bridge – containers attached to a network created with the ‘l2bridge’ driver will be in the same IP subnet as the container host. The IP addresses must be assigned statically from the same prefix as the container host. All container endpoints on the host will have the same MAC address due to Layer-2 address translation (MAC re-write) operation on ingress and egress.
There doesn’t appear to be any obvious way to see the MAC address rewriting in an ETW trace, so there’s nothing new to look at. The L2Bridge is used mainly with Windows SDN. And that’s that.
Hyper-V Isolation
One of the touted features of Docker on Windows is Hyper-V isolation. This feature runs the container inside of a special, highly optimized VM. This adds an extra layer of security, which is good. But there’s no way to troubleshoot what happens to the network traffic once it leaves the host and goes to the optimized VM. Hyper-V isolation is a networking black hole.
The obvious question is… why? The answer boils down to how the optimization is done. Neither the optimized VM nor the Container image has a way to capture packets. The Container can’t because that subsystem is shared with the host kernel, as discussed in part 1. The VM doesn’t have packet capture because it is optimized to the point that it only includes the minimum number of components and features needed to run a Container. That list does not include packet capture. And that’s assuming you could even access the VM, which I haven’t found a way to do.
How exactly does someone go about troubleshooting networking on an isolated Container, then? The Container lives separately from the optimized VM. Isolation is applied when the correct parameter is added to the docker run command when starting the Container. Specifically, the "--isolation=hyperv" parameter. Remove the isolation parameter so the Container starts normally, and then all the normal troubleshooting rules apply. While it’s not a complete apples-to-apples comparison, it does provide some ability to troubleshoot potential networking issues within the Container.
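For instance, the same Container image can be started both ways (the image and command are just placeholders):
# Hyper-V isolation: the networking black hole described above
docker run -it --isolation=hyperv microsoft/windowsservercore cmd

# process isolation: a normal Windows Server Container where the usual capture and troubleshooting steps apply
docker run -it --isolation=process microsoft/windowsservercore cmd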
That concludes this series on Windows Virtualization and Container networking. I hope you learned a thing or three. Please keep in mind that technologies such as Containers and Windows networking change a lot, especially when the technology is young. Don’t be afraid to dig in and learn something new, even if this article just acts as a primer to get you started.
-James
Windows Server 2016 Software Defined Networking: Updating the Network Controller Server certificate
Network Controller uses a single certificate for northbound communication with REST clients (like System Center Virtual Machine Manager) and southbound communication with Hyper-V hosts and Software Load Balancers. A customer may wish to change this certificate after initial deployment, maybe because the certificate has expired or because they want to move from a self-signed certificate to one issued by a Certificate Authority. Currently, the workflow to update certificates is broken if you are using System Center Virtual Machine Manager. This will be fixed in an upcoming release. For now, please follow the steps below to update the Network Controller Server certificate.
NOTE: These steps are not required if you are renewing the existing certificate with the same key.
Steps to update the Network Controller Server certificate
- Install the new certificate in Personal store of LocalMachine account on a Network Controller node
- Export the certificate with private key and import it on the other Network Controller nodes (to ensure that the same certificate is provisioned on all the nodes)
- DO NOT remove the old certificate from the Network Controller nodes
- Update the server certificate using the Powershell command:
Set-NetworkController -ServerCertificate <new cert>
- Update the certificate used for encrypting the credentials stored in the Network Controller using the Powershell command:
Set-NetworkControllerCluster -CredentialEncryptionCertificate <new cert>
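As a hedged sketch of those two update commands, the certificate object can be pulled from the local store by thumbprint and passed to both cmdlets (the thumbprint is a placeholder):
# grab the new certificate object from the LocalMachine Personal store
$cert = Get-Item "Cert:\LocalMachine\My\<thumbprint of the new certificate>"

# update the server certificate and the credential encryption certificate
Set-NetworkController -ServerCertificate $cert
Set-NetworkControllerCluster -CredentialEncryptionCertificate $cert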
- You will also need to update the certificate used for southbound authentication with Hyper-V hosts and Software Load Balancer MUX virtual machines. To update this, follow steps 7 to 9.
- Retrieve a Server REST resource using the Powershell command:
Get-NetworkControllerServer -ConnectionUri <REST uri of your deployment>
- In the Server REST resource, navigate to the “Connections” object and retrieve the Credential resource with type “X509Certificate”
"Connections": [ { "ManagementAddresses":[ “contoso.com" ], "CredentialType": "X509Certificate", "Protocol": null, "Port": null, "Credential":{ "Tags": null, "ResourceRef": "/credentials/41229069-85d4-4352-be85-034d0c5f4658", "InstanceId": "00000000-0000-0000- 0000-000000000000", … … } } ]
- Update the Credential REST resource retrieved above with the thumbprint of the new certificate
$cred = New-Object Microsoft.Windows.Networkcontroller.credentialproperties
$cred.type = "X509Certificate"
$cred.username = ""
$cred.value = "<thumbprint of the new certificate>"
New-NetworkControllerCredential -ConnectionUri <REST uri of the deployment> -ResourceId 41229069-85d4-4352-be85-034d0c5f4658 -Properties $cred
- If the new certificate is a self-signed certificate, provision the certificate (without the private key) in the Trusted Root certificate store of all the Hyper-V hosts and Software Load Balancer MUX virtual machines. This is to ensure that the certificate presented by Network Controller is trusted by the southbound devices. If the certificate is not self-signed, ensure that the Certificate Authority that issued the certificate is also trusted by the Hyper-V hosts and the Software Load Balancer MUX virtual machines.
- System Center Virtual Machine Manager (SCVMM) also must be updated to use the new certificate. On the SCVMM machine, execute the following Powershell command:
Set-SCNetworkService -ProvisionSelfSignedCertificatesforNetworkService $true -Certificate $cert -NetworkService $svc
where NetworkService is the Network Controller service, Certificate is the new Network Controller certificate, and ProvisionSelfSignedCertificatesforNetworkService is $true if you are using a self-signed certificate.
- Provision the Network Controller certificate (without the private key) in the Trusted Root certificate store of the SCVMM machine
After you have verified that the connectivity is working fine, you can go ahead and remove the old Network Controller certificate from the Network Controller nodes.
Network start-up and performance improvements in Windows 10 April 2018 Update and Windows Server, version 1803
Increased container density, faster network endpoint creation time, improvements to NAT network throughput, DNS fixes for Kubernetes, and improved developer features
A lot of enthusiasm and excitement surrounds the highly anticipated quality improvements to the container ecosystem on Windows; all shipping with Windows Server version 1803 (WS1803) and Windows 10 April 2018 Update. The range of improvements span long-awaited networking fixes, enhanced scalability and efficiency of containers, as well as new features to make the suite of container networking tools offered to developers more comprehensive. Let’s explore some of these improvements and uncover how they will make containers on Windows better than ever before!
Improvements to deviceless vNICs
Deviceless vNICs for Windows Server Containers remove the overhead of using Windows PnP device management, which makes both endpoint creation and removal significantly faster. Network endpoint creation time in particular can have a notable impact on large-scale deployments, where scaling up and down can add unwanted delay. Windows 10 April 2018 Update and WS1803 achieve better performance than their predecessors, as the data below will show.
WS1803 is Microsoft’s best-of-breed release to date in terms of providing a seamless scaling experience to consumers that expect things to “just work” in a timely fashion.
To summarize the impact of these improvements:
- Increased scalability of Windows Server Containers from 50 to 500 containers on one host with linear network endpoint creation cost
- Decreased Windows Server Container start-up time with 30% improvement in network endpoint creation time and 50% improvement in time taken for container deletion
Before vs. after
As discussed above, container vNIC creation and deletion was one of the identified bottlenecks for the scaling requirements of powerhouse enterprises today. In previous Windows releases, with PnP devices required for container instantiation, we saw on average 10 container creations fail out of 500. Now, with deviceless vNICs, we don’t see any failures for 500 container creations.
See the graphs below for a quick visualization of the trends discussed:
In addition to this, check out the stress test below that captures the new, lightning-fast multi-container deployment creation time!
Stress test: container endpoint creation time
Description
PowerShell script that creates and starts a specified number of recent "microsoft/windowsservercore" Windows Server containers (build 10.0.17133.73) on a Windows Server, version 1803 host (build 17133) using the default "NAT" network driver.
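The script itself isn’t included in the post, but a minimal sketch of that kind of measurement loop might look like the following. It measures end-to-end docker run time rather than isolating the HNS endpoint creation time shown in the table below, and the container count and image tag are illustrative.
# time how long it takes to create and start each container on the default NAT network
$containerCount = 50
$results = foreach ($i in 1..$containerCount) {
    $elapsed = Measure-Command {
        # keep the container alive with a long-running command
        docker run -d --network nat microsoft/windowsservercore ping -t localhost | Out-Null
    }
    [pscustomobject]@{ Container = $i; CreateStartMs = [math]::Round($elapsed.TotalMilliseconds, 2) }
}
$results | Format-Table -AutoSize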
Hardware specification
- C6220 Server
- Storage: 1 400GB SSD
- RAM: 128GB
- CPU: 2x E5-2650 v2 2.6Ghz 16c each (32c total)
- Networking: 1 Gb Intel(R) I350 Gigabit Network Connector
Test results
Number of Containers | Average HNS endpoint creation time (switch+port+vfp) (ms) |
10 | 104.6 |
50 | 126.28 |
100 | 150.3 |
Figure 3 – Table of HNS endpoint creation time. Wondering what HNS is? See here
Container endpoint creation time (ms) vs. number of container instances
Test discussion
The results show that container creation performance follows a stable linear trend, with creation time scaling to an average of 150 ms on a server with 100 endpoints.
In other words, on our experimental hardware we can roughly estimate Windows server container creation time “t” against number of endpoints on server “n” very easily using the simple relationship t = n/2 +100.
This shows that the daunting task of twiddling your thumbs waiting for deployment to finally launch is much more agreeable and foreseeable on WS1803.
NAT Performance Improvements
Several bespoke Windows use-cases, including Windows Defender Application Guard in the Edge web browser and Docker for Windows, rely heavily on network address translation (NAT), so investment in one comprehensive and performant NAT solution is another built-in benefit of moving to this new release.
Alongside improvements in deviceless vNICs, here are some additional optimizations which are applicable to the NAT network datapath:
- Optimizations (CPU utilization) of machinery used for translation decisions of incoming traffic
- Widened network throughput pipeline by 10-20%
This alone is already a great advocate for moving to the new release, but watch this space for even more awesome optimization goodies (in the near future!) that are actively being engineered!
Improvements to Developer Workflows and Ease of Use
In previous Windows releases, there were gaps in meeting the flexibility and mobility needs of modern developers and IT admins. Networking for containers was one space where these gaps prevented both developers and IT admins from having a seamless experience; they couldn’t confidently develop containerized applications due to a lack of development convenience and network customization options. The goal in WS1803 was to target two fundamental areas of the developer experience around container networking that needed improvement: localhost/loopback support and HTTP proxy support for containers.
1. HTTP proxy support for container traffic
In WS1803 and Windows 10 April 2018 Update, functionality is being added to allow container host machines to inject proxy settings upon container instantiation, such that container traffic is forced through the specified proxy. This feature will be supported on both Windows Server and Hyper-V containers, giving developers more control and flexibility over their desired container network setup.
While simple in theory, this is easiest to explain with a quick example.
Let’s say we have a host machine configured to pass through a proxy that is reachable under proxy.corp.microsoft.com and port number 5320. Inside this host machine, we also want to create a Windows server container, and force any north/south traffic originating from the containerized endpoints to pass through the configured proxy.
Visually, this would look as follows:
The corresponding actions to configure Docker to achieve this would be:
For Docker 17.07 or higher:
- Add this to your config.json:
{
"proxies": {
"default": {
"httpProxy": "http://proxy.corp.microsoft.com:5320"
}
}
}
For Docker 17.06 or lower:
- Run the following command:
docker run -e "HTTP_PROXY=http://proxy.corp.microsoft.com:5320" -it microsoft/windowsservercore <command>
Diving deeper from a technical standpoint, this functionality is provided through three different registry keys that are being set inside the container:
- Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\WinHttpSettings
- Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\DefaultConnectionSettings
- Software\Policies\Microsoft\Windows\CurrentVersion\Internet Settings
The configured proxy settings inside the container can then be queried using the command:
netsh winhttp show proxy
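From the host, one quick way to run that check inside a running Container (the container name is a placeholder) is:
docker exec proxytest netsh winhttp show proxy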
That’s it! Easy, right? The instructions to configure Docker to use a proxy server can be found in the Docker documentation.
The preliminary PR can be tracked here.
2. Localhost/loopback support for accessing containers
New with the Windows 10 April 2018 Update and WS1803 release is also support for being able to access containerized web services via “localhost” or 127.0.0.1 (loopback). Please see this blog post that does an excellent job portraying the added functionality. This feature has already been available to Windows Insiders via Build 17025 on Windows 10 and Build 17035 on Windows Server.
Networking Quality Improvements
One of the most important considerations of both developers and enterprises is a stable and robust container networking stack. Therefore, one of the biggest focus areas for this release was to remedy networking ailments that afflicted prior Windows releases, and to provide a healthy, consistent, and sustainable networking experience of the container ecosystem on Windows.
Windows 10 April 2018 Update and WS1803 users can expect the following:
- Greatly stabilized DNS resolution within containers out-of-the-box
- Enhanced stability of Kubernetes services on Windows
- Improved recovery after Kubernetes container crashes
- Fixes to address and port range reservations through WinNAT
- Improved persistence of containers after Host Networking Service (HNS) restart
- Improved persistence of containers after unexpected container host reboot
- Better overall resiliency of NAT networking
We continue being dedicated to stamp out pesky networking bugs. After all, sleepless nights playing whack-a-mole with HNS are no fun (even for us). If you still face container networking issues on the newest Windows release, check out these preliminary diagnostics and get in touch!
Previewing support for same-site cookies in Microsoft Edge
Please refer to our Edge blog:
Introducing the NetAdapter Driver model for the next generation of networks and applications
As we move towards a fully connected world, inundated with intelligent devices and massively distributed computing infrastructure, networks that can sustain high bandwidth have never been more relevant. Initial requirements for a 5G network project peak data rates on the order of tens of gigabits per second. Gaming and video streaming applications continue to push the frontier, seeking higher throughput and a lower-latency data path. In addition, a new breed of developers demands a simpler driver model that offers agility and greater reliability.
The Windows core networking team has been hard at work building a new, simpler network driver model. Introducing the NetAdapter Class Extension for the Windows Driver Framework (WDF) and an updated data path for the next generation of networking on Windows.
NetAdapter Class Extension (NetAdapterCx)
The NetAdapter class extension (NetAdapterCx) is a module for the Windows Driver Framework that can be used to write a driver for a network interface card. NetAdapter brings with it a simpler, easier-to-use driver model that offloads complexity to WDF and offers improved reliability. The initial focus for this model is on consumer devices with mobile broadband network adapters, paving the way for adoption by Ethernet and the rest of the ecosystem over the next couple of years.
NOTE: WDF allows developers to implement simple and robust drivers. It has been the model of choice for most Windows driver developers because it abstracts away many complexities, such as interacting with the PnP, power, and power-policy subsystems.
Derived from PacketDirect, an experimental Windows Server data-plane technology, NetAdapterCx brings an updated data path that sits below the TCP/IP stack and improves on the current NDIS stack by reducing latency and cycles per packet. The new data path is built on a polling-based I/O model rather than the interrupt-driven model, allowing the OS to optimize performance.
These improvements not only yield accelerated data paths and better drivers, but also make drivers easier to build. Windows developers can now focus on solving network-domain-specific problems while leveraging the framework for common device tasks.
But wait, what happened to NDIS? NDIS is not going away anytime soon! NetAdapter combines the productivity of WDF and the networking performance of NDIS.
Why build NetAdapter?
- Stay consistent with WDF: Popular demand! Yes, we heard YOU! In the past, WDF and NDIS each had advantages but did not interoperate well, which meant that only a small subset of WDF features was accessible from an NDIS miniport driver. Now, because NetAdapterCx extends WDF, new OS kernel features are readily available to a NetAdapterCx-based driver. This allows developers to focus most of their effort on enabling their hardware to work on Windows rather than dealing with OS complexities.
- A simpler driver model: WDF brings a familiar set of abstractions that simplify driver development, making it easier for non-NDIS driver developers to write and maintain a NetAdapter-based network driver. NDIS pushes many hard problems onto the client driver, such as data-path synchronization and PnP and power handling. NetAdapterCx solves this by taking over responsibility for synchronizing power and PnP events with both data- and control-path I/O, so the individual client driver is not required to do so. In addition, NetAdapterCx serializes input to the queues. With NDIS, the client driver runs with elevated privileges and can consequently destabilize the entire system, resulting in bug checks.
- By moving complexities such as DMA mapping into the OS and using the polling-based I/O model, the NetAdapter model works toward graceful handling of such scenarios, resulting in improved driver quality and reliability.
- Accelerated performance: Moving away from the legacy interrupt-driven model in NDIS, NetAdapterCx builds on the updated data path and the polling-based I/O model of PacketDirect. In the future, the polling model can be further optimized for performance. NetPacket, the primary layer of abstraction in the NetAdapter data path, maps directly to the NIC hardware queues. This allows Windows to intelligently and transparently scale out in a way that makes maximum use of your NIC’s hardware. Because of this direct mapping, scale-out for features like Receive Side Scaling becomes more impactful.
Going forward, all NetAdapter drivers will be state-separated and DCHU-compliant for new, secure systems.
Architectural Overview
The following pictures show the differences between the traditional NDIS architecture and the new NetAdapter model.
Figure 1 above shows the traditional NDIS model. A typical NDIS miniport driver leverages NDIS for all its needs, including PnP, power, control, data and hardware interaction.
Figure 2 above shows the new miniport driver as a WDF client. It interacts with NDIS via the new NetAdapter Class Extension (NetAdapterCx), but only for data and control. For other needs, such as PnP, power, and hardware interaction, the new NetAdapter driver uses WDF interfaces.
As you can see, NetAdapterCx still works with NDIS behind the scenes, handling all the interaction with NDIS on the driver’s behalf.
What is available today?
Starting in Windows 10, version 1703, the Windows Driver Kit (WDK) includes a preview of the Network Adapter WDF Class Extension module (NetAdapterCx). Many thanks to our networking partners and the Windows developer community; without you, the NetAdapterCx framework wouldn’t be as mature as it is today.
Milestone | Comments
RS2 release, Windows 10, version 1703 | Fully working preview with support for Ethernet over any bus (USB, PCIe).
RS3 release, Windows 10, version 1709 | Performance advancements over the RS2 release.
RS4 release, Windows 10, version 1803 | Stabilization.
Upcoming RS5 release, Windows 10, version 1809 | Focus on consumer devices with mobile broadband (MBB) network adapters; more details below. Go here for the latest specification.
Marching towards RS5 RTM
The goal in RS5 is to commercialize the new NetAdapter-based Mobile Broadband class extension (MBBCx) and class driver. This new framework will be compatible with in-market modems that rely on the NDIS-based MBB USB class driver built using the MBIM specification.
We are excited to announce that this framework and class driver are shipping in RS5 Insider builds for you to try! Test it out and do not forget to report issues. Help us make Windows better for you!
Long road ahead of us
While we do believe NetAdapter framework is our path forward, we see NDIS 6 and NetAdapter co-existing for the foreseeable future.
After RS5 stabilization for in-market MBB modems, the NetAdapter framework is slated to expand to upcoming consumer devices with PCIe mobile broadband modems. Future focus will be on increasing coverage across other network adapter types.
Can’t wait to try?
It is critical for us to get your feedback to drive adoption and influence future design decisions and the roadmap.
We have a ton of great information and resources to share with you. To that end, we invite you to:
- Visit our GitHub page and hit “follow”.
- Visit the documentation at docs.microsoft.com for the API specification.
- Check out the video series here to dive deeper into the topics introduced in this article.
- Peruse our NetAdapter based driver samples.
- Experiment building a driver for your hardware.
- Most importantly, reach out on NetAdapter@microsoft.com with issues and questions.
We truly appreciate and look forward to your feedback and continued engagement!