If you’re new to Wi-Fi, you may have a story similar to mine. I wanted to share how I decided to specialize in it.
A slightly long read, but I’m sure many readers will resonate, having experienced similar issues themselves.
The main takeaways for me, looking back at the last few years I’ve spent improving my knowledge of Wi-Fi:
- We fear what we do not know
- When it’s reported as a Wi-Fi issue, it often isn’t (in fact, most of the time it isn’t)…as Wi-Fi engineers, we have to prove it isn’t and identify the root cause of the issue
- Everyone other than a network guy says “the network” is always to blame. And hey, sometimes it is…
- You can find your passion when you work on something you know nothing about
Working for my employer, I take care of all things considered “Network”: Routing, Switching, Security, Data Center and Wi-Fi.
I’ve always been strong in Routing, Switching, Security and ‘OK’ with Data Center but a novice at Wi-Fi.
The best I could do with Wi-Fi is configure an AP on a Controller (assign a mode, AP group, etc).
To be honest, at that point, like most Network Engineers, that’s really how I wanted to keep it. I wanted to be “the guy” who was an expert in Route/Switch, Data Center and Cloud (AWS/Azure). Wi-Fi was an afterthought in my mind, and I really didn’t want to work with it unless I had to. Mostly down to the fact that I feared it.
I didn’t really understand it (the RF or 802.11 parts).
What is signal strength?
What is RSSI?
What is SNR?
What is an AP that is spec’d at 2×2:2 or 3×3:3?
What metrics are considered “good” for client connectivity?
…I had no clue.
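In hindsight, a couple of those unknowns boil down to simple arithmetic. Here’s a minimal sketch; the -90 dBm noise floor and the cutoffs below are illustrative rule-of-thumb assumptions, not fixed rules:

```python
# Back-of-the-napkin Wi-Fi link metrics.
# The noise floor and "good" thresholds are illustrative assumptions;
# real values depend on the environment, band, and vendor guidance.

def snr_db(rssi_dbm: float, noise_floor_dbm: float = -90.0) -> float:
    """SNR is simply signal strength (RSSI) minus the noise floor."""
    return rssi_dbm - noise_floor_dbm

def link_quality(rssi_dbm: float, noise_floor_dbm: float = -90.0) -> str:
    """Very coarse labels using commonly cited rule-of-thumb cutoffs."""
    snr = snr_db(rssi_dbm, noise_floor_dbm)
    if rssi_dbm >= -67 and snr >= 25:
        return "good"      # often quoted as solid for voice/handheld devices
    if rssi_dbm >= -75 and snr >= 15:
        return "marginal"
    return "poor"

# An AP spec'd "2x2:2" has 2 transmit chains x 2 receive chains and
# supports 2 spatial streams; "3x3:3" is 3x3 with 3 spatial streams.
print(snr_db(-65))        # 25.0 dB
print(link_quality(-65))  # "good"
print(link_quality(-80))  # "poor"
```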
In early 2016, I came in to work one morning and I was the first one in.
A Sev 1 ticket was in our queue and I was told to look at it right away.
The Wi-Fi in one of our Warehouses was “down”.
Checking the basics, I confirmed the controllers were up, APs were registered and clients were connected. I then asked the Warehouse what the exact problem was. They explained in frustration: “The Wi-Fi keeps disconnecting all our devices and they don’t connect again”. In this case they were referring to their barcode handhelds and voice picking devices. My immediate question was: had anything changed? And of course their response was “No! Nothing has changed whatsoever.”
After exhausting my limited knowledge troubleshooting this issue, raising a TAC case with Cisco was the only thing I felt I could do. We began some debugs with Cisco TAC, and they told me that packets from several client devices weren’t getting through to the WLC. When a client tried to authenticate, they were seeing M1 and M3 messages of the 4-way handshake but nothing after that. I had no idea what that meant or how we were meant to troubleshoot it. They provided some recommendations, but to some extent they had exhausted what they could do for us there and then.
I was sent onsite to look at the problem.
I was learning more about Wi-Fi than I ever had before (seeing APs mounted 40 feet high, directional antennas, etc).
It was pretty overwhelming, and when I arrived on site, the folks at the warehouse saw me and thought, “Hey! Here’s the guy who knows everything and is here to solve all the problems”…in reality that was far from the truth.
Everyone there (many with years of experience managing warehouses) offered their input: “We have too many APs”, “There’s too much metal here so it’s causing the signals to bounce everywhere and creating interference with the Wi-Fi”, “The antennas used are all wrong”.
I didn’t know whether or not they were right but I had to take what they said with a pinch of salt.
Speaking with more people on site, I found out things had changed. They had moved a large part of their operations to a different part of the warehouse that wasn’t being used before, racks had been taken down and moved, and over the weekend there had been some work done on the building power (of course that wouldn’t have caused any issues…).
Cisco TAC was not able to help us further in figuring out the issue, so we engaged a third-party company specializing in Wi-Fi surveys. They performed an active and a passive survey and identified some improvements to be made, but from their perspective nothing was really wrong in the environment. I was at a dead end; a few days had passed and we weren’t making any more progress. The warehouse was able to operate, but not efficiently. I decided to stay overnight and work through the areas where they were reporting problems with certain devices.
I found that even with my own personal devices, like my iPhone, I was also failing to connect, or my connectivity would fail completely in certain areas (Wi-Fi was connected but I couldn’t forward traffic). Each time I found a fault, I would work through it and check the controller; it was always after I associated to certain APs.
Given that the Wi-Fi survey reported the wireless environment was generally in good health, I paused my troubleshooting on the Wi-Fi side and began to troubleshoot the wired side. Some devices I used worked just fine (like my laptop) and others (like my iPhone) didn’t. I found the same thing with devices they used in the warehouse for day-to-day operations, all the same model and firmware: ten devices would have no issues and another ten would; it seemed very random.
I really began to consider that this could be a wired issue. The APs connect back to a stack of access switches with a port-channel up to a VSS L3 core switch. From my networking knowledge, and given that some devices experienced the issues and others didn’t, I considered that perhaps there was an issue with how packets were load balanced over the port-channel.
Using ‘test etherchannel load-balance interface po1 mac &lt;source-mac&gt; &lt;dest-mac&gt;’ with the source/destination MAC address combinations of working and non-working devices, I could see that working devices were hashed to interface 1 in the port-channel (connecting to Switch 1 in the VSS core), while devices that were not working were hashed to interface 2 (connecting to Switch 2 in the VSS core).
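The idea behind that command can be sketched in a few lines. This is a simplified illustration of src-dst-mac load balancing, not the actual Cisco hardware hash: XORing the low-order bits of the two MAC addresses to pick a member link is a common textbook approximation, and the MAC addresses below are hypothetical.

```python
# Simplified sketch of src-dst-mac EtherChannel load balancing.
# A real switch uses its own hardware hash; XOR of the MAC addresses
# modulo the number of member links is just an approximation.

def member_link(src_mac: str, dst_mac: str, num_links: int = 2) -> int:
    """Return the port-channel member link (1-based) a flow hashes to."""
    src = int(src_mac.replace(":", "").replace(".", ""), 16)
    dst = int(dst_mac.replace(":", "").replace(".", ""), 16)
    # XOR source and destination, then keep just enough low-order
    # bits to select among the member links.
    return ((src ^ dst) % num_links) + 1

# Two hypothetical clients talking to the same gateway MAC: because the
# hash is deterministic per src/dst pair, each flow pins to one link.
gw = "00:11:22:33:44:55"
print(member_link("aa:bb:cc:dd:ee:00", gw))  # 2
print(member_link("aa:bb:cc:dd:ee:01", gw))  # 1
```

This determinism is why the failures looked “random” per device but were perfectly consistent per MAC address: every frame of an affected pair always took the broken member link.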
With this information, I had a eureka moment. I decided to shut down the second interface between the access switch and the core switch and performed the tests on all devices. Lo and behold, all devices connected right away and operated without any issues!
My next steps were to find out why this was happening. The core (VSS) and access switches were operating without any issues, or at least they seemed to be: nothing indicated a problem in the logs, VSS status, port-channel status, etc. I then looked for relevant bugs for the code we were running…and found a bug that seemed to match what I was seeing:
For IP packets passing through the L2 Multichassis Ether Channel, partial packet drops are noticed.
On a VSS, L2 Multichassis Ether Channel, must be configured with trunk mode. The issue is seen after the VSS-standby switch reloads and becomes part of the VSS again.
I was pretty certain this was the root cause and checked the uptime of the standby switch in the VSS pair. It had been reloaded (due to a power issue), though the primary had not been. This was most likely due to the work that took place on the building power at some point over the weekend.
In the interim, I performed the workaround mentioned in the bug, and everything was operational in the warehouse with no problems reported thereafter. I contacted TAC and got confirmation on which version of code we should upgrade our core switch to, performed the upgrade, and we were finally back to a stable environment.
From this experience, I really began to find Wi-Fi interesting. My curiosity grew. I used to be the guy who would shudder at the thought of working on a Wi-Fi issue; now I’m up for working on anything Wi-Fi, whether it be design, implementation or troubleshooting. I’m working toward being the guy who knows Wi-Fi really well, well enough to choose it as a specialization and become an expert…this has led me onto the path to becoming CWNE and CCIE Wireless.