When I woke up today around 6:45 AM PDT, I didn’t seem to have internet service available. My phone told me that I was connected to my Wi-Fi network, but it didn’t have connectivity. “Hmm, that’s weird,” I thought. Maybe a fiber cut in the area or something? I looked at my IRC client on my desktop Windows PC, which is nice because it records timestamps of when I lose my connection:
My connection had been down for over 3 hours at this point. Weird! I figured I would log into my ASUS RT-AC86U router’s web interface and see what was going on. Something happened that I wasn’t expecting at all — the page wouldn’t fully load. Portions of it showed the little “sad page” icon indicating a connection error.
I tried to SSH into the router instead. The first few connection attempts failed, and then finally I got in. What I found, though, was that I couldn’t run any commands. It just spit this error back at me:
-sh: can't fork
OK, so something was really messed up. I decided to power cycle the router at this point. Maybe some weird glitch happened or something. Which would be odd — this router has been pretty rock solid since I’ve had it, aside from 2.4 GHz Wi-Fi issues over time. That’s another story I don’t want to get into today.
Anyway, when the router came back up everything seemed fine. But then, 40 minutes later, my connection dropped again with the same symptoms.
The fact that they were both at exactly 23 seconds is probably just a crazy coincidence. I was starting to panic a bit at this point. I really didn’t think an issue like this could be my ISP’s fault, but I hadn’t changed a single thing about my network setup. I hadn’t updated my router firmware for quite a while either — I had automatic updates turned off, and last I had checked, ASUS hadn’t released a new update for it.
I was able to successfully SSH into the router this time, and I did a few quick diagnostics. I used top to show me what was going on. I sadly didn’t take any screenshots, but I noticed that a process called asd was taking up 50% of my CPU. The CPU is dual-core according to /proc/cpuinfo, so 50% likely means one core was fully pegged.
My first instinct was to search for asd (which was difficult with a non-working internet connection) but I found that it’s an ASUS security daemon. This made me feel a little bit better, but I still felt like it had to be involved in the problem. Normally when I SSH into my router, top doesn’t show anything using anywhere close to 50% of the CPU.
I started searching on Reddit and Twitter to see if anyone else had run into anything similar, and that’s when I spotted this tweet by @stevecantsmell:
The way he worded it, it sounds like he works for an ISP. This sounded so similar to my issue, even down to the time frame! That would correspond to 3 AM in my time zone. I followed his advice. I quickly rebooted the router and went right into the firmware update page in its web UI. Sure enough, I was running version 3.0.0.4.386.48260 and there was an update available for 3.0.0.4.386.51529 which was released last month. It turns out I had also missed a firmware release that came out in March. I do like to keep my router up to date, but I had been checking at a slower interval since there hadn’t been an update for about a year.
I was able to install the update. The router rebooted on its own after the update finished and everything has been fine since then. asd is no longer using 50% of the CPU either. In the hours since this problem occurred, I’ve heard of countless other people who ran into this exact same issue with a variety of ASUS routers. More people chimed in in the Twitter thread linked above, and there were several posts on Reddit and SNBForums. In some cases a beta firmware was required to fix the issue. It was comforting to know that I wasn’t alone, but also incredibly frustrating to hear that so many people were affected. I bet ISP tech support employees had a “wonderful” day today.
So…what exactly happened early this morning to set this whole thing off? Did ASUS’s asd program download some kind of faulty file from their servers that caused it to hang up? Was someone attempting a mass exploit on a vulnerability that was recently patched by ASUS? Did updating the firmware really fix the issue or did it just stop a chain of events that will restart itself again soon?
I don’t know, but here’s what I’ve been able to gather so far. It appears that the file /jffs/asd.log (and /jffs/asd.log.1, which I think is the rolled-over version containing previous entries) on my router was being filled with thousands of lines of the following error message:
1684335272[chknvram_action] Invalid string
The number appears to be a UNIX timestamp, corresponding to 7:54 AM PDT this morning, which is probably right around the time that I finally installed the firmware update. I’m guessing this was constantly being written to this log as soon as the problem began at 3:24 AM.
I also found these interesting messages in /jffs/syslog.log-1 at around the time the connection was first lost:
May 17 03:18:14 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 17 03:18:21 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7687)]retrieve firmware information May 17 03:18:21 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7702)]fimrware update check first time May 17 03:18:21 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7733)]no need to upgrade firmware May 17 03:18:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 17 03:18:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7687)]retrieve firmware information May 17 03:18:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7707)]fimrware update check once May 17 03:19:21 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 17 03:19:21 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7687)]retrieve firmware information May 17 03:19:21 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7707)]fimrware update check once May 17 03:19:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 17 03:19:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7687)]retrieve firmware information May 17 03:19:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7702)]fimrware update check first time May 17 03:19:51 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7733)]no need to upgrade firmware May 17 03:22:58 kernel: CPU: 1 PID: 12870 Comm: touch Tainted: P O 4.1.27 #2 May 17 03:22:58 kernel: Hardware name: Broadcom-v8A (DT) May 17 03:22:58 kernel: task: ffffffc01730eb00 ti: ffffffc0126dc000 task.ti: ffffffc0126dc000 May 17 03:22:58 kernel: PC is at 0xf6dcc90c May 17 03:22:58 kernel: LR is at 0xf6dccfd4 May 17 03:22:58 kernel: pc : [<00000000f6dcc90c>] lr : [<00000000f6dccfd4>] pstate: 400f0010 May 17 03:22:58 kernel: sp : 00000000fff3d154 May 17 03:22:58 kernel: x12: 00000000fff3d188 May 17 03:22:58 kernel: x11: 00000000fff3d1d4 x10: 00000000f76334c0 May 17 03:22:58 kernel: x9 : 00000000fff3d189 x8 : 00000000fff3d184 May 17 03:22:58 kernel: x7 : 000000000000000b x6 : 0000000000000000 May 17 03:22:58 kernel: x5 : 00000000fff3e8bc x4 : 00000000fff3d420 May 17 03:22:58 kernel: x3 : 000000006e69622f x2 : 00000000fff3e8bc May 17 03:22:58 kernel: x1 : 00000000fff3d420 x0 : fffffffffffffff2 May 17 03:23:17 kernel: CPU: 1 PID: 12894 Comm: touch Tainted: P O 4.1.27 #2 May 17 03:23:17 kernel: Hardware name: Broadcom-v8A (DT) May 17 03:23:17 kernel: task: ffffffc01730eb00 ti: ffffffc01151c000 task.ti: ffffffc01151c000 May 17 03:23:17 kernel: PC is at 0xf6dcc90c May 17 03:23:17 kernel: LR is at 0xf6dccfd4 May 17 03:23:17 kernel: pc : [<00000000f6dcc90c>] lr : [<00000000f6dccfd4>] pstate: 400f0010 May 17 03:23:17 kernel: sp : 00000000fff3d154 May 17 03:23:17 kernel: x12: 00000000fff3d188 May 17 03:23:17 kernel: x11: 00000000fff3d1d4 x10: 00000000f76334c0 May 17 03:23:17 kernel: x9 : 00000000fff3d189 x8 : 00000000fff3d184 May 17 03:23:17 kernel: x7 : 000000000000000b x6 : 0000000000000000 May 17 03:23:17 kernel: x5 : 00000000fff3e8bc x4 : 00000000fff3d420 May 17 03:23:17 kernel: x3 : 000000006e69622f x2 : 00000000fff3e8bc May 17 03:23:17 kernel: x1 : 00000000fff3d420 x0 : fffffffffffffff4 May 17 03:23:51 watchdog: restart_firewall due DST time changed(1->0) May 17 03:23:51 rc_service: watchdog 1807:notify_rc restart_firewall May 17 03:23:51 rc_service: watchdog 1807:notify_rc restart_wan May 17 03:23:51 rc_service: waitting "restart_firewall" via watchdog ... May 17 03:23:51 firewall: apply rules error(2857) May 17 03:23:51 firewall: apply rules error(2892) May 17 03:23:51 services: apply rules error(17779) May 17 03:23:51 firewall: apply rules error(4580) May 17 03:23:52 miniupnpd[5322]: shutting down MiniUPnPd May 17 03:23:53 DualWAN: skip single wan wan_led_control - WANRED off May 17 03:23:58 dnsmasq-dhcp[5503]: failed to write /var/lib/misc/dnsmasq.leases: No space left on device (retry in 60s)
So it did an auto firmware update check at 3:18 AM (again, I have auto updates turned off) and then 3 minutes later, the kernel got mad about something. As you can see at the bottom, other things started to fail too. The dnsmasq error clearly indicates that there was no space available in /var/lib/misc. /var is mounted as a tmpfs, so I think this means the router was out of RAM.
It looks like the auto firmware check is pretty common to see every morning in the log, although it did fail on Monday if that’s relevant:
May 15 03:18:05 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 15 03:19:17 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7737)]could not retrieve firmware information: webs_state_update = 1, webs_state_error = 1, webs_state_dl_error = 0, webs_state_info.len = 23 May 15 03:19:46 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 15 03:21:11 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7737)]could not retrieve firmware information: webs_state_update = 1, webs_state_error = 1, webs_state_dl_error = 0, webs_state_info.len = 23 May 15 03:21:40 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7669)]do webs_update May 15 03:22:44 WATCHDOG: [FAUPGRADE][auto_firmware_check:(7737)]could not retrieve firmware information: webs_state_update = 1, webs_state_error = 1, webs_state_dl_error = 0, webs_state_info.len = 23
It’s unclear to me if the auto firmware check is even related to when the problem first started. Maybe it’s one of several periodic tasks that run at around that time? It looks like typically I see this message about 30-40 minutes after the auto firmware check:
ahs: [read_json]Update ahs JSON file.
This seems to be related to “ASUS Healing System” which I don’t even know if I have enabled or not. I also saw the auto update check and ahs JSON message show up again in the log after my first router reboot, at around 6:47 AM. Not too long after that, the dnsmasq.leases “no space left on device” error happened again, so I think it was out of RAM again — perhaps asd was gobbling up CPU time and RAM.
Does anyone have any further info on what happened here? My two theories are: either asd downloaded a bad file from ASUS that caused it to crash, or someone was exploiting a vulnerability that was patched in one of ASUS’s two most recent updates for my RT-AC86U router. If it’s the latter, it’s obviously my bad for not keeping my firmware up to date, but I can’t help but wonder if an automatic file download in the middle of the night caused it. I’m very curious about what happened! Did anyone with an ASUS router not run into a similar problem today?
from Hacker News https://ift.tt/OSpJ0nP
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.