Reboot/Post hang #64

Closed
fhloston opened this issue Feb 27, 2018 · 172 comments

@fhloston

fhloston commented Feb 27, 2018

The apu2c4 with BIOS 4.6.1, 4.6.5, or 4.6.6 occasionally hangs during soft reboot/POST.

Hard reset clears fault.

4.0.x and 4.5.5 do not show this behaviour.

Reproduce with a simple shell script in TinyCore:
date >>reboot.log;sleep 60;reboot
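
A slightly more defensive variant of the one-liner above, as a minimal sketch: it stops after a fixed number of boots so the box does not reboot forever. The names `reboot_cycle`, `LOG`, `MAX`, `DELAY` and `REBOOT_CMD` are assumptions made for illustration and are not part of the original script; the function would still have to be called from whatever autostart hook the OS provides.

```shell
#!/bin/sh
# Hypothetical helper; all variable names below are illustrative assumptions.
LOG="${LOG:-reboot.log}"            # one timestamp line per boot
MAX="${MAX:-150}"                   # stop after this many boots
DELAY="${DELAY:-60}"                # seconds to wait before rebooting
REBOOT_CMD="${REBOOT_CMD:-reboot}"  # overridable so the logic can be dry-run

reboot_cycle() {
    date >> "$LOG"
    # grep -c '' counts lines, i.e. the boots recorded so far.
    count=$(grep -c '' "$LOG")
    if [ "$count" -ge "$MAX" ]; then
        echo "reached $count boots, stopping"
        return 0
    fi
    sleep "$DELAY"
    $REBOOT_CMD
}
```

With this variant a hang shows up as a log whose line count stops growing before `MAX` is reached. Setting `REBOOT_CMD="echo would reboot"` and `DELAY=0` allows a dry run without restarting the machine.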

Update: still present in 4.6.8

@dacmot

dacmot commented Mar 2, 2018

Did you try this procedure? Specifically the part about MSI.

@fhloston
Author

fhloston commented Mar 2, 2018

@dacmot I do not see the relevance here?

Perhaps I wasn't precise enough before.

The issue is that the apu2c4 hangs during POST when it is warm rebooted from a running OS. It does not matter whether that OS is TinyCore, Debian or OPNsense, nor whether the boot media is a USB stick or a SATA drive.

When the system hangs, only a hard reset gets the unit back into a working condition. The sp5100_tco watchdog does not help either, even when J2 5-6 is shorted.

@a1bert01

a1bert01 commented Mar 4, 2018

4.0.7, 4.0.12 SD card not recognized after reboot...

@miczyg1
Member

miczyg1 commented Mar 6, 2018

@a1bert01 what operating system are You running? Are You rebooting from system installed on SD card?

@fhloston thank You for reporting the issue. We will check that and try to fix it.

@a1bert01

a1bert01 commented Mar 6, 2018

Yes, I am using/rebooting from just SD card,
with 4.6.1. bios:

Grub (slackware) hangs without message

PCEngine's apu2-tinycore6.4.img.gz:

reboot: machine restart
PC Engines apu2
coreboot build 08/30/2017
BIOS version v4.6.1
2032 MB DRAM

SeaBIOS (version rel-1.10.2.1)

Press F10 key now for boot menu

Booting from Hard Disk...

SYSLINUX 6.03 20150820 Copyright (C) 1994-2014 H. Peter Anvin et al
Loading vmlinuz... CHS: Error 0c00 reading sector 30852 (2/12/15)
EDD: Error 0c00 reading sector 32900
CHS: Error 0c00 reading sector 35802 (2/90/51)
EDD: Error 0c00 reading sector 37850
ok
Loading core.gz...CHS: Error 0c00 reading sector 2 (0/32/35)
EDD: Error 0c00 reading sector 2050
CHS: Error 0c00 reading sector 929 (0/47/17)
EDD: Error 0c00 reading sector 2977
failed: I/O error

with 4.0.12 bios:
Grub prints:

Restarting system.
machine restart
PC Engines apu2
coreboot build 20170831
BIOS version v4.0.12
2032 MB DRAM

SeaBIOS (version rel-1.10.2.1)

Press F10 key now for boot menu

Booting from Hard Disk...
GRUB loading.
Welcome to GRUB!

error: failure reading sector 0x60b68 from `hd0'.
Entering rescue mode...
grub rescue> ls

grub rescue>

syslinux (tinycore):

reboot: machine restart
PC Engines apu2
coreboot build 20170831
BIOS version v4.0.12
2032 MB DRAM

SeaBIOS (version rel-1.10.2.1)

Press F10 key now for boot menu

Booting from Hard Disk...

SYSLINUX 6.03 EDD 20150820 Copyright (C) 1994-2014 H. Peter Anvin et al

@TheEvilCoder42

TheEvilCoder42 commented Apr 9, 2018

Successfully compiled v4.6.8, after flashing, the issue reported by @fhloston is still present.

A warm reboot causes the apu2c4 to hang just after displaying the BIOS version and before displaying the available RAM.
Only a hard reset (re-plugging the power) leads to a boot..

Edit:

Seems AHCI is related to this issue..

I'm running OPNsense 18.1.6 on a Kingston mSATA.
Since AHCI was enabled for SATA in this version, I thought of disabling the two entries in loader.conf:

  • hint.ahci.0.msi="0"
  • hint.ahci.1.msi="0"

Suddenly, a warm reboot worked instantly.
Since I didn't see any errors while booting, I will keep observing whether any errors occur in the running state.

Edit 2:

Interestingly warm reboot continues to work even if I re-enable those two entries..
I'm confused..

@miczyg1
Member

miczyg1 commented Apr 9, 2018

@TheEvilCoder42 the two loader.conf entries You are referring to were necessary to prevent pfSense from hanging while running or installing, due to a failed write command on the HDD. They affect the SATA controller, which has little to do with reboots and power cycles.

Look at the second @fhloston comment. No matter which OS/quirks/tweaks You use, the problem will exist. We are aware of the issue and are working on it.

@pietrushnic
Member

@fhloston I automated testing of this issue and was able to run over 35 reboot cycles without a hang. I ran the test on:

  • apu2c
  • v4.6.9
  • Debian with 4.14.y kernel
  • booted over iPXE

Overnight I want to run 150 iterations; if I do not hit the bug, we will have to narrow down the differences between our configurations. If you can check 4.6.9 on your side, that would be great.

@fhloston
Author

@pietrushnic cool, will test!

Can you create the v4.6.9 tag in the release manifest repo?

@miczyg1
Member

miczyg1 commented May 24, 2018

@fhloston we are not using manifests anymore for newest releases. If You want to compile firmware by yourself, please use the new pce-fw-builder. Refer to README for usage and build binary using pce-fw-builder v1.0.0 tag.

Binaries are also published on PC Engines github page: https://pcengines.github.io/

@fhloston
Author

@miczyg1 thanks for the headsup, need to create new jenkins job then.

@fhloston
Author

fhloston commented May 24, 2018

@pietrushnic it hangs after 5 reboots here :/

Booting tinycore from lower front usb...

`root@box:/media/TINYCORE# 5 /media/TINYCORE/reboot.log

Syncing all filesystems.
Killing all processes.
Terminating all processes.
Unmounting all filesystems.
Shutdown in progress.

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot
sd 4:0:0:0: [sdc] Synchronizing SCSI cache
sd 1:0:0:0: [sdb] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Synchronizing SCSI cache
reboot: Restarting system
reboot: machine restart
PC Engines apu2
coreboot build 20180511
BIOS version v4.6.9
`

@pietrushnic
Member

@fhloston last time we survived 48 reboots and then I hit an issue with the test framework (specifically Robot Framework and a terminal emulator called pyte), so it was not a failure of the system. I think we have to try to recreate your configuration. What storage are you using to boot? Also, can you link us to the TinyCore you use?

@fhloston
Author

@pietrushnic it's TinyCore 6.4 downloaded from the pcengines website especially for apu2 some time ago to reproduce this bug.

I have only added the reboot script.

@pietrushnic
Member

@fhloston I finally managed to automate Core 6.4 booting over iPXE. I started 50 reboots and will get back with results. @miczyg1 has some updates suggesting that this may be related more to storage, so booting over iPXE may not be the correct reproduction procedure.

@fhloston
Author

If you manage to get 50 reboots without a hang via iPXE boot, I would follow your assumption that iPXE does not trigger the problem.

My testcase is an external USB stick with tinycore 6.4.

Initially i reproduced also via FBSD/Debian on SATA.

I have never attempted to reproduce with iPXE boot.

I have started to try with tinycore and SD-Card but stopped due to the different device naming and the then not working autostart.sh.

TL;DR: USB/SATA and Debian/FreeBSD/Tinycore to reproduce

@miczyg1
Member

miczyg1 commented May 29, 2018

@pietrushnic unfortunately booting Debian over iPXE and rebooting hung the platform after 5 repeats. So I would assume it is not a storage problem, but an overall platform issue.

@fhloston
Author

If you boot via iPXE do you disconnect local storage first - if there is any in the first place?

@miczyg1
Member

miczyg1 commented May 29, 2018

@fhloston I do not disconnect the storage. Actually, during the test I had two USB sticks plugged into the front connector. I will try a run without any media.

@pietrushnic
Member

@miczyg1 yeah, maybe the difference is that we have some media attached; I think I was on a setup without any storage connected.

@miczyg1
Member

miczyg1 commented May 29, 2018

@pietrushnic @fhloston the platform survived 50 reboots from Debian booted via iPXE without any media connected. Storage media might be a good lead on this issue.

@pietrushnic pietrushnic added 4.8.0.2 and removed 4.6.10 labels Jun 5, 2018
@NanoCaiordo
Contributor

NanoCaiordo commented Aug 11, 2018

Hi, I have the same issue from mSata with v4.8.0.2 on reboots from pfSense.
I have used the following tweaks but none worked:

hint.ahci.0.msi="0"
hint.ahci.1.msi="0"
hw.acpi.handle_reboot="0" # from system tunable
hw.acpi.disable_on_reboot="1" # from system tunable

but it still hangs after the bios version message.

Cold boots are OK and repeated reboots within a few minutes are OK, but a reboot after the system has been on for a while still hangs.

@pietrushnic
Member

@miczyg1 is this the same as this Fch OEM config in INIT RESET Done? @krystian-hebel FYI

@miczyg1 can anyone validate whether this was the case on legacy?

@Veldkornet
Contributor

Veldkornet commented Aug 19, 2018

FYI, I also have this problem on the newer firmwares. Currently on 4.8.0.2.
Never experienced this problem on the older firmwares (< 4.0.11)

@pietrushnic
Member

@Veldkornet thanks. I really don't like that legacy firmware didn't have that issue. @miczyg1 @krystian-hebel we should take a look at it as soon as we get free cycles.

@miczyg1
Member

miczyg1 commented Feb 4, 2019

@Firefishy as stated in the issue: pcengines/pcengines.github.io#22
Binaries will be available EOD.

@b-bittner

@krystian-hebel thanks for your reply, but this seems to be a bit too risky (besides the fact that I don't know how to do it). I guess that if nobody else comes up with an idea, I will plan for a manual reset.

@miczyg1
Member

miczyg1 commented Feb 4, 2019

@b-bittner every firmware update should be finished with a full power cycle. A reboot does not cut off power from the memory and processor, so some leftovers from the previous boot may remain in place. It is strongly advised to turn off the platform after flashing and replug the DC power supply.

@pietrushnic
Member

@miczyg1 is this highlighted in the documentation? I think we always need a power cycle after a firmware update. I understand that users expect things to work after a reboot, but this is not under our control. Doing a warm boot or just a reboot may cause unexpected behavior (considering all the weird states we can enter with AGESA and old memory content). It should be clear to all users doing an update that a cold boot is needed after a firmware update (most normal PCs and laptops will force the cold boot path after a firmware update). I'm not sure if we can force the cold boot path programmatically after flashing? @krystian-hebel ?

@soder10

soder10 commented Feb 4, 2019

@ALL: I don't think the firmware update instructions are clear enough to properly stress the need for a full power cycle after a successful firmware update.

@krystian-hebel
Member

Dear community, I have another request.

As @b-bittner noticed it is sometimes necessary to update firmware remotely. We might have found a way to do this, but we can't test it ourselves due to lack of reproducibility of this issue. If someone affected by this issue would be so kind to test:

  • flash with firmware without SVI2 fix (something older than 4.9.0.1, but not legacy)
  • reboot (with full power cycle if possible)
  • wait for the issue to become apparent (stuck CPU frequency)
  • flash with firmware from here
  • before rebooting do:
    • setpci -s 18.0 6c.L=10:10 (Linux)
    • pciconf -w pci0:24:0 0x6c 0x580ffe10 (FreeBSD/pfSense)
  • reboot without power cycle

The platform should print "Forcing cold boot path" on this reboot, go to S5 state for 3-5 seconds (do not panic here) and then do another reboot by itself. This will (hopefully) reset the voltage regulator and allow the platform to boot properly, but as this particular issue is very hardware-related we would like to test it before claiming that it works. What we need to know is simply whether the platform hangs or not.
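
For a remote update script, the OS-specific register write from the steps above could be selected automatically. The following is a minimal sketch: `cold_boot_cmd` is a hypothetical helper name, and it only prints the command rather than running it. The two commands are quoted verbatim from the list above; note that the FreeBSD variant writes a whole register value rather than a single bit, so it presumably assumes the register otherwise reads 0x580ffe00.

```shell
#!/bin/sh
# Print (do not run) the register write from the steps above for a given OS.
# cold_boot_cmd is a hypothetical helper name used only for illustration.
cold_boot_cmd() {
    os="${1:-$(uname -s)}"   # default to the running kernel name
    case "$os" in
        Linux)   echo "setpci -s 18.0 6c.L=10:10" ;;
        FreeBSD) echo "pciconf -w pci0:24:0 0x6c 0x580ffe10" ;;
        *)       echo "unsupported OS: $os" >&2; return 1 ;;
    esac
}
```

Running `cold_boot_cmd` without arguments prints the command for the current system; the output can be reviewed before it is executed as root.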

For curious:

Images were built from the coldboot branch, with the SVI2 fix in the same place it was in legacy and 4.9* releases. In this branch the fix was moved before the call to AmdInitReset, to be applied before the point of the previous hangs, but for research I reverted that change. The included change tests whether the ColdRstDet bit in D18F0x6C (one of the registers that hold their state even after reset) was set and performs a full reset (going into S5 for 3-5 seconds) if it was. It also clears some registers that AGESA checks to tell cold boot from warm boot. This bit can be set with setpci/pciconf, which makes it controllable from within an already running OS. We've tried to put this check as early as possible, but there is still some code before it. This means that if another hang appears earlier in the future, this change cannot prevent it. If this approach works as it should, it will be included in following releases and added to the documentation as part of the flashing process.
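
To check whether the bit is set in a register value read back from the platform, a small test like the one below could be used. This is a sketch under one assumption: that ColdRstDet is bit 4 (mask 0x10), which is inferred from the `10:10` value:mask pair in the setpci command in this thread. `cold_rst_det_set` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if bit 4 (mask 0x10) is set in a 32-bit register value given as
# hex digits without the 0x prefix, e.g. 580ffe10.
# Assumption: ColdRstDet is bit 4, as implied by "6c.L=10:10" above.
cold_rst_det_set() {
    [ $(( 0x$1 & 0x10 )) -ne 0 ]
}
```

On Linux, `setpci -s 18.0 6c.L` prints the current register value in hex, so `cold_rst_det_set "$(setpci -s 18.0 6c.L)"` would report whether the next reboot is expected to take the forced cold boot path.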

Note that some registers are not reset to their cold-boot values nonetheless, and some can't be changed manually even when they are documented as read-write, not lockable. Because of that it is impossible to force a state that looks 100% like cold boot. State of peripheral devices can make it even more difficult. Because of that we strongly suggest to perform full power cycle after flashing if possible.

@schweizp

I am not sure if my apu2d4 suffers from the same problem, or if it is something completely different...

Symptoms:

  • the board hangs directly after the BIOS version information, e.g. "BIOS version v4.6.1"
  • happens at first coldboot after "longer" downtime
  • happens reproducibly at warmboot (initiated from shell, Web-GUI, BIOS-setup)
  • a "fast" coldboot (unplug powersupply, and immediately replug) fixes the problem

Already tried:

  • different mainline versions (newest to oldest available on the website)
  • different legacy versions (newest to very old with date as version information)
  • long tests under load -> board is stable
  • long memtests -> no faults
  • changed the buffer battery on the board since the voltage was on the low side
  • tried to boot from m2-ssd, sd-card, usb-stick

--> no effect on the above mentioned symptoms

Any ideas what could cause this behaviour?

@miczyg1
Member

miczyg1 commented Feb 28, 2019

@schweizp the symptoms match the problem described in this GitHub issue. Starting with v4.9.0.1 release, the problem should be gone. Try updating the firmware to v4.9.0.1 or newer.

The problem was caused by a communication timeout between the processor and the power management controller (PMIC). The processor requested a voltage change from the PMIC because of a frequency change, but the PMIC did not report that the voltage change was complete. This left the frequency stuck and caused the reboot hangs. As mentioned before, this is fixed in BIOS v4.9.0.1.
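
One rough way to look for the stuck-frequency symptom from a running OS is to sample the CPU frequency repeatedly under varying load and count the distinct values. The sketch below only does the counting; `distinct_freqs` is a hypothetical helper name, and a single value on an idle system is normal, so this is a hint rather than a diagnosis.

```shell
#!/bin/sh
# Read one frequency sample per line on stdin and print the number of
# distinct values seen. distinct_freqs is a hypothetical helper name.
distinct_freqs() {
    sort -u | grep -c ''
}
```

On Linux the samples could come from `/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq` (when a cpufreq driver is loaded), and on FreeBSD from `sysctl -n dev.cpu.0.freq`. A result of 1 while the load is clearly changing would be consistent with the behaviour described above.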

I would also appreciate if You could try the procedure above that @krystian-hebel described. We would like to make the firmware update more robust and ensure users can do the update remotely from versions < v4.9.0.1 to v4.9.0.1 or newer. If You decide to help us, please attach the boot log from coreboot.

@schweizp

@miczyg1
Thank you for the fast reply.
It seems that I was not entirely clear in my problem description (although I tried very hard ;-), sorry for that.
The described problem is also present with the newest firmware versions (v4.9.0.1 & v4.9.0.2).
That is the reason I was not entirely sure that it is the same problem addressed in this thread, since it should already be solved in these two newest versions.
I will try the above procedure as soon as I get time (probably tomorrow).

@schweizp

schweizp commented Feb 28, 2019

@miczyg1 Ok, here is what I did:

First file is the dmesg.boot from my box with v4.9.0.2
In this state the box hangs after warmboot, and can be recovered with a coldboot.
4.9.0.2.log.txt

Second file is the dmesg.boot from my box with v4.8.0.7
Same behaviour as above.
4.8.0.7.log.txt

Third file is the dmesg.boot from my box with the "special" coldboot rom from @krystian-hebel.
After flashing the rom-file I entered the pciconf commands given above and initiated a warmboot.
The system booted without hanging!
Then I initiated a warmboot without the pciconf command and the system hung again.
A third try with the pciconf command and the warmboot worked as it should.
So for the moment I have a method to remotely boot my system. Are there any negative aspects to be expected from this special rom? Can it be used as "productive" firmware?
4.9.0.2_coldboot.txt

I hope my description is clear enough, and it can help troubleshooting the issue.

@krystian-hebel
Member

@schweizp interesting. The binaries that I shared are somewhat modified on the top of v4.9.0.1, so they have SVI2 fix included, and because of that they shouldn't hang on later warm boots (after initial power cycle). Do you say that your platform hangs even though it booted correctly the first time, without re-flashing in the meantime?

Also, dmesg isn't very useful, it is the coreboot boot logs that we are interested in, but they can be obtained only through serial. Platform hangs way before OS starts.

@pietrushnic
Member

@krystian-hebel @miczyg1 I assume it would be useful to have OS image or binary (if package not available in distro) to perform cbmem console dump. We have to add to documentation that coreboot log from cbmem would be most useful.

@miczyg1 I wonder if we can have logging to cbmem and not throw things on serial?

@miczyg1
Member

miczyg1 commented Mar 3, 2019

@pietrushnic we have native support for the cbmem log in mainline releases. It is just a matter of compiling and running the util. But if the platform hangs, we lose the log because we never reach the OS.
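
Since the log is only reachable after a successful boot, one option is to save it on every boot so that the last good log is available for comparison. A minimal sketch follows, assuming the coreboot `cbmem` utility is installed and run as root; `save_cbmem_log`, the `CBMEM` variable and the file naming are hypothetical. `cbmem -c` prints the cbmem console.

```shell
#!/bin/sh
# Save the coreboot console of the current boot to a timestamped file.
# save_cbmem_log, CBMEM and the file name pattern are illustrative
# assumptions; "cbmem -c" is the console dump option of coreboot's utility.
CBMEM="${CBMEM:-cbmem}"

save_cbmem_log() {
    dir="${1:-.}"
    out="$dir/coreboot-$(date +%Y%m%d-%H%M%S).log"
    if $CBMEM -c > "$out"; then
        echo "$out"                  # report where the log was written
    else
        rm -f "$out"                 # do not leave an empty file behind
        echo "could not read cbmem console" >&2
        return 1
    fi
}
```

This does not help with the hanging boot itself, which still needs a serial capture, but it gives a baseline log from the same unit to attach to a report.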

@pietrushnic
Member

@miczyg1 we need a guide for bug reporting for users. Information about cbmem should be in README or some other place where we can point.

@schweizp

schweizp commented Mar 4, 2019

> @schweizp interesting. The binaries that I shared are somewhat modified on the top of v4.9.0.1, so they have SVI2 fix included, and because of that they shouldn't hang on later warm boots (after initial power cycle). Do you say that your platform hangs even though it booted correctly the first time, without re-flashing in the meantime?

Yes, the behaviour is as you stated it. In the meantime I have decided to return the box to the seller, since I also had problems after longer uptime. The box stopped working, and a reboot was not possible. I suspect some kind of thermal problem: the room where the box was running is not heated, and gets rather cold during winter nights. As stated, a reboot in the cold environment was not possible after the "hang". After taking the box back into a heated room and waiting for the temperature to acclimate, the box started normally again, so the box is going back!

> Also, dmesg isn't very useful, it is the coreboot boot logs that we are interested in, but they can be obtained only through serial. Platform hangs way before OS starts.

I agree with the comments above. Better documentation for the "non-expert" user for troubleshooting would be useful.

@krystian-hebel
Member

> I suspect some kind of thermal problem: the room where the box was running is not heated, and gets rather cold during the winter nights.

I think that high humidity resulting from temperature fluctuations can be the issue too.

@smuellener

We are also facing warm reboot hangs (using PC Engines apu2 boards and BIOS versions <4.9.0.1). After updating to v4.9.0.1 we have not been able to reproduce the issue yet, which is promising. We did an immediate cold start after the BIOS update.

@miczyg1: You write that a cold start (power cycle) is needed after updating (flashing) to v4.9.0.1 in order for the issue to be resolved. But: We need to do the update remotely. Is it ok to do the update and continue running without a cold start and postpone the cold start to later? We understand that in this state on a warm reboot the device can still hang. But if a cold start is done then, the issue should be resolved, right?

@krystian-hebel: You also need to do the update remotely. Is there anything speaking against this solution?

Of course the important point here is that continue running after updating the bios without an immediate cold start or warm reboot should not lead to any other problems.

@krystian-hebel
Member

krystian-hebel commented Mar 6, 2019

@smuellener flashing new firmware without rebooting is as good as not updating at all. Firmware is read from ROM at boot; once the platform is running, the ROM is not read again and the new code is not executed. The platform must be booted so that the new firmware is loaded and the new code executed. The reason why this particular fix doesn't work after a warm boot is that the fix is applied after the platform is already FUBAR. A cold reset is required to reset the voltage regulator - it is a hardware issue that can't be fixed.

The images that I linked a couple of comments earlier could allow for a safe remote reboot, but apart from @schweizp (who was having another issue, probably not connected with this one) nobody tested it yet. If You have some time for testing it we would be much obliged.

@NanoCaiordo
Contributor

  • flash with firmware without SVI2 fix (something older than 4.9.0.1, but not legacy)

  • reboot (with full power cycle if possible)

  • wait for the issue to become apparent (stuck CPU frequency)

  • flash with firmware from here

  • before rebooting do:

    • setpci -s 18.0 6c.L=10:10 (Linux)
    • pciconf -w pci0:24:0 0x6c 0x580ffe10 (FreeBSD/pfSense)
  • reboot without power cycle

I flashed v4.8.0.6, did a poweroff and unplugged/replugged from the mains. The CPU frequency got stuck after about a minute, so I flashed the coldboot firmware and then, since I'm on pfSense 2.5.0-dev (FreeBSD 12), ran the pciconf command and rebooted. This is what I got:

reboot: Restarting system
PMIOxC0 = 40080400
PC Engines apu2
coreboot build 20190602
BIOS version v4.9.0.2
Before Reset
D18F0x6C: 000ffe00
After Early
D18F0x6C: 580ffe00
SVI2 Wait completion disabled
After Post
D18F0x6C: 580ffe00
4080 MB ECC DRAM
SeaBIOS (version rel-1.12.0.1-0-g393dc9cde5e4)

The system did reboot fine, but as usual the boot order is always changed after flashing... After I fixed the boot order, a reboot gave me the same output as before.

CPU frequency doesn't get stuck anymore and the system reboots fine. Note that every reboot gives me the output above.

Hope it helps.

@krystian-hebel
Member

@NanoCaiordo are you sure that this is output from the first reboot? There should be "Forcing cold boot path" in the log on the first reboot after the setpci/pciconf command. "SVI2 Wait completion disabled" also suggests that it was a later reboot, and not the initial one.

Still, it is indeed good news that it rebooted fine.

@NanoCaiordo
Contributor

@krystian-hebel you're correct. Long story short, the first time I had a couple of mishaps:
(1) I did pciconf before flashing the coldboot rom, and (2) when I went to install flashrom I found out I had ABI misconfiguration issues, because 2.4.5-dev wanted the 2.5.0-dev upgrade before I could install anything else.

I upgraded, made sure the stuck-frequency issue happened again, and this time I flashed the coldboot rom and then did pciconf again.

I could do a fresh 2.4.4-p2 install, flash v4.8.0.6 back on, power cycle and try again... I just need to be sure whether this is the correct path to start fresh again.

@krystian-hebel
Member

@NanoCaiordo the order of pciconf and flashrom doesn't matter, as long as they are both run on the same boot. In fact, now that we know that it is able to reboot, the only thing left to test is whether Forcing cold boot path works - in order to test it, do just pciconf and reboot steps, on the coldboot rom. Confirming that it works on pfSense 2.5.0-dev can be also useful for other users.
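
A captured serial log from that test could be checked mechanically along these lines. This is a sketch: `check_cold_boot_log` is a hypothetical helper name, the marker string is the one the coldboot firmware is expected to print, and reaching the "MB ... DRAM" line is used as a stand-in for "did not hang", since the hang reported in this thread occurs after the BIOS version line and before the RAM line.

```shell
#!/bin/sh
# Check a captured serial log for the forced cold boot marker and for the
# DRAM line that follows the point where the reported hang occurs.
# check_cold_boot_log is a hypothetical helper name.
check_cold_boot_log() {
    log="$1"
    if ! grep -q 'Forcing cold boot path' "$log"; then
        echo "no forced cold boot seen"
        return 1
    fi
    if ! grep -q 'MB .*DRAM' "$log"; then
        echo "cold boot forced, but POST did not reach the DRAM line"
        return 2
    fi
    echo "cold boot forced and POST completed"
}
```

The three return codes make it possible to tell "marker missing" (for example a later reboot rather than the first one) apart from an actual hang.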

@NanoCaiordo
Contributor

There you go:

All buffers synced.
Uptime: 1h9m19s
uhub2: detached
umass0: detached
Forcing cold boot path
PMIOxC0 = 40000400
PMIOxC0 = 40000400
PMIOxC0 = 40200400
PC Engines apu2
coreboot build 20190602
BIOS version v4.9.0.2
Before Reset
D18F0x6C: 000ff800
After Early
D18F0x6C: 580ff800
Disabling SVI2 Wait completion
After Post
D18F0x6C: 580ffe00
4080 MB ECC DRAM

@krystian-hebel
Member

Great, that's exactly what I needed to see.

Thank You.

@miczyg1
Member

miczyg1 commented Jun 11, 2019

@krystian-hebel anything we still need here?

@soder10

soder10 commented Jun 11, 2019

@miczyg1 I know it wasn't me you asked, but guys, please at least add this link (if one of you already took the time and typed it) to pcengines.github.io as "Firmware instructions and reboot"

https://github.com/pcengines/apu2-documentation/blob/master/docs/firmware_flashing.md

@miczyg1
Member

miczyg1 commented Jun 11, 2019

@soder10 of course, can do. Note that everything is open and anyone can contribute (including You). Feel free to open pull requests (requires repository fork) if any part of the documentation is not linked or not user-friendly or not well formatted/designed.

@krystian-hebel
Member

@miczyg1 as there are no further reports of the problem I think it is safe to close this issue.
