More

gluggymug · on April 27, 2016

In the good old days, you'd just turn it off then on again.

The reason management doesn't care is that it isn't important. If a bug eventuates on a web page, the impact is minimal usually.

There are contingencies if a FAQ page is blank like in the article example. The user can visit the contacts page and email their question. Crisis averted.

Most of this web stuff is relatively trivial in terms of impact.

I used to work in ASIC development where the impact of a bug is massive. Hence the verification effort and rigor should be much higher.

gluggymug · on Jan 11, 2016

Uh, tape out doesn't mean you've verified anything properly.

Looking at the various Intel bugs found, I can't help sniggering a little. Verification mistakes happen all the time though. It's just a matter of when the bugs are found: if found later, it has much greater impact.

On the 28nm issue, getting your design to synthesize to 28 nm depends on the architecture and your constraints. Trickier data paths make it harder to get it meeting timing. Where your IOs are placed, where your RAMs are placed and how much you have, each thing can make it harder and harder to synthesize. Adapteva doesn't have much RAM I believe.

nickpsecurity · on Jan 11, 2016

Are you joking? Intel uses (even invented) so much verification tech it's crazy. I'm talking more than any small vendor could hope to use as it would slow them down or take too much brains/training. Any bugs they're having are more likely due to size of their project or how custom the optimizations are. Here's some Intel techniques they use for verification:

https://www7.in.tum.de/um/25/pdf/LimorFix-2.pdf

https://www.cl.cam.ac.uk/~jrh13/slides/nasa-14apr10/slides.p...

IBM's papers on their verification system for POWER processors mention all kinds of optimizations for things like pipelines that make the logic ridiculous. Then they jump through hoops in verification. Yet, they don't hit 3+GHz on good pipeline without that. Undoubtedly, Intel's using similar tricks with similar issues.

Regular ASIC verification doesn't cut it at their level. What they're doing is on another level. It's hard to say what exactly we should expect in terms of errata level given their operating constraints (esp marketing). Only thing I expect is to know clearly the circumstances errata appears so I can avoid it. They let me down...

gluggymug · on Jan 11, 2016

"Are you joking? Intel uses (even invented) so much verification tech it's crazy."

That's kinda why I am sniggering. Intel has a rep for letting bugs go to silicon despite all that stuff. In terms of verification, they dropped the ball probably (unless they found these bugs in verification and decided to tape out anyway).

One area where they fell on their face seems to be AVX. From TFA, "Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior". That's a huge bug. Remember how we discussed the torture testing of floating point on the RISC-V ? Kinda similar issue. A customer wouldn't be happy with a huge bug there.

The big 3 of EDA tools (Cadence, Synopsis, Mentor Graphics) provide their own tools for verifying this stuff called UVM. Like anything, it still relies on the person using the tool. It takes a lot of effort and planning to use this stuff.

Whenever the verification engineer has to create Verification IP to test the IP, there's a chance they create bugs of their own. It's like a golden rule.

That's why I am not a fan of formal methods. Nothing is proven until you have it working in silicon.

nickpsecurity · on Jan 12, 2016

"That's why I am not a fan of formal methods. Nothing is proven until you have it working in silicon."

The one's that did use formal methods all the way did what they were supposed to and usually first pass. They were a mix of academic and defense-related stuff most people can't buy. What I normally see when I look up formal verification in industry is equivalence checking with custom shops also doing protocol verification and certain correctness angles. We've seen lots of what Intel does in their docs. So, that narrows the question down to "Why are these errata in there anyway?"

"The big 3 of EDA tools (Cadence, Synopsis, Mentor Graphics) provide their own tools for verifying this stuff called UVM. Like anything, it still relies on the person using the tool. It takes a lot of effort and planning to use this stuff."

Didn't know about that one. Thanks. The briefs I just Googled sound weaker than Intel's stuff and especially IBM's where presentations cover an incredible amount of specific verifications. Like you said, what one puts in determines what one gets out of it. So, are Intel just being lax on verification or is their stuff just too complex + optimized to catch all the corner cases?

"Intel has a rep for letting bugs go to silicon despite all that stuff. "

If it's intentional and avoidable, then I think it might be wise in another light. (Or not, but worth considering.) The other light is the Lipner essay on why shipping is more important than highest quality:

https://blogs.microsoft.com/cybertrust/2007/08/23/the-ethics...

That comes from a background where he and Karger did high assurance systems that aimed for perfection and got as close as they could in that period. Kept slipping behind competition in terms of features/speed/price and that would effect market share. So, his prior employer canceled that product with his next one following his recommendation to hit acceptable quality levels, ship, and continuosly improve the product. Wonder if Intel is doing that to keep market dominance?

"Nothing is proven until you have it working in silicon."

This we agree on. The formally verified stuff usually works first try but that's thanks to billions in R&D in tooling & fabs they used. I like knowing a batch of chips performed exactly according to spec when probed during operation. Funny I can't remember what you HW people call that activity.

Anyway, I'd love to email or chat with you sometime to see an insider's view on this topic and fill in some blanks. Reason being experienced ASIC people talk very little vs software people. I'm collecting what tidbits of reality I can for a variety of reasons. Two important ones are giving a head start to people aiming for HW design and boosting high assurance design by determining where the weak points are currently. Really busy right now but maybe later on, eh?

gluggymug · on Jan 12, 2016

"So, are Intel just being lax on verification or is their stuff just too complex + optimized to catch all the corner cases?"

Lax! Of course it's complicated but verification is about finding the corner cases. Intel is driving all these extensions to the ISA. They have a pretty captive CPU market so they are slack. Qualcomm was the same with Wifi SoCs when I was there. Freescale was better but that may be because of the particular projects.

"I like knowing a batch of chips performed exactly according to spec when probed during operation. Funny I can't remember what you HW people call that activity."

ATE (automatic test equipment)?

The thing is that Intel does have the toughest job. They are 28nm with a complicated design, lots of RAM, power is a big issue so clock gating probably everywhere etc. You can't really compare that with a military or an academic chip. The design constraints are much tougher for Intel.

Still, Intel supposedly has all the geniuses and the money. They should have no excuses.

On the formal stuff, I have yet to be convinced. I never just trust the tools, remember?

My email is now in my user profile if you want to discuss further.

nickpsecurity · on Jan 12, 2016

"ATE (automatic test equipment)?"

Yeah. It's one of the only processes I have little data on. Must be straight-forward if I haven't stumbled on many academic papers on the subject. If not, there's a siloing effect happening on publishing side & the term will be helpful.

"The thing is that Intel does have the toughest job. They are 28nm with a complicated design, lots of RAM, power is a big issue so clock gating probably everywhere etc. You can't really compare that with a military or an academic chip. The design constraints are much tougher for Intel."

That's part of my point. Hitting perfection took making the problem a lot simpler than what Intel faced. Same happened in high assurance security where everything TCB was verified down to every trace and state. Took lots of geniuses...

"Still, Intel supposedly has all the geniuses and the money. They should have no excuses."

...but still couldn't solve all the problems, keep up in feature parity, meet profit requirements, etc. So, I'm not so harsh on Intel for now given the complexity & business model. I might change my mind later. For now, we'll just disagree. :)

"On the formal stuff, I have yet to be convinced. I never just trust the tools, remember?"

Now, that I don't get. I've seen, in synthesis/verification work, one 9-transistor analog circuit take (IIRC) 55,000+ equations to represent all its behaviors. Digital ones easier but with tons of multi-layer cells wired up. For custom, they often behave differently. DRC's on modern nodes I read are in 1,000-2,500 range. I'm ignoring OPC because you're handicapped enough at this point. If you don't trust the tools, how are you getting anything done in ASIC land?

You must write really fast plus have a discount card at Office Depot to do it all on pencil and paper. :P

I think you trust tools more than you're letting on. You probably just cross-check tools with tools in various ways like I did with high assurance SW to catch tool-specific issues. That implies a lot of, but not total, trust in the tools. If I'm wrong, I'll be surprised and probably learn something in the process.

There's another method HW people might already use that comes from theorem provers. They know the proving process is complex. It also breaks down into a series of primitive actions in logic. So, they split the activity between a complex, untrusted prover and a simple, easy-to-verify checker. I know state-machine equivalence & even many physical phenomenon can be modeled well in software. I've seen as much with FEC systems. Trick for HW might be turning all the tool outputs into a series of steps like in an audit log that such tools can verify. That might take a hell of a long time, though, but also should be easy to parallelize onto clusters, GPU's, FPGA's, etc. Do it on a macro-cell at a time composing the results like in proof abstraction or abstract interpretation for software.

What you think?

gluggymug · on Jan 12, 2016

"I think you trust tools more than you're letting on. You probably just cross-check tools with tools in various ways like I did with high assurance SW to catch tool-specific issues. That implies a lot of, but not total, trust in the tools. If I'm wrong, I'll be surprised and probably learn something in the process."

While it is true that we use tools to cross check each other, what I mean is that we regularly are manually looking through waveforms. At every stage of the flow, we are checking that our verification infrastructure is actually doing what it should to find bugs. Because a lot of the time, either we've stuffed up using the tool or the tool itself is stuffed.

So much tooling is provided for you. Bus functional models, protocol checkers, etc. You are just cramming it all together and writing your own stuff over the top. There is always a mistake in there somewhere.

"Trick for HW might be turning all the tool outputs into a series of steps like in an audit log that such tools can verify."

This is what happens with the UVM. A checker is written with a SystemVerilog interface by a third party or ourselves. It uses the UVM standard so you can integrate it with other UVM stuff to make even more abstract checkers. If I am writing the prover, I know I probably threw a few bugs in there!

If only it were all parallelised because it is slow as hell.

nickpsecurity · on Jan 13, 2016

"what I mean is that we regularly are manually looking through waveforms. At every stage of the flow, we are checking that our verification infrastructure is actually doing what it should to find bugs. Because a lot of the time, either we've stuffed up using the tool or the tool itself is stuffed."

Waveform-based verification is something I know nothing about. I haven't seen it in any paper I've looked at. Is that what people do in logic analyzers and such? Do you have a link to a free reference discussing what people do with that stuff and how it's used to verify digital designs? I really should have this info in mind and on hand if you all rely on it more than verification tools.

"This is what happens with the UVM. A checker is written with a SystemVerilog interface by a third party or ourselves. "

That makes sense.

"If I am writing the prover, I know I probably threw a few bugs in there!"

I like that you're realistic. It's how I used to look at code on a complex project. Even with Correct-by-Construction, I didn't get to feel safe with my code: only wonder how obscure or unnecessarily simple a problem was left in it. I need to make a Philosoraptor meme along the lines of: "Do we create coding schemes or does the code scheme against us?" Haha.

"If only it were all parallelised because it is slow as hell."

There's the opportunity. Whether it can be acted on who knows. I do know so far that hardware is many blocks strung together with all kinds of tests that should be parallelizable. Now you've reinforced this potential in my mind. I have a trick for this but I'm holding off publishing it for now. Let's just say it's easier to parallelize stuff if one doesn't force their implementation to be inherently sequential or even tied to CPU's. And there's a little-known, albeit alpha-quality, way of doing both at once. :)

gluggymug · on Jan 13, 2016

"Waveform-based verification is something I know nothing about. I haven't seen it in any paper I've looked at. Is that what people do in logic analyzers and such? Do you have a link to a free reference discussing what people do with that stuff and how it's used to verify digital designs? I really should have this info in mind and on hand if you all rely on it more than verification tools."

Yeah, waveforms from a logic analyzer are mimicked by simulator tools.

Not sure about free references. Just googling around I found this about using logic analyzers : http://www.eetimes.com/document.asp?doc_id=1274572

For example, page 3 shows a RAM timing diagram. Like any good spec, the interface from one module to another is defined via a timing diagram. We build our UVM checkers and monitors to detect these memory transactions based on the sequences specified. When a transaction occurs it triggers a UVM event which in turn can be observed by other monitors/checkers or it can create other events or record the event to a log file etc.

We build our verification infrastructure to automatically check transactions behave as specified. However knowing I can't trust my own work, I manually check the waveforms to see whether the infrastructure is performing correctly.

"Let's just say it's easier to parallelize stuff if one doesn't force their implementation to be inherently sequential or even tied to CPU's."

Sounds interesting. I don't know much about how it's all implemented in the simulator.

yosefk · on Jan 11, 2016

Well, yeah, I'm just saying it wasn't a picnic at 40nm or 90nm, either. It didn't go from a walk in the park to a nightmare at 28nm.

Also I believe all the bugs in TFA are logical bugs where the circuit would have misbehaved in a logical simulation, not the kind where the circuit "flips zeros to ones" or vice versa because of physical implementation issues. In this sense advanced nodes make things worse only indirectly by enabling larger, more complex designs.

gluggymug · on Jan 11, 2016

Agreed. The bugs seem logical. It doesn't reflect well on the Intel verification effort if they were logical though. It means they didn't verify properly.

One possible guess/excuse-for-stuffing-up is that tools don't always simulate correctly. There can be weird gotchas where simulation models and reality don't match.

gluggymug · on Jan 10, 2016

Why do you want to do that? Isn't there a hard ARM core on the Zynq 7000s ?

gluggymug · on Jan 7, 2016

I don't think the RISC-V work is a good example. It suffers from some of the problems that mdwelsh is worried about.

It's aimed at a real world problem but their solution is not good.

A couple of days ago, someone asked where the verification infrastructure was on https://news.ycombinator.com/item?id=10831601 . So I took another look around and found it was pretty much unchanged from when I looked last time. There is almost nothing there. It is not up to industry standards, to put it lightly.

It's not just the verification aspect that is weak either. On the design side, they only have docs on the ISA. For SOC work, you are essentially given no docs. Then in another slap in the face, the alternative is to look for code to read but the code is in Scala. Basically only helping those who went to Berkley or something.

It is something that seems relevant but if you were to try using it most engineers would have a pretty hard time.

nickpsecurity · on Jan 7, 2016

That I recall, the RISC-V instruction set was created by looking at existing RISC instructions, industry demands, and so on. The result was a pretty good baseline that was unencumbered by patents or I.P. restrictions. From there, simulators and reference hardware emerged. Unlike many toys, the Rocket CPU was designed and prototyped with a reasonable flow on 45nm and 28nm. Many others followed through with variants for embedded and server applications with prior MIPS and SPARC work showing security mods will be next.

Them not having every industrial tool available doesn't change the fact that the research, from ISA design to tools developed, was quite practical and with high potential for adoption in industry. An industry that rejects almost everything out of academia if we're talking replacing x86 or ARM. Some support for my hypothesis comes from the fact that all kinds of academics are building on it and major industry players just committed support.

Is it ideal? No. I usually recommend Gaisler's SPARC work, Oracle/Fujitsu/IBM for high-end, Cavium's Octeons for RISC + accelerators, and some others as more ideal. Yet, it was a smart start that could easily become those and with some components made already. Also progressing faster on that than anything else.

gluggymug · on Jan 8, 2016

The flow is not good IMO.

They haven't followed engineering practices which is one of the issues mdwelsh was talking about.

If they've synthesized to 45nm and 28nm, where's all their synthesis stuff - constraints etc.?

They have no back end stuff, very little docs, almost no tests, almost no verification infrastructure.

nickpsecurity · on Jan 8, 2016

Hmm. Im clearly not an ASIC guy so I appreciate the tip on this. News to me. Ill try to look into it.

Any link you have where people mention these and any other issues?

gluggymug · on Jan 8, 2016

Maybe I was a bit harsh with the "almost no tests". They have some tests.

Someone name fmarch asked on https://news.ycombinator.com/item?id=10831601 about verification against the ISA model.

It possibly can be done via a torture tester apparently, https://github.com/ucb-bar/riscv-torture , but taking a quick look I don't think it handles loops, interrupts, floating point instructions etc.

nickpsecurity · on Jan 8, 2016

There didn't seem to be a lot in there but I don't know Scala. I wish it was scripted in Lua or something with the Scala doing execution and analysis. Make it easier for others to follow.

Doesn't seem nearly as thorough as what I've read in ASIC papers on verification. They did (co-simulation?), equivalence, gate-level testing, all kinds of stuff. Plus, you did it for a living so I take your word there. I do hope they have some other stuff somewhere if they're doing tapeouts at 28nm. Hard to imagine unless they just really trust the synthesis and formal verification tools.

The flow is here:

http://www.cs.berkeley.edu/~yunsup/papers/riscv-esscirc2014....

Are those tools and techniques good enough to get first pass if the Chisel output was good enough to start with? Would it work in normal cases until it hits corner cases or has physical failures?

gluggymug · on Jan 8, 2016

Interesting paper. It sounds good until you look for the actual work. With a possibly limited amount of testing, you can't be sure of anything. In verification, you can never just trust the tools. With no code coverage numbers, how do I know how thorough the existing tests are? The tests themselves have no docs.

The torture test page said it still needed support for floating point instructions. That kinda says, they did no torture tests of floating point instructions. I wouldn't be happy with that. Same goes for loops. Etc.

You have to think about physical failures as well: the paper mentions various RAMs in the 45 nm processor. You should have BIST for those and Design For Test module/s. Otherwise you have no way to test for defects.

nickpsecurity · on Jan 8, 2016

Yeah, that all sounds familiar from my research. Especially floating point given some famous recalls. Disturbing if it's missing. I'll try to remember to get in contact with them. Overdue on doing that anyway.

gluggymug · on Jan 4, 2016

Check out : https://github.com/ucb-bar/rocket-chip

Last time I complained here about the verification infrastructure of this project, someone gave that link.

I didn't like much of what I could find there.

What exactly are you trying to do?

fmarch · on Jan 4, 2016

I want to at the minimum compare the architectural state of the implementation on every retire with the ISA simulator (Spike).

In addition it would be good to have a set of assertions to maintain the sanity (read functional correctness) while experimenting with the design.

How are these implementations verified right now ? All I see are few assertions in chisel and small set of tests.

_chris_ · on Jan 4, 2016

You can get a commit log (PC, inst, write-back address, write-back data) from Rocket to compare against Spike's commit log. It's not documented because the verification story is still in flux, and the commit log is fairly manual (since there will be many false positives).

Comparing a real CPU against an ISA simulator is VERY HARD. There's counter instructions, there's interrupts, timers will differ, multi-core will exhibit different (correct) answers, Rocket has out-of-order write-back+early commit, floating point registers are 65-bit recoded values, some (required) ambiguity in the spec that can't be reconciled easily (e.g., storing single-precision FP values using FSD puts undefined values in memory, the only requirement being that the value is properly restored by a corresponding FLD).

We also use a torture tester that we'll open source Soon (tm).

fmarch · on Jan 5, 2016

Thanks for the information. I understand its a hard problem but an essential one that needs a solution. It needs to be supported both in Chisel and with appropriate infrastructure to test and compare with the golden ISA model. Is there any one in the community who is actively working on the verification story ?

_chris_ · on Jan 6, 2016

> Is there any one in the community who is actively working on the verification story ?

Not sure. I'd look out for videos to show up at the RISC-V workshop that's ongoing (http://riscv.org/workshop-jan2016.html).

The problem is verification is where the $$$ is, so even amongst people sharing their CPU source code, they're less willing to share the true value-provided of their efforts. A debug spec is being developed and will be added to Rocket-chip to make this problem easier.

With that said, MIT gave a good talk at the RISC-V workshop about their works on verification, and we open sourced our torture tester at (http://riscv.org/workshop-jan2016.html).

_chris_ · on Jan 7, 2016

https://github.com/ucb-bar/riscv-torture, stupid copy/paste.

gluggymug · on Jan 4, 2016

Not a lot of verification that I could find.

The tests are not extensive. Just hand written assembly code testing one thing at a time AFAIK. As someone who used to lead ASIC verification projects for a living, I expected a lot more at a minimum.

I don't know what the Chisel stuff checks. Chisel doesn't do anything at the simulation stage I believe.

I guess if you are just playing around with CPU designs, you can use this stuff.

I would never sign off on going to tape out with just this stuff though. Apparently they've taped out over 11 times though!

You want to run the RTL and compare against the ISA simulator? I think you're on your own...

gluggymug · on Dec 27, 2015

I think the story was that these etudes were written to be impossible to play hence people got a kick out of seeing someone give it a go.

gluggymug · on Dec 22, 2015

Have you just started working or something?

As a grad of a Go8 uni and someone who has worked a couple of decades, I wouldn't worry about this signal stuff. If you want to improve yourself you can do study but it doesn't mean as much to employment as you think.

Maybe in a government job you have a lot of BS from higher up idiots but not in private industry. A year of relevant full time work is like double the value of a year of study when we are evaluating job applicants IMO.

eli_gottlieb · on Dec 23, 2015

Eh. I got my undergrad at a second-tier university and my MSc at a top-tier one. There's definitely been a perceptible difference in how people treat me.

Futurebot · on Dec 24, 2015

Definitely an underappreciated part of having a prestigious degree(s). I've written about it elsewhere with regards to having no degree vs having one, but it applies to MS vs BS and standard state school vs top-tier just as well.

It's another form of social proof, and it affects not just whether or not you get hired, but how people view you, speak to you, and whether / how much they listen to or consider your opinion. Of course, many people don't give degrees a second thought, but for every one of those, there are plenty who do.

gluggymug · on Dec 23, 2015

Did you work a year full time in a relevant field before you did your masters?

I am saying I would rather employee someone who had worked for a year in my field than someone who had done a Masters straight after their Bachelors.

gluggymug · on Dec 21, 2015

The target is the iCE40 HX8K according to the slides (p17) not the icestick unfortunately.

It uses 1521 LUTs with Yosys. The icestick is a HX1K which has 1280 LUTs I think.

gluggymug · on Dec 19, 2015

That's bad advice.

Just going to cause resentment IMO.

The concept of double checking your own work doesn't really fly in engineering work. Not that everyone shouldn't do it but it doesn't tend to add much value to the end product. It is human nature to overlook your own mistakes.

In HW development, separate people verify and validate everything. The cost of a mistake getting through is too high to just hope it is all good.

The point is that mistakes will always happen. You shouldn't blame staff for being human but the mistakes can't get out the door. If it's that important, a second set of eyes needs to be there IMO.

However since this is a post office, really the important thing for the customer is to get in and out as quick as they can. The major complaint is probably the waiting time. So in one sense the employees are helping customer satisfaction by getting them processed quickly!

gluggymug · on Dec 19, 2015

I agree the article is more about stupid management but micromanagement is another aspect of stupid management.

If there are over-detailed micro-goals and accompanying micro-status updates on everything, it devalues the employee. The goals have to have significance to the end product.

When managers sweat over details that have no meaning they lose their authority. It leads to bad morale.

HN For You