scott_wilson46's comments

Monte Carlo sims for options pricing? I've done this before on FPGA, might have a go at doing it for this instance as a fun exercise to test the concept!


Not sure that makes sense with the offer Amazon has. The machines are huge, so either you're pricing a huge number of options at very high speed (which you'd probably do in-house with FPGAs that you own), or you'd be much cheaper off with a good machine locally. I've never found MC sims to be a time bottleneck, but YMMV I guess?


I've heard (although admittedly never seen in practice) that some places take a long time for this sort of thing (running over a cluster of computers overnight). If you could do the same job on a single F1 instance in, say, an hour then I think that would be compelling! Bear in mind that simple experiments I did showed an improvement of around 100x over a GPU for this sort of task.
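For reference, the kernel itself is tiny. Here's a minimal sketch in Python/NumPy of the sort of Monte Carlo pricer I mean (a plain European call under GBM; all parameters here are made up for illustration):

  import numpy as np

  # Price a European call by Monte Carlo under risk-neutral GBM.
  # All parameters are illustrative, not from any real trade.
  def mc_call_price(s0, strike, rate, vol, t, n_paths, seed=0):
      z = np.random.default_rng(seed).standard_normal(n_paths)
      # terminal prices: one step suffices for a path-independent payoff
      st = s0 * np.exp((rate - 0.5 * vol**2) * t + vol * np.sqrt(t) * z)
      payoff = np.maximum(st - strike, 0.0)
      return np.exp(-rate * t) * payoff.mean()   # discounted expectation

  print(mc_call_price(s0=100, strike=105, rate=0.01, vol=0.2, t=1.0, n_paths=10**6))

On an FPGA you'd pipeline the path loop and swap the RNG for something like a Tausworthe generator, but the structure is the same.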


This all seems highly sensible advice to me!


It should be possible to write the majority of the code for an FPGA in a generic fashion and get the tools to infer things like RAMs from the way the Verilog or VHDL is written. Ideally, I think you should only have FPGA-specific blocks at the very top level of a design, and the majority of the design should be agnostic to the FPGA architecture. For example, if you write your code like this:

  module sync_ram (
    input             clk, wr_en,
    input      [9:0]  rd_addr, wr_addr,
    input      [31:0] wr_data,
    output reg [31:0] rd_data
  );
    reg [31:0] mem [0:1023];
    // synchronous read + write: this coding style lets the tools infer block RAM
    always @(posedge clk) begin
      rd_data <= mem[rd_addr];
      if (wr_en)
        mem[wr_addr] <= wr_data;
    end
  endmodule
Then tools like Vivado, Quartus, and Synplify will infer a 1k x 32-bit RAM.


Scott, you are right, and I try to do it that way (there are many registers in the design with a "_ram" suffix). But I did have some problems where Vivado incorrectly inferred a small non-registered RAM as block RAM.

I was able to modify the design (moving registers from the inputs to the outputs - https://github.com/Elphel/x393/blob/master/x393_sata/host/el... ) to force Vivado to infer correctly, but I still suspect that may not always work. So I used wrapper modules for block RAMs with direct instantiations; they can be replaced as you suggested.


Actually, the IC design tools cost a lot more than 50k to 100k. You need a variety of tools for a modern process node: Calibre is one of them, usually used for LVS/DRC (checking the GDSII against the schematic and checking the process design rules). You also need a synthesis tool to convert RTL to a gate-level representation (usually another 100k or so), a place and route tool (usually many 100s of k), simulators (probably around 50k again), tools for inserting test logic (again around 50k), and timing analysis tools (probably around 50k or more). Usually you have a bunch of timing analysis tools, as you need to check timing at a variety of process corners and temperatures at the same time. There are also other tools that get used at various points in the flow that all seem to cost a lot of money too, like tools for analyzing static and dynamic IR drop, and logical equivalence tools for formally checking that the gate-level description matches the RTL. So you can see you can end up spending a million dollars fairly easily.
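A rough tally of those ballparks (the last entry has no price above, so it's purely my guess):

  # Rough tally of the ballpark figures above; none are vendor quotes,
  # and the last line item is an assumed lump sum
  costs = {
      "synthesis":        100_000,
      "place and route":  300_000,   # "many 100s of k"
      "simulators":        50_000,
      "test insertion":    50_000,
      "timing analysis":   50_000,
      "LVS/DRC, IR drop, equivalence checking (assumed)": 200_000,
  }
  print(f"~${sum(costs.values()):,} per seat")   # ~$750,000

Multiple seats and extra corner licenses push that past a million quickly.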


I think this is a common misconception in the debate about GPUs vs FPGAs. If you take a top-of-the-range GPU you get 2.7 teraflops of performance (according to the GTX Titan review I just looked at: http://www.techspot.com/review/977-nvidia-geforce-gtx-titan-...). Comparing this to a top-of-the-range Stratix 10 FPGA, you get 3.2 teraflops (https://www.altera.com/content/dam/altera-www/global/en_US/p...), so there is really not much in it.


You have to take into account that Titans are super cheap compared to FPGAs. Big FPGA boards for HPC can easily cost between $5k and $10k. If you compare to GPUs in the same price range, you end up with a K40 or K80, which peak at 4.3 TFLOPS SP and 5.6 TFLOPS SP respectively, much higher than a Stratix 10. Moreover, FPGAs are not really good at double-precision FP, which is important in many HPC areas. At the end of the day, the important metric is FLOPS/$, and more importantly what you can achieve for your application given the tooling and ecosystem. Many scientists are not computer science experts, and many HPC codes are legacy simulations which can be hard to port and re-validate. In my experience, FPGAs are still a niche accelerator vs GPUs. And I am not even talking about future Xeon Phi generations. And of course, when talking about HPC you should not forget the elephant in the room: the standard Xeon...
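To put rough numbers on the FLOPS/$ point (peak SP TFLOPS are the figures from this thread; the prices are my assumptions, not quotes):

  # Peak SP TFLOPS from the discussion above; prices are rough
  # assumptions for illustration, not vendor quotes
  devices = {
      "GTX Titan":        (2.7, 1_000),
      "Tesla K40":        (4.3, 5_000),
      "Tesla K80":        (5.6, 7_000),
      "Stratix 10 board": (3.2, 8_000),
  }
  for name, (tflops, dollars) in devices.items():
      print(f"{name:16s} {1000 * tflops / dollars:5.2f} GFLOPS/$")

Under those assumptions the consumer GPU wins by a wide margin.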


My work aims to remove the requirement of being a computer scientist in order to leverage the power of an FPGA, by using Haskell/CLaSH as an HDL that is close to mathematics.

Furthermore, verification of the designs is simplified a lot by checking them directly in Haskell, rather than generating VHDL testbenches and then running an additional simulator tool.

Lastly, I hope that with the recent acquisition of Altera by Intel, some of the other issues you mentioned (mainly floating-point performance), along with some of the tooling issues, will be addressed as well.


I understand that, and having a DSL is definitely a good idea, but you need to create a community, which can be hard (and NVIDIA seems to be good at it). I didn't want to be deceptive; I just wanted to highlight that it is not just a matter of peak FLOPS (in fact it never is - as an engineer working on another niche accelerator, I know it all too well ;) ).


Which other niche accelerators?


http://www.kalrayinc.com You can see it as a scaled-out DSP.


I know IBM recently announced a cloud FPGA system called SuperVessel. Maybe it would be useful for your project.


Well, the top of the range FPGAs are priced two orders of magnitude above the top of the range GPUs, so in terms of Tflop/$ GPUs will win in many cases.


Given that for most projects the power costs are much higher than the upfront equipment costs, and that the dominating factor in computational density and interconnect speed is the thermal budget, I'd think it's really a question of Tflop/watt.


The cost of each watt of grid power for 2 years is about $1. A high-end GPU costs $3000 and burns 200W, so power costs $200 over 2 years, or 6% of the total cost. I can't think of any high-performance computing semiconductors costing less than $1 / watt.
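To spell that arithmetic out:

  # assumed rate: $0.06/kWh * 24h * 365d * 2y / 1000 ~= $1 per watt per 2 years
  gpu_cost, gpu_watts, dollars_per_watt_2yr = 3000, 200, 1.0
  power_cost = gpu_watts * dollars_per_watt_2yr   # $200 of electricity
  print(power_cost / (gpu_cost + power_cost))     # ~0.06 -> about 6% of total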

What systems can you point to where the power cost over the time-to-obsolescence exceeds capital cost? Besides Bitcoin mining.


You're correct; I had received wrong information and never really thought it through with regard to high-end systems.

Just about the only systems where it makes sense to talk about the power budget being relevant are where you're going in on base commodity systems and talking about a 3-4 year cycle time. (And maybe weird cases where we're having to meet power budgets of existing deployments.)

I was comparing situations that had already made a FLOPS/dollar decision because of constraints on other resources (cheap hardware, lots of it, TONS of storage, high sync latency), so I guess that falls outside traditional HPC and power is a secondary concern.

Thanks for the correction and have an upvote. (:


In terms of power, there is a largely unexplored yet very interesting world of mobile GPUs. The "Mont Blanc" project is about to dig into this area: http://www.montblanc-project.eu/

But, yes, I'm very enthusiastic about the Altera acquisition by Intel, it may drive prices down and we'll probably see FPGA-enhanced Xeons soon.


What are state of the art bitcoin miners using nowadays? That might shed light on which platform is more efficient.


Bitcoin miners use ASICs, though the progression was from GPUs to FPGAs, since FPGAs are more efficient than GPUs in terms of power.


In fact some of the Stratix 10 FPGAs have over 9 Teraflops...


If you are interested in using Python for unit testing, there is also cocotb, a Python-based library for running Verilog and VHDL simulations. It interfaces with the simulator and allows you to stimulate your design directly from Python:

https://github.com/potentialventures/cocotb
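A minimal test looks something like this (the DUT port names here are hypothetical; they would match whatever your Verilog/VHDL toplevel exposes):

  import cocotb
  from cocotb.clock import Clock
  from cocotb.triggers import RisingEdge

  # Write one word to a simple synchronous RAM, then read it back.
  # All port names (clk, wr_en, ...) are hypothetical examples.
  @cocotb.test()
  async def test_write_then_read(dut):
      cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
      dut.wr_en.value = 1
      dut.wr_addr.value = 5
      dut.wr_data.value = 0xDEADBEEF
      await RisingEdge(dut.clk)      # write commits on this edge
      dut.wr_en.value = 0
      dut.rd_addr.value = 5
      await RisingEdge(dut.clk)      # registered read happens on this edge
      await RisingEdge(dut.clk)      # rd_data is stable to sample now
      assert dut.rd_data.value == 0xDEADBEEF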


Very interesting. Will definitely look into this.

All of our tests right now are implemented as VHDL/Verilog testbenches. We automate building and running them in ISim with a simple Python tool which generates xUnit output. It works, but it's slow and kind of painful to manage test cases.


I can't help but feel that the VHDL that was written is a bit overly verbose. You could probably write something like:

  i_sm: process(clk, reset)
  begin
    if (reset = '1') then
      state <= DATA_BITS;
    elsif (clk'event and clk = '1') then
      case state is
        when DATA_BITS =>
          if (data_valid = '1') then
            if (count < 8) then
              count <= count + 1;
            else
              state <= STOP_BITS;
              count <= 0;
            end if;
          end if;
        when STOP_BITS =>
          if (data_valid = '1') then
            if (count < num_stop_bits) then
              count <= count + 1;
            else
              state <= DATA_BITS;
              count <= 0;
            end if;
          end if;
      end case;
    end if;
  end process i_sm;
which is not too dissimilar from the Cx example (I've missed a few things out like port/signal declarations; I just wanted to show the guts of the code). The thing I like about VHDL/Verilog is that it's easy to tell the exact port names, what the clk is, the name and type of the reset, etc., which is useful information for putting the block in the context of an overall system.


Hi Scott, I'm a co-founder of Synflow. You're right, the VHDL on our website is a bit overly verbose. I will update the website with the right version of the code today.

And I also agree with your point. This is why we added properties to the language (http://cx-lang.org/documentation/properties) so one can either use the (implicit) default names and types for the clk, reset, etc., or explicitly tweak things for more complex systems.


Actually, another thing I am curious about is how asynchronous clock domains are handled. Usually this is something that's quite tricky to model in an HLS tool. Also, how about simulating the interactions between the domains?


That's a good point Scott. Asynchronous clock domains are indeed complex and it took us time to manage them efficiently. When you need to connect different clock domains with Cx/ngDesign, you have to synchronize the various tasks with specific components (e.g. SynchronizerMux): http://cx-lang.org/documentation/instantiation/stdlib. Simulating the interactions between the domains is not yet supported by our simulator; we will develop this feature when people request it. And you know, we're a startup, so we still have plenty of R&D to do :-)


I think if you can crack the modelling and simulation of asynchronous clock domains then you will have something the other HLS solutions don't have at the moment, and that would be an incredibly useful feature. Design with async clocks is difficult and I have seen loads of bugs in these interfaces (including bugs found in the field, in chips that had been released many years previously).


Those Ada roots, though.


If you are interested, I did something similar for FPGA: https://github.com/scottwilson46/FPGARandom


There's also the SocKit from Arrow: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=E...

Very similar to the Zynq, but from Altera rather than Xilinx.


It's not going to limit clock speed; it's just not going to work. To write multiple values to different RAM locations (given that a RAM has at most two ports) you would need to stay in the INIT state for 9 cycles and do something like:

  `INIT:
    begin
      if (count == 9) begin
        next_count = 0;
        nextState  = `EVALUATE;
      end
      else
        next_count = count + 1;
      case (count)
        0: begin ram_wr_addr = count; ram_wr_data = `CMD_LED_ON;            ram_wr_en = 1; end
        1: begin ram_wr_addr = count; ram_wr_data = `CMD_LOAD_NEXT_TO_ACCU; ram_wr_en = 1; end
        // etc....
      endcase
    end
Usually you have a RAM module that takes an address and some write_data, wr_en, etc., rather than accessing the array directly.


I think we've come back to "original author doesn't actually know how to write logic, thinks he's writing sequential code" as the core problem

