The same mistake AMD made with Bulldozer - they would have been better off marketing it as 4c/8t. Here the outcome will be better, the cards are beasts regardless, but we can certainly agree that something about the math is odd.
Who knows what they are comparing, but honestly I don't believe the efficiency gain is anywhere near 2x over Turing (maybe in an isolated RT test or something similar). The 3090 draws 40% more power than the 2080 Ti out of the gate, so for the efficiency claim to hold, the performance gap would have to be considerably larger than even what I was guessing. And frankly I think the 3080 will end up somewhat more than 35% faster than the 2080 Ti, because they are targeting/claiming the 3070 to be roughly on par with the previous generation's flagship - and from what we know, the 3080 has far more raw horsepower.
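The power argument above can be sanity-checked in a few lines. This is a rough sketch using the official board-power specs (350 W for the RTX 3090, 250 W for the RTX 2080 Ti); the "implied speedup" is just the arithmetic consequence of a hypothetical 2x perf/W claim at that power draw:

```python
# Official board-power (TGP) specs for the two cards.
tdp_3090 = 350     # W, GeForce RTX 3090
tdp_2080ti = 250   # W, GeForce RTX 2080 Ti

power_ratio = tdp_3090 / tdp_2080ti
print(f"RTX 3090 board power: {power_ratio - 1:.0%} higher")  # 40% higher

# If perf/W really doubled, absolute performance at this power draw
# would have to be 2 * 1.4 = 2.8x that of the 2080 Ti.
implied_speedup = 2 * power_ratio
print(f"Implied speedup for a true 2x perf/W: {implied_speedup:.1f}x")
```

No leaked or announced benchmark comes anywhere near a 2.8x uplift, which is why the 2x efficiency figure most likely refers to an isolated best-case workload.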
Here's the explanation:
Could you elaborate a little on this doubling of CUDA cores? How does it affect the general architectures of the GPCs? How much of a challenge is it to keep all those FP32 units fed? What was done to ensure high occupancy?
[Tony Tamasi] One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.
Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM. (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
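The figures in the answer above line up if you read the bytes-per-clock numbers as per-SM and multiply by each card's official boost clock. As an assumption not stated in the quote, I'm using ~1.71 GHz for the RTX 3080, ~1.815 GHz for the RTX 2080 Super, and 68 SMs for the 3080; the same per-clock rates also reproduce the advertised FP32 TFLOPS:

```python
# Per-SM FP32 rate: 4 partitions x 32 FP32 ops/clock (FP32-only mode).
partitions = 4
fp32_per_partition = 32
fp32_per_sm = partitions * fp32_per_partition      # 128, double Turing's 64

# Per-SM L1/shared-memory bandwidth = bytes/clock x boost clock (GHz).
ampere_l1_gbs = 128 * 1.71    # ~218.9 GB/s -> the quoted 219 GB/s
turing_l1_gbs = 64 * 1.815    # ~116.2 GB/s -> the quoted 116 GB/s

# Whole-chip FP32 throughput for the RTX 3080 (FMA counts as 2 FLOPs):
sm_count_3080 = 68
tflops_3080 = sm_count_3080 * fp32_per_sm * 2 * 1.71e9 / 1e12  # ~29.8 TFLOPS

print(fp32_per_sm, round(ampere_l1_gbs), round(turing_l1_gbs), round(tflops_3080, 1))
```

That ~29.8 TFLOPS matches NVIDIA's advertised shader throughput for the 3080, so the "doubled CUDA cores" really are counted as full FP32 units in the marketing math.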
It's interesting how SM design evolved across the generations:
- Fermi: 32 CUDA/SM, high CUDA clocks, small total CUDA core count
- Kepler: 192 CUDA/SM, drastic increase in total CUDA core count, lower per-core efficiency
- Maxwell: 128 CUDA/SM, small increase in CUDA core count, significantly higher per-core efficiency
- Pascal: 128 CUDA/SM, per-core efficiency on par with Maxwell, significantly higher clocks
- Turing: 64 CUDA/SM, same clocks as Pascal, higher per-core efficiency
- Ampere: 128 CUDA/SM (or 64 FP32 + 64 INT32 per SM), large increase in CUDA core count, lower per-core efficiency
By this, it looks like Ampere went back to the Pascal/Maxwell SM design, although I'm not sure how INT execution worked back then. I assume Ampere is more flexible, because each of the SM's four sub-blocks can execute either 32 FP32 operations or 16 FP32 + 16 INT32 operations per clock. On top of that, they doubled the SM's shared-memory/L1 bandwidth.
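That per-partition flexibility can be modeled as a toy function (purely illustrative; the function and datapath names are mine, not NVIDIA's): each of the four SM partitions issues either 32 FP32 ops or 16 FP32 + 16 INT32 ops per clock, so INT-heavy code directly eats into the FP32 peak:

```python
def partition_issue(needs_int: bool) -> dict:
    """Ops issued per clock by one Ampere SM partition (toy model).

    Datapath A: 16 FP32-only CUDA cores.
    Datapath B: 16 cores that execute either FP32 or INT32.
    """
    if needs_int:
        return {"fp32": 16, "int32": 16}  # datapath B switches to INT32
    return {"fp32": 32, "int32": 0}       # both datapaths run FP32

# A whole SM is 4 such partitions:
fp32_only = [partition_issue(False) for _ in range(4)]
mixed = [partition_issue(True) for _ in range(4)]

print(sum(p["fp32"] for p in fp32_only))   # 128 FP32 ops/clock
print(sum(p["fp32"] for p in mixed),
      sum(p["int32"] for p in mixed))      # 64 FP32 + 64 INT32 ops/clock
```

This matches the quote above: 128 FP32/clock in pure-FP mode, or 64+64 in mixed mode, whereas Turing's 64 FP32 + 64 INT32 units were fixed-function per type.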
All in all, to me it comes down to the trade-off between what you invest (transistor count) and what you get (performance), and they decided this route was worth it.
Most people expected the RT resources to be expanded significantly and the classic shaders only slightly, but the opposite happened. However, Tony Tamasi says this:
Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.
So I trust they know why they designed the GPU this way.