AMD Phenom

  • Thread starter: Nedjo
Status: Closed to new replies.
That doesn't mean he's right. The man saw only one game, and only briefly. Same as the HL Lost Coast demo for Penryn.
Since you obviously know how a fan works, the best thing to say is this:
"Speak now, or forever hold your peace".
I.e., give concrete numbers, or don't expect to be trusted. At least not until we see the numbers.

The man didn't say what he saw... He surely didn't see just one game. You assume a lot and know little, at least less than Mr. Sood himself. He put it nicely: "And for the record, if you were to benchmark Phenom at 3GHz you would see that it kicks the living crap out of any current AMD or Intel processor—it is a stone cold killer (at 3GHz, now imagine how it would perform if they could squeek some more juice out of it?). "

"If you were to benchmark" - I hope your English is good 🙂, since it's clear what this means. And a man who runs the gaming department of a small company known as HP surely wouldn't talk off the top of his head without having seen the benchmarks Phenom ran. But who would ever believe him, he's some shady character :d.

As for Barcelona's prices and performance, we won't have to wait much longer, only about 20 more days.
 
That doesn't mean he's right. The man saw only one game, and only briefly. Same as the HL Lost Coast demo for Penryn.
Since you obviously know how a fan works, the best thing to say is this:
"Speak now, or forever hold your peace".
I.e., give concrete numbers, or don't expect to be trusted. At least not until we see the numbers.
Well, there were freaks who said about Penryn too that it absolutely destroys everything in their game with 2x8800GTX. :d
The point stands: we WANT NUMBERS! 😀
 
Well, there were freaks who said about Penryn too that it absolutely destroys everything in their game with 2x8800GTX. :d
Exactly that 😉

@ivanho, I don't care even if he's one of Intel's founders. Nothing in what he said states that he saw benchmarks, apart from that one game.
Don't put words in his blog.
And even if he had said it precisely, I wouldn't believe him without proof.

And yes, when I say I want numbers, I don't mean SuperPI or SpecFP rate. I want useful applications (rendering, DivX (or OK, not required until they implement SSE4 😛), etc.)
 
But Core has a shared RS for both FP and integer, which is not the case with K8/K10. K8/K10 has a separate FP cluster. When executing pure integer code, Core may have the advantage.

K8/K10 does have separate integer and FP schedulers, but it's not as if their sizes can simply be added together.
On the integer side K8 effectively has a 2x24-entry RS, since it keeps MacroOps (a.k.a. Intel's fused micro-ops) in a single slot (and executes them independently), while Core splits them as soon as they reach the RS. So it's 2x24 vs. 32; however, Core still has the advantage because it can send any op to any ALU, while the K8/K10 RS is organized as 3x8, so an op from lane 0 must go to ALU0, lane 1 to ALU1, lane 2 to ALU2.

The text says (if I understood correctly) that if K10 has a 2-µop instruction ready and can't find a free/ready 1-µop instruction to execute alongside it (since it handles 3 µops per cycle), it will insert an "empty" µop in its place.

You mean the pack stage, right after decoding. There is a restriction that vector path instructions cannot be mixed with direct path single and double instructions in one row of 3 MacroOps. Also, the integer multiplier is only reachable through ALU0, so poor packing can happen there too if several such instructions sit close to each other.

That's why Core has a unified (L1 and L2) DTLB and an L1 ITLB, which seems more logical to me

Why is it more logical?
 
K8/K10 does have separate integer and FP schedulers, but it's not as if their sizes can simply be added together.
On the integer side K8 effectively has a 2x24-entry RS, since it keeps MacroOps (a.k.a. Intel's fused micro-ops) in a single slot (and executes them independently), while Core splits them as soon as they reach the RS. So it's 2x24 vs. 32; however, Core still has the advantage because it can send any op to any ALU, while the K8/K10 RS is organized as 3x8, so an op from lane 0 must go to ALU0, lane 1 to ALU1, lane 2 to ALU2.
And why do you think it's an advantage if an op can be sent to any ALU? I mean, all the ALUs are symmetric in the K8 architecture. For FP it does matter, but I assume the FP slot is already determined in the fetch/pick stage, before the op reaches the FP scheduler.
 
Barcelona Launch Clock Speeds Changing?

http://www.dailytech.com/Barcelona+Launch+Clock+Speeds+Changing/article8592.htm

AMD is changing its "Barcelona" launch plans again

Let's come right out and say it: when AMD announced that the launch clock speeds of Barcelona would start at sub-2.0 GHz, people were frustrated. This long-awaited monolithic quad core release from AMD would be the pinnacle of their core design technology and would usher in a new era of powerful computing.

Sound about right? So, imagine the collective chagrin when a 1.6 GHz base clock speed was announced with top-end clock speeds ending around 2.0 GHz. In the interim, reports surfaced about AMD having issues with top-end core speeds, which was quickly countered by the 3.0 GHz Phenom demonstration, leakage, and other processor related functions.

We knew AMD was hard at work on perfecting the Barcelona core even before the official production launch and that the promise of HT3 in this generation of processors was going to be eschewed in favor of the more widely integrated HT1.x version.

Now, word is coming that AMD might drop the 1.6 GHz from the launch line-up in favour of a 2.1 GHz or 2.2 GHz core launch speed. Perhaps the fabrication process is going much smoother than anyone realized and some of the leakage problems have been ironed out? Who knows but, looks like things are picking up for AMD.

And Dave Graham also said (besides occasionally writing for DailyTech, the guy runs flickerdown.com and works very closely with AMD):
It's reliable as it came from an inside source at AMD. I work very closely with these guys and, for the record, this falls outside the scope of my NDA with them.

cheers,

Dave
 
And why do you think it's an advantage if an op can be sent to any ALU? I mean, all the ALUs are symmetric in the K8 architecture. For FP it does matter, but I assume the FP slot is already determined in the fetch/pick stage, before the op reaches the FP scheduler.

Say one of the three RSs holds 3 ops that are ready to execute at the same time, and one or both of the other two ALUs are free. The scheduler can send only one op, and only to its own ALU, so the other free ALUs go unused. A unified scheduler would send all the ready ops to all the free ALUs.
One of the K7->K8 improvements was the pack stage, which arranges the MOps before they reach the ROB/RS in order to avoid such situations (where possible).
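
A toy model of the case described above, as a rough sketch rather than a description of the real hardware: three ready, independent ops all sit in the lane tied to ALU0. The function names, the one-op-per-lane-per-cycle rule and the three-ALU default are simplifying assumptions made for illustration; dependencies, latencies and the pack stage itself are ignored.

```python
from collections import deque


def issue_lane_bound(lanes):
    """One cycle: each lane issues at most one ready op, only to its own ALU."""
    issued = []
    for alu, queue in enumerate(lanes):
        if queue:
            issued.append((queue.popleft(), alu))
    return issued


def issue_unified(ready_ops, num_alus=3):
    """One cycle: any ready op may be sent to any free ALU."""
    issued = []
    for alu in range(num_alus):
        if ready_ops:
            issued.append((ready_ops.popleft(), alu))
    return issued


def cycles_to_drain_lane_bound(lanes):
    """Count cycles until every lane-bound queue is empty."""
    lanes = [deque(q) for q in lanes]
    cycles = 0
    while any(lanes):
        issue_lane_bound(lanes)
        cycles += 1
    return cycles


def cycles_to_drain_unified(ops, num_alus=3):
    """Count cycles until the single shared queue is empty."""
    ready = deque(ops)
    cycles = 0
    while ready:
        issue_unified(ready, num_alus)
        cycles += 1
    return cycles


# Three ready, independent ops that all landed in lane 0 (hypothetical case).
lane_bound_cycles = cycles_to_drain_lane_bound([["a", "b", "c"], [], []])
unified_cycles = cycles_to_drain_unified(["a", "b", "c"])
print(lane_bound_cycles, unified_cycles)
```

In this model the lane-bound scheduler needs 3 cycles and the unified one needs 1. With one op per lane both finish in a single cycle, which is the arrangement the pack stage tries to set up in advance.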
 
In any case, it's a somewhat different approach from the P6/Core architecture. The pack stage determines where the MOps will be sent.
 
New information from VR-zone

We heard that AMD partners has gotten B1-step Barcelona for compatibility testing with better and newer AGESA code compared to the B0-step Barcelona. There is also B2-step Barcelona which is bug free and then BA-step Barcelona for mass production. AMD plans for two 2-way Opteron at 1.9 and 2.0GHz at launch and higher clocked parts at 2.1 to 2.4GHz coming later this year.

http://www.vr-zone.com/articles/BA-Step_Barcelona_For_Launch_In_Sept/5196.html

So the new BA revision is responsible for K10's higher launch clock. Another interesting thing is the claim that all revisions before B2 were in some way "buggy". So B2 is the first bug-free revision, and it's the one that reached high clocks with the ESs (fudzilla, inq, vr-zone).
This further means that Phenom is probably also based on some version of BA, and that everything is probably fine now as far as frequencies go. It's possible AMD will do one more revision before the end of the year for the Q1 launch of the X2. BTW, the first GA AM2+ HT3.0 QuadFire board has appeared, with a preview on ocworkbench (supposedly coming out in September).
 
The correct order is:
B0 - Q1
B1 - Q2
BA - Q3
B2 - Q4
As for "bug free", that terminology is a bit awkward to use, because no revision is buggy in terms of functionality. It's just a matter of polishing the manufacturing process and fine-tuning the design to remove obstacles to reaching ever higher frequencies.

Four cores forming a whole, each with its own PLL for an independent clock, and on top of that the ability to independently shut down unused parts of the cores themselves... all of that is the reason for the exceptionally thorny path to higher operating frequencies.

But thanks to AMD's APM manufacturing technology, changes to the "recipe" by which the CPU is made can be carried out much more easily, almost on the fly, so we should expect more and more overclockable K10 cores to arrive quarter after quarter.
 
Besides this, AMD plans to announce an extension to the x86 instruction set in 3 days:
http://www.extremetech.com/article2/0,1697,2175100,00.asp
It could be anything from virtualization, through SSE extensions (least likely), to the instructions needed for the recently announced Lightweight Profiling Proposal (or something very close to it).

"Last week, AMD announced the availability of an early specification describing Light-Weight Profiling, a technology supporting the recently introduced Hardware Extensions for Software Parallelism initiative," the AMD spokeswoman said. "That was something different, but closely related."
Hm, closely related, but that's not it 🙂. What could it be 🙂?
 
Here are some predictions, maybe someone will find them interesting.

For desktops, Barcelona will probably lead multithreaded performance and applications that strongly depend on high bandwidth, but single threaded workloads may slightly favor Intel’s designs. Note that the dual core desktop processors are likely to remain a large portion of the product mix, until the marginal cost for the additional cores is low, or most applications can use 4 threads. A mobile variant of Barcelona will not be introduced till 2009, after Griffin. This is because many of the improvements in Barcelona are focused on server performance, and may not have the right power/performance balance for notebooks.

http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT051607033728&mode=print
 
Say one of the three RSs holds 3 ops that are ready to execute at the same time, and one or both of the other two ALUs are free. The scheduler can send only one op, and only to its own ALU, so the other free ALUs go unused. A unified scheduler would send all the ready ops to all the free ALUs.
One of the K7->K8 improvements was the pack stage, which arranges the MOps before they reach the ROB/RS in order to avoid such situations (where possible).
I believe both approaches have their advantages and drawbacks.
With K8/K10 the "path" an instruction will take is decided in advance, in the pack stage, while with Core it is decided inside the RS, probably based on some control bits that track how busy the ALU and FPU units are.
Decoding as implemented in AMD's processors may be a bit more complicated, but it may still allow slightly better scalability because of the separate FP cluster.

Old news! 🙂
 
Shall we remind ourselves once more of the new features the K10 core brings?


Macro/micro-architectural improvements over K8:

Quad-core
- Native quad-core design
- Redesigned and improved crossbar(northbridge)
- Improved power management
- New level of cache added, L3 VICTIM
Power management - DICE(Dynamic Independent Core Engagement)
- PLLs for each core, clocked independently and varies clock speed depending on usage.
- ODMC power management: ability to shut down read channels if memory is only using writes and vice versa:
* Reduces the power consumption of the memory controller by up to 80% on "many" workloads.
- Aggressive grained clock gating
- Power management state invariant time stamp counter (TSC)
- Enhanced AMD's PowerNow - works independently without OS driver support
Virtualization improvements
- Nested Paging(NP):
* Guest and Host page tables both exist in memory.(The CPU walks both page tables)
* Nested walk can have up to 24 memory accesses! (Hardware caching accelerates the walk)
* "Wire-to-wire" translations are cached in TLBs
* NP eliminates Hypervisor cycles spent managing shadow pages(As much as 75% Hypervisor time)
- Reduced world-switch time by 25%:
* World-switch time: round-trip to Hypervisor and back
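
The "up to 24 memory accesses" figure above can be reproduced with a little arithmetic, sketched here under the assumption that both the guest and the host use four-level page tables (the usual x86-64 layout with 4K pages). The function name and parameters are made up for illustration.

```python
def nested_walk_accesses(guest_levels=4, host_levels=4):
    """Worst-case memory accesses for one nested (two-dimensional) page walk.

    Assumes no caching at all: every guest-physical address that gets
    touched must itself be translated with a full host walk.
    """
    # One host walk per guest table level, plus one for the final data address.
    host_walks = guest_levels + 1
    host_accesses = host_walks * host_levels
    # The guest page-table entries themselves still have to be read.
    guest_accesses = guest_levels
    return host_accesses + guest_accesses


print(nested_walk_accesses())  # 5 host walks * 4 levels + 4 guest reads = 24
```

The same count can be written as (guest_levels + 1) * (host_levels + 1) - 1, which is why caching translations in the TLBs matters so much here.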
Dedicated L1 cache
- 256bit 128kB (64kB instruction/64kB data), 2-way associative
- 2 x 128bit loads/cycle
- lowest latency
Dedicated L2 cache
- 128bit 512kB, 16-way associative
- 128bit bus to northbridge
- reduced latency
- eliminates conflicts common in shared caches - better for virtualization
Shared L3 cache
- 128bit 2MB, 32-way associative
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy
- Expandable
Independent DRAM controllers
- Concurrency
- More DRAM banks reduces page conflicts
- Longer burst length improves command efficiency
- Dual channel unbuffered 1066 support(applies to socket AM2+ and s1207+ QFX only)
- Channel Interleaving
Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
Re-architect northbridge for higher bandwidth
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
Write bursting
- Minimize Rd/Wr Turnaround
DRAM prefetcher
- Track positive and negative, unit and non-unit strides
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible
* 2 outstanding requests to any address
HyperTransport 3
- Up to three 16bit cHT links
- Up to 5200MT/s per link
- Un-ganging mode: each 16bit HT link can be divided in two 8bit virtual links
- Can dynamically adjust frequency and bit width to save power
- AC mode (higher latency mode) to allow longer communications distances
- Hot pluggable
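
For a sense of scale, the link numbers above translate into bandwidth roughly as follows. This is back-of-the-envelope arithmetic that assumes the MT/s figure applies per direction and ignores protocol overhead; the function name is hypothetical.

```python
def ht_link_bandwidth_gbs(transfers_mt_s=5200, width_bits=16):
    """Raw bandwidth of one HyperTransport link, per direction, in GB/s."""
    bytes_per_transfer = width_bits / 8
    return transfers_mt_s * bytes_per_transfer / 1000


full_link = ht_link_bandwidth_gbs()                 # 16-bit link at 5200 MT/s
unganged_half = ht_link_bandwidth_gbs(width_bits=8) # one 8-bit virtual link
three_links = 3 * full_link                         # all three cHT links

print(f"16-bit link:     {full_link:.1f} GB/s per direction")
print(f"8-bit sub-link:  {unganged_half:.1f} GB/s per direction")
print(f"three links:     {three_links:.1f} GB/s per direction")
```

Under these assumptions a full 16-bit link comes out at 10.4 GB/s in each direction, and each half of an un-ganged link at 5.2 GB/s.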
[Image: Barcelona die shot]

CPU Core IPC Enhancements:
Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Double return stack size
- More branch history bits and improved branch hashing
- History-based pattern predictor
32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs
Out-of-order load execution
- New technology allows load instructions to bypass:
* Other loads
* Other stores which are known not to alias with the load
- Significantly mitigates L2 cache latency
TLB Optimisations
- Support for 1G pages
- 48bit physical address (256TB)
- Larger TLBs key for:
* Virtualized workloads
* Large-footprint databases and
* transaction processing
- DTLB:
* Fully-associative 48-way TLB (4K, 2M, 1G)
* Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
* 16 x 2M entries
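
The TLB and address-width figures above are easier to judge as "reach", i.e. how much memory the entries can map without a miss. A small sketch; the helper name is made up, and the entry counts are simply the ones listed.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every entry maps one page."""
    return entries * page_size


# 48-entry fully associative L1 DTLB, for each supported page size.
l1_dtlb = {name: tlb_reach(48, size)
           for name, size in [("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)]}

# L2 TLBs: 512 entries for 4K pages, 128 entries for 2M pages.
l2_4k = tlb_reach(512, 4 * KIB)
l2_2m = tlb_reach(128, 2 * MIB)

# 48-bit physical addressing.
phys = 2**48

print("L1 DTLB, 4K pages:", l1_dtlb["4K"] // KIB, "KiB")
print("L1 DTLB, 2M pages:", l1_dtlb["2M"] // MIB, "MiB")
print("L1 DTLB, 1G pages:", l1_dtlb["1G"] // GIB, "GiB")
print("L2 TLB,  4K pages:", l2_4k // MIB, "MiB")
print("L2 TLB,  2M pages:", l2_2m // MIB, "MiB")
print("Physical address space:", phys // TIB, "TiB")
```

With 4K pages the L2 TLB covers only 2 MiB, while the 2M entries cover 256 MiB, which shows why large pages are listed as key for databases and virtualized workloads.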
Data-dependent divide latency
Additional fastpath instructions
– CALL and RET-Imm instructions
– Data movement between FP & INT
Bit Manipulation extensions
- LZCNT/POPCNT
SSE extensions
- EXTRQ/INSERTQ (SSE4A)
- MOVNTSD/MOVNTSS (SSE4A)
- MWAIT/MONITOR (SSE3)
Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 double precision FP OPS/cycle
- Dual 128-bit loads per cycle
- New vector code, SSE128
- Can perform SSE MOVs in the FP “store” pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP Scheduler can hold 36 Dedicated x 128-bit ops
- SSE Unaligned Load-Execute mode:
* Remove alignment requirements for SSE ld-op instructions
* Eliminate awkward pairs of separate load and compute instructions
* To improve instruction packing and decoding efficiency

Most of the information is from Ben Sander's presentation at AMD FPF 2006, with additional details from various internet sites.
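
The "up to 4 FP ops per cycle" figure in the SSE section gives a theoretical peak once a clock speed is picked. A rough sketch: the function name is hypothetical, the 2.0 GHz and 3.0 GHz clocks are just the launch and demo speeds mentioned elsewhere in this thread, and real code will land well below the peak.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=4):
    """Theoretical peak double-precision GFLOPS for the whole chip."""
    return cores * clock_ghz * flops_per_cycle


launch_peak = peak_gflops(cores=4, clock_ghz=2.0)  # quad-core at 2.0 GHz
demo_peak = peak_gflops(cores=4, clock_ghz=3.0)    # quad-core at 3.0 GHz

print(f"4 cores @ 2.0 GHz: {launch_peak:.0f} GFLOPS peak")
print(f"4 cores @ 3.0 GHz: {demo_peak:.0f} GFLOPS peak")
```

This only counts issue width times clock; it says nothing about whether memory bandwidth or the schedulers can keep the units fed.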

 
I don't think it's time to remind ourselves of such things, but rather to get concrete numbers ^^
I can't wait to try out VM 😉
 
From the link Snejk posted...

2x Opteron 2332 (8 cores), Win XP x64.............................13295
2x Xeon X5350 @ 2 GHz (8 cores), Win XP x64.....................14260

That's it, as far as Cinebench 10 is concerned.
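
Putting the two Cinebench 10 scores above side by side, purely as arithmetic on the posted numbers (the variable names are made up):

```python
opteron_score = 13295  # 2x Opteron 2332, 8 cores
xeon_score = 14260     # 2x Xeon X5350 @ 2 GHz, 8 cores

# Opteron result as a fraction of the Xeon result.
opteron_vs_xeon = opteron_score / xeon_score

# Xeon lead expressed relative to the Opteron result.
xeon_lead_pct = (xeon_score / opteron_score - 1) * 100

print(f"Opteron reaches {opteron_vs_xeon:.1%} of the Xeon score")
print(f"Xeon leads by {xeon_lead_pct:.1f}%")
```

So the Opteron pair lands at about 93% of the Xeon pair's score, i.e. the Xeons lead by roughly 7%.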

The last post in the thread (for now, of course)...

Ouch!!!!

Bad News : So far it's down in Cinebench, wPrime and Superpi

Good News: All are synthetic and not real world benches
 
This is a pretty bad result, although if you look at how the score scales in a multithreaded environment, things look somewhat better.

It's simply unbelievable that after so many microarchitecture improvements the performance gain is this miserable. Going by this, K10 is barely faster than K8.
 
As I said in the other thread, clock for clock these results are only marginally better than K8, which is impossible. The 32-byte fetch and the L3 alone will add 5-10% in total, not to mention the other things that were changed (prefetch, branch prediction, faster IMC, etc.).
 
So, fake? 🙂
It's old boards; either that, or it just happens to be weak memory.
 
😀 Oh, Ivan, you legend! I actually reached out with my hand to brush it away :d zzasu!!!
 
Hahaha 😀, I knew it would work 😀. Too bad signatures don't allow img code, otherwise the bug would be in my sig for good 😀 (people would be smacking their TFTs 😀; of course I'd put a disclaimer after the bug 🙂)
 
Zasu, same here 😀 I thought, what's this fly doing walking across the screen of my MacBook Pro, which I had just cleaned yesterday 😀
 
Macs attract bugs? 😀
 
At first I actually thought a bug had somehow crawled in between the screen itself and the glass protector (which I don't even have) :d
 
It looks like L2 wasn't disabled on the K10 from coolaler.com.
http://img.coolaler.com.tw/images/jj5qwwymvemckdzozmt1.jpg

But something else stands out, and that's the huge memory latencies. AMD announced reduced memory latency compared to K8. At clocks this low, memory latency shouldn't exceed 110 cycles on a K8 machine with DDR2 memory.

On the other hand, that's no reason for the poor render result. There's surely something else going on.
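
To compare latencies measured in cycles across chips running at different clocks, it helps to convert them to nanoseconds. A minimal sketch: the 2.0 GHz clock below is only a placeholder, since the actual clock of the tested sample isn't stated here, and the function names are hypothetical.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / clock_ghz


def ns_to_cycles(ns, clock_ghz):
    """Convert a latency in nanoseconds to core cycles."""
    return ns * clock_ghz


# The 110-cycle bound mentioned above, at a placeholder 2.0 GHz clock.
bound_ns = cycles_to_ns(110, clock_ghz=2.0)
print(f"110 cycles at 2.0 GHz = {bound_ns:.1f} ns")

# The same absolute latency costs more cycles at a higher clock.
print(f"{bound_ns:.1f} ns at 3.0 GHz = {ns_to_cycles(bound_ns, 3.0):.0f} cycles")
```

This is why a cycle count only means something together with the clock it was measured at: the DRAM doesn't get faster when the core does.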
 
So that's why they posted results from an X2 3800+ as a comparison at this link... I was wondering why 🙂 An interesting comparison of Phenom, X2 and A64...
 
Also interesting is the 64-byte stride result on K10 for the L2 cache. Only 8 cycles, which supports the story about prefetching into L1 and masking the L2 cache latency.
 