EVALUATING THE PERFORMANCE OF HYPERION,
A DISTRIBUTED SHARED MEMORY IMPLEMENTATION OF JAVA
BY
Kevin D. Clark
B.S., University of New Hampshire (1993)
THESIS
Submitted to the University of New Hampshire
in partial fulfillment of
the requirements for the degree of
Master of Science
in
Computer Science

May 2005
This thesis has been examined and approved.
Thesis director, Philip J. Hatcher
Professor of Computer Science

R. Daniel Bergeron
Professor of Computer Science

Robert D. Russell
Associate Professor of Computer Science
Date
Dedication
To Sharon,
thanks for all of the love and, of course,
thanks for teaching me to never trust around pizza.
Acknowledgments
This thesis has been a long but worthwhile effort. I would like to recognize some of the people who have helped with this project over the past few years.
First, I would like to thank my wife, Sharon Kunz, who has endured many nights and weekends of her husband disappearing into our office to work on this project. I never would have gotten through this project without her love and support. Second, I would like to thank my thesis director, Professor Philip Hatcher, who has always been very helpful and patient in guiding me through this project. Professor Hatcher offered me countless helpful suggestions during the course of this project, as well as answered numerous questions regarding the internals of Hyperion. Next, I would like to thank the other members of my thesis committee for reading over my thesis drafts and for their suggestions as to how to make the drafts better. These were greatly appreciated.
I would like to thank Emily Poworoznek of the UNH Engineering, Mathematics and Computer Science Library for her assistance in tracking down some of the sources used in this document. I would like to thank the UNH Computer Science Department for offering night classes—without these, given my busy career, I never would have been able to work towards my master's degree. I would like to thank the various folks in the computer science department who were kind enough to fix the Star Cluster after events like lightning strikes and blown air-conditioners. Last, but not least, I would like to thank Brian Johnson, who was my first computer science teacher when I was an undergraduate—I credit the solid foundation that I got in computer science to him.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction

2 Background and Related Work
  2.1 The Star Cluster
  2.2 Hyperion
    2.2.1 Motivations
    2.2.2 Hyperion's Compiler
    2.2.3 Hyperion's Runtime Subsystem
      2.2.3.1 Threads
      2.2.3.2 Monitors
      2.2.3.3 Other Synchronization/Communication Mechanisms
      2.2.3.4 Distributed Shared Memory Subsystem
      2.2.3.5 Low-level Communications
      2.2.3.6 Features of Java Not Present in Hyperion
  2.3 MPI
    2.3.1 Communications Operations
  2.4 The Benchmarks
    2.4.1 Overview
    2.4.2 Notes Regarding System Configuration and Data Collection
  2.5 Related Work
    2.5.1 Mathew Reno's Masters Thesis
    2.5.2 TreadMarks
    2.5.3 Split-C

3 Benchmark Comparison
  3.1 Introduction
  3.2 Gaussian Elimination
    3.2.1 Java Implementation
    3.2.2 C/MPI Implementation
    3.2.3 Results and Performance
  3.3 Shallow Water
    3.3.1 Java Implementation
    3.3.2 C/MPI Implementation
    3.3.3 Results and Performance
  3.4 Matrix Multiplication
    3.4.1 Java Implementation
    3.4.2 C/MPI Implementation
    3.4.3 Results and Performance
  3.5 Radix Sort
    3.5.1 Overview
    3.5.2 Split-C
    3.5.3 Java Implementation
    3.5.4 C/MPI Implementation
    3.5.5 Results and Performance
  3.6 Minimal Cost Graph Coloring
    3.6.1 Java Implementation
    3.6.2 C/MPI Implementation
    3.6.3 Results and Performance

4 Hyperion Assessment
  4.1 Lessons learned from porting code to Hyperion
    4.1.1 Memory Management
      4.1.1.1 Ownership
      4.1.1.2 Object Caching
      4.1.1.3 Problems with Object References
      4.1.1.4 Other Memory Management Issues
    4.1.2 Other

5 Conclusions

Bibliography

A Appendix A — Benchmark Data

List of Tables

A.1 Data for the Gaussian Elimination Benchmark
A.2 Data for the Shallow-Water Benchmark
A.3 Data for the Matrix Multiplication Benchmark
A.4 Data for the Radix Sort Benchmark
A.5 Data for the Minimal Cost Graph Coloring Benchmark

List of Figures

2-1 Example of How To Create a Thread in Java
2-2 An Example of Java's Monitors
2-3 An Example MPI Program
2-4 An Example of MPI's Collective Communications Operations
2-5 Collective Communication Operations in MPI
3-1 Global Data Layout for the Gaussian Elimination Benchmark
3-2 Average Runtimes for Gaussian Elimination Benchmark
3-3 Measured Speedup for Gaussian Elimination Benchmark
3-4 Average Runtimes for the Shallow Water Benchmark
3-5 Measured Speedup for the Shallow Water Benchmark
3-6 General Computation/Communications Pattern for the Shallow Water Benchmark
3-7 Simplified Description of a Simple sendrecv() Implementation for Hyperion
3-8 Optimized Computation/Communications Pattern for the Shallow Water Benchmark
3-9 Global Data Layout for the Matrix Multiplication Benchmark
3-10 Average Runtimes for the Matrix Multiplication Benchmark
3-11 Measured Speedup for the Matrix Multiplication Benchmark
3-12 Example Declaration of a "spread" Array in Split-C
3-13 Average Runtimes for the Radix Sort Benchmark
3-14 Measured Speedup for the Radix Sort Benchmark
3-15 Sort Time per Key for the Radix Sort Benchmark
3-16 Description of Dijkstra's Distributed Termination Detection Algorithm
3-17 Average Runtimes for the Minimal Cost Graph Coloring Benchmark
3-18 Measured Speedup for the Minimal Cost Graph Coloring Benchmark
4-1 Example of Clash Between Object-Oriented Style and Hyperion's DSM
4-2 Illustration of an Inefficient Hyperion Memory Use Pattern (1)
4-3 Example of Caching Objects Residing on Other Processors
4-4 Illustration of an Inefficient Hyperion Memory Use Pattern (2)
4-5 A Start at Moving a Thread's Variables to the Stack
4-6 A Sub-Optimal Way To Move Variables to a Thread's Stack
4-7 A Simplified Parallel Harness Class
4-8 Illustration of an Efficient Hyperion Memory Use Pattern
ABSTRACT
EVALUATING THE PERFORMANCE OF HYPERION, A DISTRIBUTED
SHARED MEMORY IMPLEMENTATION OF JAVA
by Kevin D. Clark
University of New Hampshire, May, 2005
In this thesis we examine the performance of Hyperion, a high-performance implementation of Java created at the University of New Hampshire (UNH). The intent of Hyperion is to allow programmers to develop their parallel applications in an environment that is common and familiar: multithreaded Java running on a personal workstation equipped with powerful development tools. After the program is developed, it can be moved without modification to a Hyperion system to be run on a cluster of computers. A representative set of parallel benchmarks has been developed in order to examine the performance of Hyperion, written once in pure Java and written again in C using the Message Passing Interface (MPI). The relative performance of each benchmark is compared. If any changes were made to a benchmark in order to obtain adequate performance, these changes are explained. In general, this research shows that it is possible to obtain reasonable performance when executing parallel Java code in Hyperion, but writing Java code that adheres to Hyperion's Distributed Shared Memory (DSM) features requires a great deal of attention to detail. In addition, the rich features in a message-passing library like MPI appear to make it easier to develop applications that perform well on a cluster of computers.
Chapter 1
Introduction
Parallel processing technology is important for solving various large problems. Some problems are too large to reasonably solve with today's processors without resorting to some kind of parallel processing technique. Such problems include simulations of large systems, genetic sequencing problems, fast Fourier transformations run on large data sets, large sorting problems, and real world optimization problems.
The question of how best to program parallel computers in general has been the subject of much research over the years. Two of the paradigms that have emerged from this research are message-passing and shared-memory systems. Message passing systems allow programmers to explicitly pass messages between processors acting in concert. Shared-memory systems allow processors acting in concert to access a globally shared memory area. Message-passing systems give the programmer the opportunity to control precisely how and when messages are sent at runtime, but this degree of control does come with added responsibility: the programmer has to precisely insert in his or her code calls to the message-passing Application Programming Interface (API) at every point communication is required. In contrast, shared-memory systems take care of all of the communication details for the programmer.
In recent years there has been much interest in constructing parallel processing systems
in terms of clusters of commodity equipment: off-the-shelf PCs running standard operating systems interconnected by commercially available networking equipment. Such systems are relatively inexpensive compared to dedicated parallel machines, relatively easy to assemble, and there exist programming tools and libraries to assist in the use of such clusters.

To this end, a cluster of workstations has been constructed at the University of New Hampshire for the purpose of facilitating research into parallel computation. This cluster (called the "Star Cluster") supports the two different parallel programming paradigms mentioned above—a message passing system and a shared memory system. The message passing system is an implementation of the open Message Passing Interface (MPI) [11] standard called MPICH [6]. The shared memory system is a version of Java developed at the University of New Hampshire called Hyperion. Hyperion provides a Java Virtual Machine (JVM) that conforms to the Java Language Specification (JLS) [5].

Java is an attractive language from a parallel programming perspective for many reasons. Java is a portable, well-defined, widely-available and flexible language that currently enjoys popularity because of its rich feature set. This feature set includes support for threads, synchronization mechanisms, and a thread-safe class library. The language and programming environment are fairly mature at this point, with many excellent programming tools available (in particular, optimizing compilers, debuggers, and a clean standard class library). A parallel application that is developed using the Java programming language has a very good chance of being able to run on tomorrow's computing platforms.

This thesis evaluates the parallel processing performance of Hyperion by comparing it to the performance of a fairly widespread message passing implementation (MPICH). A set of benchmarks has been developed for this purpose. Each benchmark has been implemented twice—once utilizing a threads paradigm (using Java's threads) and once
utilizing a message-passing paradigm (using C and MPI). All tests have been run on the same hardware. The MPI-based benchmarks were fairly straightforward to implement, but the Java threads-based benchmarks benefited from additional optimizations due to the way that threads and distributed shared-memory are implemented in Hyperion. All such modifications and optimizations are noted and discussed in this thesis.

In Chapter 2 we discuss Hyperion, MPI, and the benchmarks that have been written in order to evaluate these two parallel programming approaches. In Chapter 3 we discuss the performance results of the benchmarks themselves, including specific modifications that were made to the benchmarks in order to improve their performance. In Chapter 4 a detailed examination of the lessons learned while porting Java code to the Hyperion environment is presented. Finally, conclusions are presented in Chapter 5.
Chapter 2

Background and Related Work
2.1 The Star Cluster
The Star Cluster is a collection of commodity PCs connected together using a private, high-speed network switch. Acting in concert, this cluster of equipment is a parallel computer. The types of parallel programming environments that this cluster supports include Hyperion and MPICH (both described in detail below).

The hardware for this cluster consists of 16 PCs connected via a 100 Mb/s Ethernet switch. Each PC is a 662 MHz Intel Pentium III (Coppermine), configured with 512 MB RAM and SCSI hard-disks, running Linux, kernel version 2.4.4.
2.2 Hyperion

2.2.1 Motivations
Hyperion is an implementation of Java version 1.1 that has been created at the University of New Hampshire. The creation of Hyperion addresses several motivations: First, Hyperion provides an execution environment that can harness the power of a cluster of workstations. Second, Hyperion implements a common programming environment: Java. Third, this programming environment is standard—for the most part, there are no language extensions
to adjust to, no new APIs to learn, and no required source code changes that would prevent the programmer from using some other Java implementation.
2.2.2 Hyperion's Compiler

Generally speaking, Hyperion is comprised of two parts: a compiler and a runtime environment. Hyperion's compiler does not just translate Java source code into Java bytecode—it also translates this code into raw executable machine code. The compiler does this by taking the output of a standard Java compiler (Java bytecode), translating this into C code, and then running this code through an optimizing C compiler (GCC). This point bears restating: at runtime, the Hyperion system does not interpret Java bytecode; instead the generated machine code is executed. Hyperion's compiler is described in detail in [1].
2.2.3 Hyperion's Runtime Subsystem

Hyperion's runtime system provides all of the functionality that a standard multithreaded Java program would need in order to run (generally unmodified) on a cluster of workstations. A large subset of the Java class library has been implemented and is available for any programmer to use. More importantly, the Hyperion runtime system provides support for Java's thread creation, synchronization, and shared-memory subsystems. These features are central to the work in this thesis; therefore these will be explained in detail here.
2.2.3.1 Threads

In the Java programming environment, the standard way to create a new thread is to use methods in the java.lang.Thread class. For an example of how a new thread is created, please refer to Figure 2-1.
public class SomeClass implements Runnable {

    public static void main(String argv[]) {
        SomeClass sc = new SomeClass();
        Thread thr = new Thread(sc);
        thr.start();

        System.out.println("original thread is running here");

        // wait for newly created thread "thr" to finish
        try { thr.join(); } catch (InterruptedException e) { }
    }

    public void run() {
        System.out.println("newly spawned thread is running here");
    }
}

Figure 2-1: Example of How To Create a Thread in Java.
Hyperion’sruntimeimplementsallofthemethodsinthisclass,sostandardmultithread-ed/parallelJavaapplicationsdonotneedtobemodifiedinordertorunontheStarCluster.Hyperion’sruntimecreatesthreadsinthefollowingmanner:eachthreadthatiscreatedisdistributedinaround-robinmannertoaCPUthatisparticipatinginthecluster.Itispossibletocreatemorethreadsthanprocessorsinthecluster.Inthissituation,oneormoreoftheprocessorsruntwoormorethreadssimultaneously.Noneofthebenchmarksdetailedbythisthesisusethisfeature,butoccasionallythisfeaturecanbeuseful.1
2.2.3.2Monitors
Criticalsectionsofcode(sectionsofcodethatshouldbeexecutedbyonlyonethreadatatime)canbeprotectedwithJava’smonitors.HyperionfullysupportsJava’smonitors.
1
Anexampleofthiswillbepresentedinsection3.6.3.
6
Monitors are used in Java via:

• the synchronized keyword
• the java.lang.Object.wait() method
• the java.lang.Object.notify() method
• the java.lang.Object.notifyAll() method

An example of the use of Java's monitors can be found in Figure 2-2. This example presents a class called FixedIntArrayQueue, which can be used as a queue between reading and writing threads.
The JVM implements monitors in terms of locks. The process of acquiring and releasing locks has certain side-effects on the memory subsystem. These will be discussed in the next section.
2.2.3.3 Other Synchronization/Communication Mechanisms

Java was designed to facilitate general concurrent programming—not parallel computation. Until recently ([13]), there has been no direct support for parallel computation in Java. As such, with the version of Java used with Hyperion (Java 1.1), there are no native primitives for barrier synchronization or collective communications. The fact that this version of Java does not specify any of these commonly used features is not serious. Hyperion's runtime does provide some support for these operations. Currently, Hyperion's runtime provides support for the following operations:
• Array broadcast of simple scalars (floats, doubles, etc.).

• Simple reduction (all threads can find global maximum, minimum, and sum of simple scalars).
public class FixedIntArrayQueue {

    int m_queueSize, m_numElements = 0,
        m_head, m_tail, m_intArr[];

    public FixedIntArrayQueue(int queueSize) {
        m_queueSize = queueSize;
        m_intArr = new int[m_queueSize];
        m_numElements = m_head = m_tail = 0;
    }

    // current number of queued elements
    public synchronized int numElements() {
        return m_numElements;
    }

    public synchronized void insertAtBack(int n) {
        while (numElements() == m_queueSize) {
            try { wait(); } catch (InterruptedException e) { }
        }
        m_intArr[m_tail] = n;
        m_tail = (m_tail + 1) % m_queueSize;
        if (m_numElements++ == 0)
            notifyAll();
    }

    public synchronized int removeFromFront() {
        int result;

        while (numElements() == 0) {
            try { wait(); } catch (InterruptedException e) { }
        }
        result = m_intArr[m_head];
        m_head = (m_head + 1) % m_queueSize;
        if (m_numElements-- == m_queueSize)
            notifyAll();
        return result;
    }
}

Figure 2-2: An Example of Java's Monitors.
• Custom reduction—as part of this thesis Hyperion's Reduction class was extended to allow for all processors to collectively determine a global maximum and array index of that maximum. More details regarding this reduction are presented in the discussion of the Gaussian elimination benchmark in section 3.2.1.

These operations have been optimized to be efficient in the cluster environment in which Hyperion runs. For example, the barrier synchronization operation uses a minimum of low-level message exchanges (it uses a log P algorithm).²

² Hyperion's runtime doesn't offer a standard barrier synchronization mechanism—the reality is that Hyperion's reduction operations are the most efficient communications operations available in the system. Developers who need a standard barrier synchronization operation only have to call hyperion.Reduction.sum(0) and ignore the subsequent return value.
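To make the "log P" point concrete, the following sketch shows a recursive-doubling summation, written here in C with MPI point-to-point calls purely as illustration (MPI itself is introduced in section 2.3). This is not Hyperion's actual implementation, which lives inside its runtime; the sketch also assumes the number of processors is a power of two, as it is on the Star Cluster.

#include "mpi.h"

/* Illustrative recursive-doubling all-reduce (sum): after log2(P) rounds of
   pairwise exchanges, every processor holds the global sum.  Assumes the
   number of processors P is a power of two. */
double allreduce_sum(double local, int whoami, int procs)
{
    double partial = local, incoming;
    MPI_Status status;
    int mask;

    for (mask = 1; mask < procs; mask <<= 1) {
        int partner = whoami ^ mask;          /* exchange with this rank */
        MPI_Sendrecv(&partial, 1, MPI_DOUBLE, partner, 0,
                     &incoming, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, &status);
        partial += incoming;                  /* combine the partial sums */
    }
    return partial;
}

Because every processor ends up holding the combined result, a reduction of this shape called with a contribution of zero also serves as a barrier, which is how Hyperion's Reduction class is used in later chapters.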
2.2.3.4 Distributed Shared Memory Subsystem

Hyperion's Distributed Shared Memory (DSM) subsystem complies with the rules set forth in the Java Language Specification (JLS). A Java programmer who deals with threads must completely understand the memory consistency rules codified in the JLS. In addition to understanding these rules, a Java programmer who wishes to use the Hyperion system efficiently must also understand certain straightforward (but far-reaching) properties of Hyperion's DSM subsystem. All of these issues are detailed in this subsection.
Each thread that Hyperion creates on behalf of the user's program is running in the same Java Virtual Machine (JVM). In turn, this JVM is running on a cluster of workstations (the Star Cluster). The JLS specifies that all of these threads run in a shared memory area (called main memory). Hyperion's Distributed Shared Memory (DSM) subsystem provides main memory that all of the threads running in the distributed JVM can access. Main memory is where the JVM's heap resides (objects that are allocated dynamically use heap memory).³

³ A thread running in a JVM cannot access another thread's stack. This is notably different than typical threads packages that are available for the C programming language. Such access is nearly always an error. The fact that Java disallows such access is much more useful than not.

As threads in the JVM use regions of main memory, they cache regions of main memory into what is called working memory. Each thread maintains its own region of working memory. The JLS requires that each thread maintain a coherent copy of its own working memory but does not dictate that a thread's working memory be identical to main memory or indeed to some other thread's working memory either.

The JLS requires that threads running in a JVM resolve the differences between their own working memory and main memory when locks are acquired and when locks are released. Locks are one of the components of Java's monitors (described above). This design implies that at a given instant the threads running in a JVM can have a greatly different view of what main memory truly looks like. While this might seem alarming, this design is flexible and still allows the programmer to write programs that are correct. This design is flexible because JVM implementors are not required to ensure that a thread's working memory is always consistent with main memory (this would get more and more inefficient as the number of threads increased). This relaxed design allows the programmer to write correct programs because the programmer can use synchronization mechanisms at strategic points in the program to ensure that every thread has a consistent view of memory when this consistency matters. Without using synchronization at strategic points in a multithreaded/parallel program, the program will likely suffer from race conditions.

Hyperion's DSM subsystem adheres to all of the memory consistency rules specified by the JLS. However, it also has the following additional property: memory access is non-uniform. In the Hyperion system, memory that is allocated by a thread running on a given processor is local to that processor. Accessing local memory is very efficient. Accessing non-local memory is supported by Hyperion but incurs a performance penalty. The logistics of how Hyperion's DSM subsystem works are explained in detail in [1]. Generally speaking, main memory in Hyperion's JVM consists of a distributed collection of memory that is allocated locally on each processor participating in the cluster. Programmers who wish to use Hyperion's DSM subsystem efficiently must strive to minimize the amount of non-local memory access that each thread performs.
2.2.3.5 Low-level Communications

Much of Hyperion's runtime is implemented in terms of the Distributed Shared Memory on a Parallel Multithreaded (DSM-PM2) library [12]. In particular, this library is used for its efficient thread and distributed shared-memory mechanisms. The thread functionality that DSM-PM2 provides is more efficient than the native thread library in this cluster's environment (POSIX threads running on Linux). This is especially true in the area of signal-handling. DSM-PM2 uses signal-handling to implement page-faults—efficiency in this area is crucial to the overall performance of Hyperion's DSM subsystem.

On the Star Cluster, DSM-PM2 has been configured to use Transmission Control Protocol (TCP) as its low-level transport protocol. TCP is a reliable, stream-based protocol that implements various congestion control mechanisms (slow-start, detection of lost/duplicated segments, etc.). It could be argued that TCP is not the optimal choice for the Star Cluster because in this cluster a private switched network is being used. The latency that TCP is generally understood to incur might not be appropriate for high-performance parallel computation, where low message delivery latency is usually prized. Nevertheless, TCP's performance is certainly adequate, and DSM-PM2 uses it for this thesis work.
2.2.3.6 Features of Java Not Present in Hyperion

Java specifies several features that tend to make programming more safe and convenient. Two of these features are garbage-collection and runtime checking of array bounds.

Having a robust garbage-collection system can make dealing with dynamic memory more convenient and help eliminate memory leaks. Hyperion does not currently implement a garbage-collector. Hyperion's DSM architecture complicates the design of any potential garbage-collection system, but the implementation of such a system isn't impossible either. With the absence of a garbage-collector, memory dynamically allocated in a process running via Hyperion is never returned to the free memory pool. Obviously, this means that some Java programs might have to be adjusted to be more frugal with their memory usage when run via Hyperion.

Performing runtime checks on array boundaries helps flag an extremely common programming error. In Java, when a program goes beyond an array's bounds, an exception is thrown. Technically speaking, Hyperion does implement array bounds checking, as dictated by the JLS. However, these checks introduce a significant performance penalty. Array bounds shouldn't be exceeded in properly running programs, so, for the purposes of this thesis Hyperion has been configured to not perform any array bounds checking. All of the benchmarks that this thesis discusses are scientific programs with fairly regular array bounds. Modern compilers commonly optimize array bounds checking out of such programs. Hyperion's compiler does not happen to implement this optimization, but it no doubt will someday. So, again, turning off array bounds checking seems very reasonable.
2.3 MPI

"Message Passing" refers to an approach to parallel-programming whereby processors comprising a parallel computer send each other messages to facilitate the overall parallel computation. Message Passing Interface (MPI) is an industry-standard message passing library [11]. This library has bindings to C, Fortran, and Java. The design of MPI emphasizes correctness, portability, and convenience. In addition, MPI was designed to allow for efficient and scalable implementations (for example, it allows for zero-copy communications operations). MPI itself is an open standard, and quite a few implementations of this standard exist. The implementation that is being used for this thesis is called MPICH [6]. Please refer to Figure 2-3 for an example MPI program.
2.3.1 Communications Operations

MPI provides a convenient API for developers to program with. With this API, a developer no longer has to code to whatever API the vendor of their target parallel machine has provided. This facilitates portability. The set of communications operations offered by MPI is feature-rich—sometimes more feature-rich than the set of operations offered by the vendor's API. Sometimes this can be very useful. A simple example of a useful communications operation is shown in Figure 2-4.
MPI offers a rich variety of communications operations:

• Barrier synchronization.

• Blocking and non-blocking send/receive operations.

• Operations that combine sending and receiving.

• Operations that work on scalars and arrays.
#include\"mpi.h\"
intwhoami;/*whichnodenumberami?*/intnumProcessors;/*totalnumberofprocessors
runningthisprogram*/enum{SAMPLE_DATA1,SAMPLE_DATA2,ETC};intmain(intargc,char*argv[]){
enum{DATA_SIZE=200};
doubledata[DATA_SIZE],data2[DATA_SIZE];MPI_Statusstatus;MPI_Init(&argc,&argv);
/*getnumberofprocessesintheprocessgroup*/MPI_Comm_size(MPI_COMM_WORLD,&numProcessors);
/*getthisprocess’srank(ID)withintheprocessgroup*/MPI_Comm_rank(MPI_COMM_WORLD,&whoami);MPI_Barrier(MPI_COMM_WORLD);
/*barriersynchronization*/
/*messagetypes*/
do_work(data,DATA_SIZE);/*performarbitrarycalculation*/if(whoami==0){
/*senddata[]toprocessor1*/
MPI_Send(data,DATA_SIZE,MPI_DOUBLE,1,
SAMPLE_DATA1,MPI_COMM_WORLD);
}
else{
/*blockingreceive;waitforSAMPLE_DATA1
datafromprocessor0*/
MPI_Recv(data2,DATA_SIZE,MPI_DOUBLE,MPI_ANY_SOURCE,
SAMPLE_DATA1,MPI_COMM_WORLD,&status);
}
MPI_Barrier(MPI_COMM_WORLD);returnEXIT_SUCCESS;}
Figure2-3:AnExampleMPIProgram.
/*barriersynchronization*/
/* every processor executes this code... */
int local_sum, global_sum;

local_sum = calculate_local_sum();

MPI_Allreduce(&local_sum, &global_sum, 1,
              MPI_INT, MPI_SUM, MPI_COMM_WORLD);

/* at this point global_sum contains the sum of all
   of the local_sum values in the process group */

Figure 2-4: An Example of MPI's Collective Communications Operations.
• Operations that allow the developer to specify a custom datatype and then use this datatype just like any other MPI datatype (a sketch appears at the end of this subsection).

• A rich variety of "collective communications operations," including:

  – Reduction to a single processor.
  – Reduction to all processors.
  – Broadcast to all processors. This operation is illustrated in Figure 2-5.
  – Scatter data from one processor to all other processors. This operation is illustrated in Figure 2-5.
  – Gather data from all processors to a single processor. This operation is illustrated in Figure 2-5.
  – Gather data from all processors and collect the result on all processors (this is called "allgather"). This operation is illustrated in Figure 2-5.
  – A combination scatter/gather operation (sometimes called a "full-exchange"). In MPI terminology this is called "alltoall." This operation is illustrated in Figure 2-5.
  – Prefix scan operations.

[Diagram showing how data items are moved among six processes by the broadcast, scatter, gather, allgather, and alltoall operations.]

Figure 2-5: Collective Communication Operations in MPI.
The MPI specification does not dictate that implementations of these communications operations are efficient. Of course, implementors of the MPI specification are encouraged to make their MPI implementations (such as MPICH) as efficient as possible.

The MPI specification does not dictate which low-level transport protocols should be used in implementations of MPI. MPICH is flexible in this regard; it can be configured to use various other parallel communications systems as its backend (like Intel NX, MPL, Ncube's communication library, etc.). For this particular thesis, MPICH's "ch_p4" backend has been employed. This backend uses TCP as its transport protocol. Everything that was said regarding Hyperion/DSM-PM2's use of TCP (above) also applies to MPICH's use of TCP.
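As a small illustration of the custom datatype feature listed earlier in this section, the following minimal sketch (not taken from any of the benchmarks in this thesis) builds a derived datatype describing one column of a matrix of doubles and then sends that column with a single MPI_Send() call:

#include "mpi.h"

enum { ROWS = 256, COLS = 256 };

/* Build a derived datatype describing one column of a ROWS x COLS matrix
   of doubles (ROWS blocks of one element, each COLS elements apart), and
   send column `col` of `matrix` to processor `dest`. */
void send_column(double matrix[ROWS][COLS], int col, int dest, int tag)
{
    MPI_Datatype column_type;

    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    /* once committed, the derived type is used exactly like a built-in
       datatype such as MPI_DOUBLE */
    MPI_Send(&matrix[0][col], 1, column_type, dest, tag, MPI_COMM_WORLD);

    MPI_Type_free(&column_type);
}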
2.4 The Benchmarks

2.4.1 Overview
The benchmarks that this thesis discusses have been chosen because it was thought that these were a fairly representative set of parallel applications. It is thought that the performance of these benchmarks will provide a good insight into the true performance of Hyperion. These are the benchmarks:
• Gaussian elimination to solve a system of linear equations.
• A shallow water simulation.
• Matrix multiplication.
• A radix sort, as described in [3].
• A minimal cost graph-coloring program.
Each benchmark has been implemented twice: once in Java and once in C/MPI. In the initial development of most of the Java versions of these benchmarks no attention was paid to Hyperion's runtime characteristics (especially the properties of its DSM subsystem). The performance of the initial versions of these Java benchmarks in Hyperion was generally abysmal and these results are not worth discussing here. However, what this thesis does present are the steps that had to be performed in order for the Java versions of the benchmarks to perform adequately with Hyperion. In each benchmark section, the performance of the C/MPI and Hyperion versions are compared.
2.4.2 Notes Regarding System Configuration and Data Collection

Hyperion is an implementation of Java 1.1. It uses the Sun Java 1.1 Java Development Kit compiler to generate bytecode. Next, the Hyperion compiler generates C code from this Java bytecode. After this, the GCC 3.0 compiler is used to compile this C code into executable code. Hyperion invokes gcc with the "-O6" flag for optimization. At the time of execution, Hyperion's runtime invokes many services from the DSM-PM2 package. As configured, Hyperion doesn't have a garbage-collector nor does it perform any array bounds checks (both of these are discussed above).

The C/MPI environment uses GCC 3.0 and MPICH [6] version 1.2.2. All of the C/MPI code in these benchmarks was compiled with gcc's "-O6" optimization flag.

All of the data presented in the benchmark sections in the next chapter were generated as the result of running each benchmark three times and then generating an average of these results. Using averages in this case is justified because, for the most part, all of the runtime data collected in these experiments is relatively stable. Most of the calculated standard deviation values for the collected data are less than 1.2. The largest standard deviation value for any of the collected data is 32.016 (for the minimal graph-coloring benchmark). This large calculated standard deviation value does not detract from the overall analysis of the minimal graph-coloring benchmark, since the corresponding C/MPI program consistently runs in hundreds of seconds less than the Hyperion version of this benchmark (over thirteen times the value of the standard deviation in this case). The use of averages in this thesis clearly conveys the performance of each version of each benchmark. Interested readers are invited to look at the appendices of this document for the original data.
2.5 Related Work

2.5.1 Mathew Reno's Masters Thesis

Reno's thesis [16] examines the performance of parallel benchmarks obtained from the Java Grande Forum (JGF). Two versions of these benchmarks are available: the first is implemented using a message-passing style and the second is written in a shared-memory style. The message-passing versions of these benchmarks use a variant of MPI called javaMPI. The shared-memory versions of these benchmarks are written using standard Java. After porting javaMPI to the Hyperion JVM, both versions of these benchmarks were run with Hyperion on two different clusters: the Star Cluster and the Paraski Cluster. The Star Cluster is the same cluster as used in this thesis, while the Paraski Cluster is another cluster of workstations located in Rennes, France. The Paraski Cluster consists of 16 Pentium III 1 GHz PCs interconnected by a 2 Gb/s Myrinet network. The Paraski Cluster's network has more bandwidth and has a lower latency than the Star Cluster's network. Reno's thesis showed that it was possible to get the shared-memory versions of these benchmarks to run in times comparable to those of the message-passing versions of these benchmarks. However, the original shared-memory versions of the benchmarks from the JGF assumed a uniform-access shared-memory environment. This assumption initially led to unacceptable performance figures from the shared-memory versions of these benchmarks. After these benchmarks were modified to no longer assume a uniform-access shared-memory environment, their performance figures became comparable with the message-passing versions of these benchmarks. In particular, when the shared-memory benchmarks were run on the Paraski Cluster, with its lower latency network, the performance of these benchmarks became much more comparable. The modifications that were required for this port generally
entailed transforming these shared-memory benchmarks into benchmarks that resembled their message-passing counterparts.
2.5.2 TreadMarks

TreadMarks [8] is a parallel computing environment with a distributed shared-memory subsystem (implemented in software). TreadMarks's creators have benchmarked this system against PVM (another message-passing system, somewhat similar to MPI) [10]. For this comparison, TreadMarks's creators used eight benchmarks from the SPLASH benchmark suite. The TreadMarks system provides a C API and is portable to a number of Unix systems. Like Hyperion, global memory in TreadMarks consists of the sum total of each participating processor's local memory.

TreadMarks's memory consistency scheme is somewhat similar to the scheme that the JLS dictates (threads need to use locks in order for remote memory changes to be seen by other processors). However, one difference between these systems is that TreadMarks implements a "lazy" release consistency scheme whereas Java implements an "eager" release consistency scheme. The JLS dictates that changes to a thread's working memory are flushed back to main memory when a lock is relinquished, whereas in the TreadMarks system changes are propagated when locks are acquired. In the TreadMarks environment, access to local memory is efficient and access to non-local memory incurs a performance penalty.

The researchers who benchmarked TreadMarks against PVM used two different platforms to gather data from; the first platform consisted of eight Sparc-20 model 61 workstations connected by a 155 Mb/s ATM network and the second consisted of an eight processor IBM SP/2.
The results of the comparison between TreadMarks and PVM were as follows:
• Each TreadMarks version of each benchmark issued more low-level communications messages than the corresponding PVM versions of each benchmark.

• In comparison to the CPU speeds of the processors in each of these parallel computers, the speed of network communication is very slow.

• The observed TreadMarks speedup for six out of the eight benchmarks was within 15% of the observed PVM speedup; for the seventh application the difference was between 15-30%, and for the last the difference was almost 50%.
Many programmers prefer coding parallel programs using a shared-memory style, and a 15% discrepancy between a shared-memory implementation and a message-passing implementation might be small enough for them to accept.
2.5.3 Split-C

The radix sort benchmark that this thesis discusses was derived from one of the benchmarks for Split-C [2]. Split-C is a variant of the C programming language with extensions for parallel processing. These extensions include a mix of functionality that allows a programmer to use shared-memory, message-passing, and even data-parallel techniques. Split-C offers the following features:
• A Single Program, Multiple Data (SPMD) environment.

• A distributed, shared memory subsystem. Global memory consists of the sum total of all of the local memory segments of the processors involved in the parallel computation (just like Hyperion). Any processor may access any of the global memory, but accessing local memory is efficient and accessing non-local memory is more expensive (just like Hyperion). Unlike Hyperion, in Split-C access to non-local memory must be coded explicitly, typically by using global pointers and arrays. In Hyperion, access to non-local memory happens without any special code and the runtime system performs this process automatically.⁴

• "Spread arrays"—Split-C allows the programmer to define global arrays whose indices include the processor number. Such arrays are spread across all of the processors participating in the parallel computation. The programmer can operate on such arrays in a data-parallel manner if they so wish.

• "Split-phase" memory usage operators—like Hyperion, Split-C's DSM subsystem enforces a relaxed memory consistency scheme. Split-C's split-phase operators allow a programmer to pack together memory-modifications so that they can be performed in bulk or overlapped with computation.

• Split-C provides the programmer with efficient bulk data transfer operations, which allows a programmer using Split-C to gain some of the advantages of message-passing. Hyperion offers a similar feature through its java.lang.System.arraycopy() implementation.

The designers of Split-C developed language extensions to C in order to facilitate an environment for developing parallel programs. This differs from the approach taken by the designers of Hyperion in one very important manner: The designers of Hyperion purposely decided that developing extensions to a well-known language was not desirable. The designers of Hyperion felt that language extensions made potential users of a system apprehensive, not wanting to get trapped into using a non-standard language that might not be around for the life of their parallel application.
⁴ It will be shown in section 4.1.1.3 that this feature isn't always desirable.
Chapter 3

Benchmark Comparison

3.1 Introduction

This chapter describes the benchmarks used in this thesis. In each section, a comparison is made between the observed performance of the Java/Hyperion and the C/MPI versions of each benchmark. The Java/Hyperion versions of these benchmarks had to be modified from their initial implementations in order for their performance to be comparable to the C/MPI versions of these benchmarks. In each section, these modifications are discussed.

As a reminder, all of the data in this chapter was collected in the following manner: each benchmark program was run three times. An average was calculated from these three samples. In the graphs presented in this chapter, the values that are presented are based on these calculated averages. All of the raw benchmark data appears in the appendices of this document.
3.2 Gaussian Elimination

This benchmark involves performing Gaussian elimination on a system of linear equations, represented by a 2048 x 2048 matrix of floats. Two programs were developed for this comparison, the first in Java and the second in C/MPI. The initial implementation of these
[Diagram: matrices A and B, with blocks 0-3 owned by Proc 0, 4-7 by Proc 1, 8-11 by Proc 2, and 12-15 by Proc 3.]

Figure 3-1: Global Data Layout for the Gaussian Elimination Benchmark.

programs was based on code written by Professor Philip Hatcher.
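For reference, the computation being parallelized is the textbook elimination step, common to both implementations and not specific to either: at step k a pivot row is chosen and every remaining row i > k of the matrix is updated as

\[
a_{ij} \leftarrow a_{ij} - \frac{a_{ik}}{a_{kk}}\, a_{kj}, \qquad i > k,\ j \ge k,
\]

where the pivot row is the one whose entry in column k has the largest absolute value (the "partial pivoting" step discussed below).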
3.2.1 Java Implementation

The Java version of this benchmark is fairly straightforward, and yet has been heavily modified from the original shared-memory version. The original version of this benchmark used a static array in which all of the processors worked; in the updated benchmark each processor maintains its own portion of the global array. Figure 3-1 illustrates this memory arrangement. By making this change, memory contention and network communication are greatly reduced. This single change transformed this benchmark from being unusable in a Hyperion environment to being a benchmark that performs reasonably.
At each step during the process of Gaussian elimination a process of finding the best pivot row in the global matrix needs to be performed (this is known as "partial pivoting" [14]). This step involves all of the processors collectively agreeing on which row in the linear system contains the variable with the largest absolute value. It is important to note that this process involves two pieces of data:

• the variable's absolute value
• the row in which the variable occurs
In the original version of this benchmark, partial pivoting was easy to implement since all of the data was assumed to be easily at hand. With the changes described above, this process was no longer straightforward. Hyperion provides an efficient Reduction class, but this class does not provide the ability to reduce these two variables atomically. To complicate matters, the back-end of Hyperion's efficient Reduction class isn't written in Java—it is implemented in C. The task of modifying Hyperion's Reduction class for this benchmark's purposes was accomplished with some assistance by extending some of the low-level Reduction code that is part of the Hyperion runtime system. At this time if a parallel programmer needs to use any sort of customized Reduction operation with Hyperion, they need to resort to implementing this operation in C, much in the same way that was done for this benchmark. After the location of the pivot row is collectively determined by the threads performing Gaussian elimination, the contents of the pivot row are distributed from the row's owner to all of the other threads serially.
3.2.2 C/MPI Implementation

The updated Java implementation of this benchmark is very similar to the C/MPI version of this benchmark. The partial pivoting code in the MPI benchmark uses MPI's MPI_Allreduce() function. This function provides an efficient and easy-to-extend reduction operation. After the pivot row is determined, this row is distributed from the processor that owns the row to all of the other processors serially. MPI offers facilities to "broadcast" data like this from one processor to many others (see Figure 2-5), but this feature wasn't used in this particular benchmark.
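For illustration, the two-piece reduction described in section 3.2.1 (the largest absolute value together with the row that holds it) maps directly onto MPI's predefined MPI_MAXLOC operation. The helper below is a minimal sketch of that idea and is not the benchmark's actual code:

#include "mpi.h"

/* Every processor passes in the largest absolute value it owns in the
   current column along with that value's global row number; on return,
   every processor knows the winning row. */
int find_global_pivot(float local_max_abs, int local_row)
{
    struct { float value; int row; } in, out;

    in.value = local_max_abs;
    in.row   = local_row;

    /* MPI_MAXLOC reduces (value, index) pairs: the result is the largest
       value together with the index that accompanied it. */
    MPI_Allreduce(&in, &out, 1, MPI_FLOAT_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    return out.row;
}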
Figure 3-2: Average Runtimes for Gaussian Elimination Benchmark.

3.2.3 Results and Performance

Figure 3-2 presents the raw runtimes obtained in this benchmark. The C/MPI implementation of this benchmark is quite a bit faster than the Java/Hyperion implementation of this benchmark. This is a general issue that we do not address in this thesis. The rationale for this decision is that this report concerns the subject of parallel computation and not compilation and Java Virtual Machine (JVM) performance. The rest of this report focuses on speedup calculations and not on raw running times, which will insulate this discussion from these details.¹

¹ It should be further noted that these results in no way contradict Reno's Thesis [16]. This thesis compares benchmarks written in C/MPI to benchmarks written in Java/Hyperion. Reno's thesis compares benchmarks written in javaMPI (ported to Hyperion) to shared-memory benchmarks written in Java (also ported to Hyperion).
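Throughout the rest of this chapter, speedup is understood in the usual sense, computed from the averaged runtimes described in section 2.4.2:

\[
\text{speedup}(p) \;=\; \frac{T_1}{T_p},
\]

where $T_1$ is the runtime on one processor and $T_p$ is the runtime on $p$ processors.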
Figure 3-3: Measured Speedup for Gaussian Elimination Benchmark.

Figure 3-3 presents the calculated speedup for these benchmarks. This figure shows that the speedup obtained from running these benchmarks in parallel is comparable between the Hyperion and C/MPI versions of these benchmarks. Gaussian elimination is a problem that does not parallelize linearly as the number of processors increases – at some point there isn't enough work for each processor to do. For this particular benchmark, this fact becomes especially apparent when the number of processors increases from 8 to 16 processors.
3.3 Shallow Water

The shallow water benchmark is a two-dimensional simulation of a non-compressible mass of particles as they rotate on a hypothetical flat earth (represented by a 256 x 256 array of floats). This simulation uses the physical properties of mass, inertia, gravity, etc. as components of this simulation. Although this simulation is only performed in two dimensions, in many real-world systems of fluids (water, air, etc.) the horizontal component of each particle's motion greatly exceeds the vertical component of the particle's motion. So, while this simulation only considers motion in two dimensions, it is still a useful model for approximating the motion of particles in such a fluid system.
The particular algorithm used in this benchmark is described in [7]. The code that this benchmark is based upon comes from implementations written by Professor Philip Hatcher and Kenneth Cain. A comment in the inherited code reads:

    This program solves the system of shallow water equations on a rectangular grid using a finite differences method. It assumes periodic boundary conditions and uses the leapfrog time differencing scheme with a time filter.

This benchmark involves a fair amount of computation at each step, in addition to the usual cooperation between processors. This benchmark runs 1200 time steps, and at every time step each processor communicates with its neighbors at each compass point on a logical grid.
3.3.1 Java Implementation

The original Java implementation of this benchmark developed for this thesis was based on a shared-memory program originally written by Professor Philip Hatcher (written for C/POSIX threads). This code was not oriented toward Hyperion's distributed shared-memory architecture—all of the processors did their work in a static array allocated by Processor 0. The initial implementation of this benchmark performed extremely poorly as a result of the ensuing memory bottleneck. An attempt to remedy this deficiency was undertaken. However, the data-dependencies within this program made this task very difficult. The task of splitting up all of the arrays that this benchmark uses in the shared-memory version of this benchmark into a distributed set of arrays (along with all of the associated code changes) proved to be time-consuming and complicated. The C/MPI version of this benchmark had been recently completed (see below), therefore, it was decided that the most prudent thing to do would be to convert the C/MPI version over to Java/Hyperion. Using the C/MPI version
of this benchmark as a starting point made the development of the Hyperion version much easier, because the C/MPI version of the code was already written with the assumption that each processor would allocate and use its own memory. The communication pattern used by the C/MPI version of this benchmark is very regular and it was straightforward to come up with a Hyperion implementation of the one communication operator that this program uses over and over—sendrecv(). Generally speaking, this operation is used to transport processor grid edges to neighboring processors. Virtually all of the communication performed by this benchmark is implemented in terms of this operation. Except for having its own optimized Java implementation of sendrecv(), the Java/Hyperion implementation of this benchmark is nearly identical to the C/MPI implementation.
3.3.2 C/MPI Implementation

The C/MPI version of this benchmark is based on code originally written by Kenneth Cain. Cain's version appears to have been derived from a C∗ version of this code. In fact, Cain's version appears like it was derived from the output of the backend of a C∗ compiler. It was fortuitous that the author of this thesis came across Cain's code, because converting the original C∗ code to C/MPI was time-consuming and complicated.
3.3.3 Results and Performance

Currently, both parallel implementations of this benchmark are limited to running on either 4, 8, or 16 nodes. Other processor arrays could be used in the future with a small amount of work.

The performance of these benchmarks is summarized in Figure 3-4 and Figure 3-5. These two benchmark implementations exhibit virtually identical speedup for 4 and 8 processor
Figure 3-4: Average Runtimes for the Shallow Water Benchmark.

clusters, but at 16 processors the Java version of this benchmark exhibits notably poor speedup. The reasons for this are discussed here.

The shallow-water benchmark has a large number of computation and communications steps. The general computation-communication pattern for the shallow-water benchmark is presented in Figure 3-6.
The shallow-water benchmark makes extensive use of a sendrecv() operation. This is a common operation in parallel computation. In essence, when Processor x performs a sendrecv() operation, it sends a message to some Processor y and receives a message from some Processor z. The sendrecv() does not return to the caller until both the send and the receive operation are complete. In the most general case, x, y, and z can all be the same or different.
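MPI provides this operation directly as MPI_Sendrecv(). Purely as a point of comparison for the Hyperion implementation discussed next, a minimal C/MPI sketch of the "send an edge north, receive an edge from the south" exchange might look like the following (the function and parameter names are illustrative, not the benchmark's actual code):

#include "mpi.h"

enum { EDGE_TAG = 1 };

/* Send our northern grid edge to the processor above us and receive our
   southern neighbor's edge.  north and south are the neighbors' ranks on
   the logical (periodic) processor grid; n is the edge length. */
void exchange_edges(float send_edge[], float recv_edge[], int n,
                    int north, int south)
{
    MPI_Status status;

    MPI_Sendrecv(send_edge, n, MPI_FLOAT, north, EDGE_TAG,
                 recv_edge, n, MPI_FLOAT, south, EDGE_TAG,
                 MPI_COMM_WORLD, &status);
}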
Hyperion’sruntimedoesnotprovideasendrecv()operation,soonewasimplementedforthisthesis(otherbenchmarksinthisthesisusethisoperationtoo).Inorderforthissendrecv()implementationtobesimpleandfast,thisimplementationofsendrecv()does
Figure 3-5: Measured Speedup for the Shallow Water Benchmark.

do 1200 times
    computation
    Five (5) calls to sendrecv()
    computation
    Six (6) calls to sendrecv()
    computation
done

Figure 3-6: General Computation/Communications Pattern for the Shallow Water Benchmark.
not handle all of the things that a more general sendrecv() might. In the shallow-water benchmark, when sendrecv() is used, every processor is performing an operation that can be generally described like "send a chunk of data to the north and receive a chunk of data from the south"—every processor acts in lockstep for these operations. The version of sendrecv() that was written for this benchmark is designed to handle these communications patterns. This version of sendrecv() is listed in Figure 3-7.

Ideally, a simple sendrecv() operation such as is required for this benchmark would only involve communication and synchronization between two processors. Experimentation has shown, however, that sendrecv() implemented in this manner in terms of Java's native synchronization methods (synchronized, wait(), notify(), etc.) is extremely slow when run in Hyperion. At 16 processors, the runtimes observed for the shallow-water benchmark with this straightforward sendrecv() implementation were at least an order of magnitude higher than what is reported here in the appendices. In addition, the amount of generated network traffic transmitted during runtime increased tremendously. The root-cause of this problem remains unknown at this time.
It was necessary to find another way to perform this sendrecv() operation because the shallow-water benchmark performed poorly in Hyperion. Hyperion's System.arraycopy() was not suspected of being the culprit (this was already highly optimized for Reno's thesis [16]), and so this left reconsidering which synchronization operation to use.² In this case, the fastest synchronization primitive that the Hyperion system offers is a reduction operation (hyperion.Reduction.sum(0)—this acts as a barrier synchronization). This operation is a global operation, so when sendrecv() is implemented in terms of this operation it can no longer be said that sendrecv() merely involves synchronization between two processors—now it involves every processor.

² Although it does need to be pointed out that System.arraycopy() differs from an operation like MPI's MPI_Send() in one very important respect: When copying data to a remote processor with this method, System.arraycopy() does not return to the caller until it is confirmed that the data has been properly copied to the destination. MPI_Send(), in the case of the Star Cluster, utilizes TCP. MPICH's MPI_Send() depends on TCP's property of reliable transport, and this function simply returns to the caller when the data to be sent has been copied to the outgoing TCP buffer. System.arraycopy()'s confirmation that the data has safely arrived at a remote processor is an extra step that isn't performed in MPICH. MPICH is more efficient here. This line of reasoning also applies to implementations of sendrecv().

class ShallowMP implements Runnable {
    // this class implements the shallow-water
    // benchmark, in parallel

    /* shallowArray[] is a static array of "ShallowMP" instances.
       Each ShallowMP is associated with a single processor. */
    static ShallowMP shallowArray[];

    FloatQueue sendrecv_fq[];    // indexed by message tag value

    // ....

    void sendrecv(float sendbuf[], int sendcount, int dest,
                  int sendtag, float recvbuf[], int recvcount,
                  int source, int recvtag)
    {
        int lower = (whoami ...

        // ....

        // block while waiting for an incoming message
        // of type "recvtag"
        fqarr = sendrecv_fq[recvtag].removeFromFront();
        int smaller = (recvcount ...

        // ....
    }
}

Figure 3-7: Simplified description of a simple sendrecv() implementation for Hyperion. The FloatQueue class is extremely similar to the code presented in Figure 2-2, except that this class handles arrays of floats instead of single integers.

Finally, we arrive at a theory as to why the Hyperion version of this benchmark performs poorly with 16 processors:

1. Experimentation shows that using hyperion.Reduction.sum(0) is the fastest method of synchronization for use in a sendrecv() implementation.

2. It can be argued that a global reduction isn't really the right operation for this task. Every time such an operation is used, every processor must arrive at a given point and wait for all other processors to reach this point as well. This can result in processor stalling. The greater the number of processors participating in the calculation, the larger the chance becomes that some of the processors will have to wait around for a slightly slower processor.

3. This global reduction is called many times in a loop that executes 1200 times. To give the reader a clearer picture, each sendrecv() in Figure 3-6 translates into two calls to hyperion.Reduction.sum(0). So, the algorithm in Figure 3-6 calls hyperion.Reduction.sum(0) 1200 × ((5 × 2) + (6 × 2)) = 1200 × 22 = 26400 times.

4. The results presented in Figure 3-5 would be even worse except for the fact that further optimizations to sendrecv() are possible in this benchmark. These optimizations depend on the highly regular communication pattern utilized by this benchmark, i.e. if at every iteration every processor sends data to the processor to the north, then each processor can allocate a fixed buffer to receive incoming data from the south (and repeat for every cardinal and inter-cardinal compass direction). Such optimizations are not possible in a more general sendrecv() implementation (the MPI implementation of this operation cannot be optimized in this manner). By optimizing and restructuring the code to be like what is presented in Figure 3-8, a large number of calls to hyperion.Reduction.sum(0) were avoided. Instead of calling hyperion.Reduction.sum(0) 22 times per iteration (as described in the previous paragraph), this synchronization operation was only called 4 times per iteration.

do 1200 times
    computation
    Five (5) calls to modified sendrecv() -- NO SYNCHRONIZATION
    hyperion.Reduction.sum(0)
    data copying related to previous sendrecv() calls
    hyperion.Reduction.sum(0)
    computation
    Six (6) calls to modified sendrecv() -- NO SYNCHRONIZATION
    hyperion.Reduction.sum(0)
    data copying related to previous sendrecv() calls
    hyperion.Reduction.sum(0)
    computation
done

Figure 3-8: Optimized Computation/Communications Pattern for the Shallow Water Benchmark.

When the number of processors involved in the shallow-water benchmark is increased to 16 processors, the performance of this benchmark seems to suffer greatly for all of the aforementioned reasons. Unfortunately, at this time it is not possible to run programs executed in the Hyperion environment in conjunction with an execution profiling tool, so it is difficult to determine the exact cause of the slowdown. This subject is discussed further in section 4.1.2.

With 16 processors running this benchmark on a 256 x 256 matrix, the performance of this benchmark suffers from another problem as well: the amount of computation that needs to be performed per processor is relatively small. This is discussed further in section 3.4.3.
3.4 Matrix Multiplication

This benchmark involves a row-oriented matrix multiplication (A x B = C). Each matrix is a 2D matrix consisting of 1024 × 1024 floats. Both the C/MPI and Java/Hyperion versions of this benchmark were derived from serial code obtained from Professor Philip Hatcher.

3.4.1 Java Implementation

Being cognizant of the properties of Hyperion's DSM subsystem, this implementation was from its inception written such that processors maximized use of their own localized memory and minimized use of remote memory. The resulting program looks very similar in structure to a message-passing program.

3.4.2 C/MPI Implementation

The C/MPI implementation of this benchmark is a straightforward derivation of the original serial C code. Each processor owns a "chunk" of the matrices being worked with. This data layout is illustrated in Figure 3-9. After Processor x gets done working with the chunk of matrix B that it currently has in its possession, it passes this chunk off to Processor (x + 1) mod numprocessors. After each processor has worked with all of the chunks from the matrix B, each processor owns a chunk of the end-result matrix C. Matrix C is checked for correctness at the end of the calculation. MPI's sendrecv() is the main communications workhorse in this benchmark.

Figure 3-9: Global Data Layout for the Matrix Multiplication Benchmark.

3.4.3 Results and Performance

The performance of these benchmarks is summarized in Figure 3-10 and Figure 3-11. These two benchmark implementations exhibit virtually identical speedup for 2, 4 and 8 processor clusters. When a 16 processor cluster is used, the Hyperion version of this benchmark lags behind the MPI version of this benchmark in terms of speedup. The performance of the MPI version of this benchmark is similar to the results reported by [14].

The reason for this benchmark's sub-optimal performance on the Java/Hyperion 16-node cluster is presently unknown. The code for the C/MPI and Java/Hyperion programs is very similar. A small difference between the two implementations is that the C/MPI program enjoys a small advantage over the Hyperion implementation: Both of these programs use 2D arrays, and the fact that the memory for a 2D array is contiguous in C makes copying matrix data from one processor to another slightly more efficient. The Java/Hyperion version of this benchmark attempts to minimize this difference by using a 1D staging array to copy data to and from each processor. Copying data to and from this staging area adds additional memory-copying to every interprocessor-communication operation, but overall, this should be a relatively inexpensive operation (the additional memory copies are all local).

Figure 3-10: Average Runtimes for the Matrix Multiplication Benchmark.

Figure 3-11: Measured Speedup for the Matrix Multiplication Benchmark.

The Java/Hyperion version of this program is extremely similar to the C/MPI version of the program—in fact, a sendrecv() implementation is also used in this program as well. The Java/Hyperion version of this benchmark may in fact be suffering from the same problem as the Java/Hyperion shallow-water benchmark. In particular, the shallow-water benchmark employs an O(n²) algorithm and operates on a relatively small problem, whereas the matrix multiplication benchmark employs an O(n³) algorithm and operates on a relatively large problem. These two benchmarks could be suffering from the same problem, but in fact the shallow-water benchmark might be displaying this problem much more prominently because, compared to the matrix multiplication benchmark, it is dominated by communication instead of computation.
3.5 Radix Sort

3.5.1 Overview

For this benchmark, a parallel radix sort benchmark (as described in [3]) was ported to both Java and C/MPI. This benchmark is a parallel "straight radix sort"—integer keys are sorted from least-significant-bit to most-significant-bit. More precisely, this sort works as follows: The items to be sorted are b-bit keys (in this case, b = 32). These keys are sorted using an r-bit radix (in this case, r = 8). The r-bit radix specifies an r-bit digit within each b-bit key (much like "3" is a digit in the base-10 number "1234"). This sorting process uses b/r passes to complete the sorting process. At each pass in the sort each key is partially sorted according to the r-bit digit associated with the pass. This step is accomplished by forming a global histogram of the keys/digits being sorted in the given pass and by then redistributing the partially sorted keys amongst the participating processors. At the end of the b/r passes the integer keys have been redistributed to the processors such that all of the keys residing on Processor k are in sorted order and all of these keys are less than or equal to all of the keys residing on Processor k+1.

The implementation of this benchmark places a great deal of stress on the communications facilities of a parallel-programming environment because there is both a lot of broadcasting as well as irregular communications (communication patterns that can't be known a priori).

3.5.2 Split-C

The radix-sort benchmark is interesting because it was taken from the Split-C benchmarks. In some ways Split-C resembles a distributed shared-memory system (like Hyperion) and in other ways it resembles a message-passing system (like MPI). For example, Split-C comes with a distributed shared memory subsystem (like Hyperion). However, a programmer must explicitly denote a data-structure as being distributed across all of the processors (a "spread array", in Split-C terminology), and must use special operators to access this type of memory (somewhat like MPI³). See [3] for a complete description of Split-C.

³ Actually, this might also remind the reader of C∗.

Porting a benchmark written in Split-C to Java and C/MPI entailed converting every use of a Split-C feature to a corresponding Java and C/MPI feature. These conversions are presented in this section.

Figure 3-12 presents an example of a "spread" array in Split-C. PROCS is a language feature of Split-C—this is an integer that is set to the total number of processors running the program. This spread-array declaration declares an array of unsigned ints that is spread across all of the processors, with each processor in possession of a local segment of this array that is NP_MAX in size.

#define NP_MAX (1024*1024)

unsigned g_keys0[PROCS]::[NP_MAX];

Figure 3-12: Example Declaration of a "spread" Array in Split-C.

The obvious way to port code like this over to Java and C/MPI is to simply declare arrays of size NP_MAX on all of the participating processors and then convert any code that utilizes these arrays over to use whatever communication constructs are available.

Split-C offers two special operators: := (split-phase assignment) and :- (signaling-store). The split-phase assignment operation allows the programmer to signal to the Split-C runtime that an assignment needs to be performed, but that the confirmation that this operation is complete is not needed (yet). The program can call sync() to confirm that all of the split-phase assignments have been completed. By utilizing this operation, a programmer can overlap communication with computation.

The signaling-store operation causes the Split-C runtime to store a value on a remote processor and then signal the remote processor that the store operation has been performed. A processor may wait for these notification messages by calling splitc_poll().
Both of these operations allow the programmer to control computation and communication in a more precise (and possibly more efficient) manner. Java doesn't directly allow this type of control, but strategic use of synchronized blocks might allow a programmer to obtain it. There was some opportunity in the radix sort port to experiment with this in Hyperion, but due to other issues in the code another technique was used to express this computation/communication pattern. MPI does allow for this type of functionality through its extensive set of non-blocking operations. However, only blocking operations were used for the porting involved in this part of this thesis.

3.5.3 Java Implementation

The Java implementation of this benchmark was from the start structured in a way that was (for the most part) amenable to Hyperion's DSM characteristics. This was a fairly natural process. For example, data structures that were contained in "spread arrays" in the original Split-C code were natural candidates for distributing to each processor in the ported Java code. All of the special Split-C operators that were used on the original parallel data structures were ported in a fairly straightforward manner as well. It was actually handy to have clear indications in the code where the porting work needed to be performed. Hyperion (and other systems like it) offer a DSM subsystem, but non-local access is always expensive, and sometimes this non-local access isn't easy to spot. Split-C's method of handling memory does appear to have some merit — non-local access is allowed, but it needs to be declared explicitly.

3.5.4 C/MPI Implementation

The C/MPI implementation of this benchmark is similar to the Java version — "spread" data structures have been replaced by a distributed set of data structures that are local to each processor. The specialized Split-C communications operations that existed in the original Split-C code have been replaced with MPI communications operations.

Figure 3-13: Average Runtimes for the Radix Sort Benchmark (512 × 65535 = 33,553,920 keys).

3.5.5 Results and Performance

The results presented in this section were obtained from benchmark runs in which 33,553,920 (512 × 65535) random 32-bit keys were sorted. The radix that was used for these results was 8 bits in size. Initially a 16-bit radix was desired, but it was determined that the Hyperion system had insufficient memory for this scenario. Configuring this benchmark to use an 8-bit radix (as opposed to a 16-bit radix) doubles the number of passes that this benchmark needs to perform in order for the keys to become sorted. More passes translate into more network communication.

Currently, the Java version of this benchmark doesn't run correctly on the Star Cluster, due to a bug in Hyperion. This implementation does run correctly when run with other JVMs (for example, JVMs produced by Sun Microsystems and IBM), even in multi-CPU environments. A reproducible test case that exhibits this bug has been developed. The Java version of this benchmark remains unusable because the author of this thesis does not possess a detailed understanding of the internals of Hyperion and PM2.

Figure 3-14: Measured Speedup for the Radix Sort Benchmark (512 × 65535 = 33,553,920 keys).

The C/MPI version of this benchmark runs correctly and displays some degree of speedup. Results from this benchmark are displayed in Figure 3-14. These results do appear to be similar to those reported in [3]. The developers of the original Split-C radix sort never presented their performance numbers in terms of speedup; instead they presented their data in terms of execution time per key. The C/MPI version of this benchmark generally matches the performance reported in [3] — for larger problems, as the number of processors increases, the execution time per key decreases. This is shown in Figure 3-15.

Figure 3-15: Sort Time per Key for the Radix Sort Benchmark (512 × 65535 = 33,553,920 keys).
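The per-pass structure described in Section 3.5.1 can be sketched as follows. This is illustrative only and is not the benchmark source: localKeys is assumed to be this processor's slice of the key array, and the Redistributor interface is a hypothetical stand-in for the global histogram merge and key exchange that the Split-C, Java, and C/MPI versions each implement with their own communication facilities.

    // Sketch of one pass of the straight radix sort from Section 3.5.1 (b = 32, r = 8).
    public class RadixPassSketch {
        static final int R_BITS = 8;
        static final int BUCKETS = 1 << R_BITS;        // 256 possible digit values

        /** Hypothetical collective: merge per-bucket counts globally and return this
         *  processor's new slice of keys after the redistribution for this pass. */
        public interface Redistributor {
            int[] redistribute(int[][] bucketedKeys, int[] localCounts);
        }

        /** Performs the pass for digit number `pass` (0 = least-significant digit). */
        public static int[] onePass(int[] localKeys, int pass, Redistributor net) {
            int shift = pass * R_BITS;
            int[] counts = new int[BUCKETS];
            for (int key : localKeys)                  // local histogram of this digit
                counts[(key >>> shift) & (BUCKETS - 1)]++;

            // Bucket the local keys by digit value, preserving their relative order;
            // stability from pass to pass is what makes an LSD radix sort correct.
            int[][] buckets = new int[BUCKETS][];
            int[] fill = new int[BUCKETS];
            for (int d = 0; d < BUCKETS; d++)
                buckets[d] = new int[counts[d]];
            for (int key : localKeys) {
                int d = (key >>> shift) & (BUCKETS - 1);
                buckets[d][fill[d]++] = key;
            }
            // Global step (hypothetical hook): combine histograms across processors and
            // ship each bucket to the processor that owns its key range after this pass.
            return net.redistribute(buckets, counts);
        }
    }

Running b/r = 4 such passes, from the least-significant digit to the most-significant one, leaves the keys globally sorted across the processors, as described in Section 3.5.1.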
3.6 Minimal Cost Graph Coloring

Given an arbitrary connected graph consisting of nodes with arbitrary sizes, and given a set of colors that have associated with them a set of given costs, this benchmark attempts to determine the minimal cost coloring of the graph. Or, less formally, suppose we have a map of the contiguous 48 United States that we wish to paint. Suppose white paint costs $10 a can, green paint costs $20 a can, yellow paint costs $30 a can, and purple paint costs $40 a can. Obviously, two adjacent states should not be painted the same color. We wish to paint the map as cheaply as possible. So, we might hope to use some color other than purple to paint the state of Texas. The minimal cost graph coloring problem in this case is the problem of how to paint the map as cheaply as possible. This benchmark is a type of branch-and-bound problem, where the processors search through a combinatoric state-space in search of an optimal answer to a problem.

Both implementations of this benchmark presented here draw upon several sources:

• Both implementations of this benchmark have as their direct ancestor Professor Philip Hatcher's minimal cost graph coloring program.

• The original serial program was parallelized using Quinn's paper [15] as a guide. In particular, "asynchronous algorithm 4" was used for these implementations. In this approach to solving the minimal cost graph coloring problem, there is no centralized queue to store partial solutions, nor is there any synchronization between the processors while the optimal coloring is being searched for (and this is the sense in which this algorithm/implementation can be said to be "asynchronous"). Generally speaking, in this algorithm, as each processor generates sub-problems, it places these sub-problems in a local priority queue, removes the second-best sub-problem from the priority queue, and sends this problem to another processor in the cluster. While this algorithm has the potential to perform very efficiently in a parallel environment, the author of this thesis has concluded that in practice, programs that use this algorithm are more difficult to debug than usual parallel algorithms (because of the lack of synchronization). In addition, the asynchronous nature of this algorithm places a great deal of stress on the system runtime; in the course of developing this benchmark several bugs were found (and fixed) in the Hyperion runtime.

• Because there is no centralized priority queue in this implementation to store partial solutions to the minimal cost graph coloring problem, and because there is absolutely no synchronization between the processors involved in the search, termination detection became an issue. The solution to this problem is Dijkstra's elegant distributed termination algorithm [4]. By utilizing this algorithm, the processors involved in the search for the optimal graph coloring can efficiently and collectively conclude that the search for the optimal solution is complete. Dijkstra's algorithm is described in Figure 3-16; a brief sketch of the corresponding per-thread bookkeeping appears after this list.
•Hatcher’soriginalminimalcostgraphcoloringimplementationwasentirelybasedon 46 1.Whenthread0becomesidleitinitiatestheterminationdetectionalgorithm.2.Eachthreadmaintainsacountercount.Innormalprocessingeachthreadcansendmessagestootherthreadsinordertogivethosethreadsadditionalworktobedone.Foreverymessagethatathreadsendstosomeotherthread,countisincrementedbyone.Foreverymessagethatathreadreceivesfromsomeotherthread,countisdecrementedbyone.Countercountisinitializedtozero.Everythreadalsomaintainsavariablecalledcolor;thisvariablecantakeonthevalueblackorwhite.Allthreadsstartwithacolorvalueofwhite.3.Whenthread0initiatestheterminationdetectionalgorithm,ittransmitsa“termina-tiontoken”tothreadN−1.Thistokenhasitsowncountandcolorvalues,whichareinitializedwiththread0’scorrespondingvalues.4.Whenathreadreceivesamessagefromanotherthread(thusgivingthethreadmoreworktodo),thethread’scolortransitionstoblack.Whenathreadbecomesidleagain,thethread’scolortransitionstowhite.5.Uponreceiptoftheterminationtoken,ifathread’scolorvalueisblack,thecolorofthetokenischangedtoblack,otherwisethecolorofthetokenisleftunchanged.Eachthreadkeepstheterminationtokenuntilitbecomesidle.Beforethreaditransmitstheterminationtokentothreadi−1itaddsitscountvaluetotheterminationtoken’scountvalue.Afterathreadforwardstheterminationtokenthisthread’scolorvaluebecomeswhite.6.Uponreceiptoftheterminationtoken,thread0canconcludethattheallofthethreadsareidleifthread0isstillidle,theterminationtoken’scoloriswhite,andthetermi-nationtoken’scountiszero.Otherwise,atleastoneofthethreadsinthesystemisstillactiveandtheterminationdetectionalgorithmwillneedtoberunagain.Figure3-16:DescriptionofDijkstra’sDistributedTerminationDetectionAlgorithm. 47 breadth-firstsearch.Forlargerproblems,thissolerelianceonbreadth-firstsearchdidnotscale,becausethequeuesthatholdpartialsolutionstothecoloringproblemgrewexponentiallylarge.Thesebenchmarkswereaugmentedwithadepth-firstsearchatProfessorHatcher’ssuggestion.Thisdepth-firstsearchgetsinvokedinthefollowingtwosituations: 1.ifagivenpartialsolutionismorethan60%solved 2.ifthelocalqueuethatholdspartialsolutionsismorethan70%full. Adjustingthesetwoparametersmightleadtoenhancedperformance.Thiscouldbethesubjectoffurtherstudy. 3.6.1JavaImplementation TheJavaimplementationofthisbenchmarkresemblesamessage-passingprogram.Itusesjava.util.Vectorasamessage-passingmechanism—whenonethreadwantstopassamessagetoanotherthread,itsimplyconstructsthismessageandplacesitonthereceivingthread’smessageVector.Theminimalcostgraphcoloringcodeitselfdoesn’tperformanysynchronizationinthismessage-passingprocess—ifanysynchronizationisneeded,itisuptotheVectorclasstoprovideit. TwoissueswereencounteredinthedevelopmentoftheHyperionversionofthisbench-markthatsignificantlycomplicatedimplementation:theproblemofefficientlypassingacomplicateddata-structurefromoneprocessor’slocalmemorytoanother,andHyperion’slackofagarbage-collector.Intheformercase,carehadtobetakentoensurethatdatapassed(byreference,notbyvalue4)toanotherprocessor’sVectorobjectwasdeep-copied 4 ObjectsarepassedtomethodsbyreferenceintheJavaprogramminglanguage. 
3.6.1 Java Implementation

The Java implementation of this benchmark resembles a message-passing program. It uses java.util.Vector as a message-passing mechanism — when one thread wants to pass a message to another thread, it simply constructs the message and places it on the receiving thread's message Vector. The minimal cost graph coloring code itself doesn't perform any synchronization in this message-passing process — if any synchronization is needed, it is up to the Vector class to provide it.

Two issues were encountered in the development of the Hyperion version of this benchmark that significantly complicated the implementation: the problem of efficiently passing a complicated data structure from one processor's local memory to another, and Hyperion's lack of a garbage collector. In the former case, care had to be taken to ensure that data passed (by reference, not by value; see footnote 4) to another processor's Vector object was deep-copied upon removal from the Vector (via java.lang.Object.clone()). In the latter case, because minimal cost graph coloring is a dynamic problem and memory has to be allocated constantly to hold partial solutions, a small memory-management system had to be developed and deployed in order for this benchmark to be usable with larger problems. This memory-management system essentially consists of pre-allocating a pool of the structures that represent partial solutions in this implementation and then ensuring that all of these structures are allocated and deallocated from this pool instead of the general heap. Because each thread allocates its own pool and never uses structures from another thread's pool (except to copy a structure), this scheme is efficient in the Hyperion environment.

Footnote 4: Objects are passed to methods by reference in the Java programming language.

3.6.2 C/MPI Implementation

The C/MPI implementation of this program uses messages to pass state information around as the program is running. The partial-solution object that this benchmark passes around is itself a complex data structure, but MPI has support for declaring and passing around such objects. This is a useful feature which might not be available in other message-passing libraries. This program is very similar to the Hyperion implementation — partial solutions and best-so-far solutions are propagated via messages in an asynchronous manner.

3.6.3 Results and Performance

Both implementations of this benchmark have been greatly improved from their initial versions. Older versions of these benchmarks could only work with very small problems. The addition of a depth-first search has made it possible for these benchmarks to work with very large problems, and the introduction of memory-management code to the Hyperion version of this benchmark also helps.

Figure 3-17: Average Runtimes for the Minimal Cost Graph Coloring Benchmark (Graph: 35 Contiguous States in the Continental United States, Starting in Maine).

Figure 3-18: Measured Speedup for the Minimal Cost Graph Coloring Benchmark (Graph: 35 Contiguous States in the Continental United States, Starting in Maine).

The performance results of these benchmarks when run against a particular graph coloring problem are presented in Figure 3-18. It should be noted that this is a combinatoric search problem; therefore superlinear speedup is possible. The C/MPI version of this benchmark displays speedup — some of it is even superlinear.

The Hyperion version of this benchmark clearly lags behind the C/MPI version. The reason for this is unclear, but we can make some educated guesses. Quinn's algorithm [15] does require that each processor send and receive a lot of messages. For example, every time a new best partial solution is found, this partial solution is broadcast to all processors. This leads to a lot of "message storms," and sending and receiving messages in Hyperion is known to be expensive. This benchmark works all of the processors particularly hard — each processor spins in a loop that attempts to solve partial solutions from the local priority queue while dealing with sending and receiving messages to and from other threads in the cluster. Sending, receiving, and even checking for the availability of incoming messages all require calling synchronized methods, and calling these operations in a loop certainly must be expensive. It seems likely that Quinn's algorithm, developed on a parallel computer with relatively slow CPUs and a relatively fast, low-latency network, does not translate well to the Hyperion system, with its fast CPUs but high-latency message-passing. In summary, Quinn's algorithm does not seem to translate well to the Hyperion environment — too much communication and not enough computation.
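The worker loop just described has roughly the following shape. This is a schematic sketch rather than the benchmark source: every type and helper in it (PartialSolution, Mailbox, and the abstract methods) is a hypothetical stand-in for the corresponding piece of the actual program.

    import java.util.List;
    import java.util.PriorityQueue;

    // Schematic of the per-processor search loop described above.
    public abstract class WorkerLoopSketch {

        public interface PartialSolution extends Comparable<PartialSolution> {
            double bound();            // lower bound on the cost of any completion
            boolean isComplete();      // true once every node has been assigned a color
        }

        public interface Mailbox {     // stands in for the Vector-based message passing
            PartialSolution poll();                    // a synchronized call in Hyperion
            void sendToNeighbor(PartialSolution p);    // a synchronized call in Hyperion
        }

        protected abstract boolean terminationDetected();  // Dijkstra's algorithm, Figure 3-16
        protected abstract double bestKnownCost();
        protected abstract void broadcastNewBest(PartialSolution s);
        protected abstract List<PartialSolution> expand(PartialSolution s);

        public void search(PriorityQueue<PartialSolution> localQueue, Mailbox mailbox) {
            while (!terminationDetected()) {
                // Drain incoming work; every poll() is one of the synchronized calls
                // that the preceding paragraph identifies as expensive under Hyperion.
                PartialSolution incoming;
                while ((incoming = mailbox.poll()) != null)
                    localQueue.add(incoming);

                PartialSolution best = localQueue.poll();
                if (best == null || best.bound() >= bestKnownCost())
                    continue;                            // nothing worth expanding right now

                for (PartialSolution child : expand(best)) {
                    if (child.bound() >= bestKnownCost())
                        continue;                        // prune: cannot beat the best so far
                    if (child.isComplete())
                        broadcastNewBest(child);         // the source of the "message storms"
                    else
                        localQueue.add(child);
                }
                // Quinn's asynchronous algorithm: hand the next-best sub-problem to
                // another processor instead of keeping all of the work locally.
                PartialSolution giveAway = localQueue.poll();
                if (giveAway != null)
                    mailbox.sendToNeighbor(giveAway);
            }
        }
    }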
It is not impossible to write an efficient minimal cost graph coloring implementation for the Hyperion system: for example, Hatcher has an algorithm that creates a large number of worker threads (64 — more than the 16 physical processors available in the Star Cluster). In this algorithm, the best known solution is kept in a central location on Processor 0. Whenever a processor finds a solution that is better than the best currently known solution, it updates the best known solution on Processor 0. Each processor keeps its own copy of the best known solution so far, but instead of sending messages every time a new best solution is found, each thread just updates its local copy of the best solution so far every 1000 state-space expansions. The resulting implementation undoubtedly generates far fewer messages than the implementation of Quinn's algorithm presented here. Hatcher's Hyperion implementation is competitive with the C/MPI version of Quinn's algorithm presented here.

Chapter 4

Hyperion Assessment

4.1 Lessons learned from porting code to Hyperion

The previous chapter of this thesis presents benchmarks and results that illustrate the performance of the Hyperion system. The aim of this chapter is to further illustrate the issues that a programmer might encounter when porting their parallel Java code to the Hyperion system. In addition, some recommendations are presented for implementors who wish to enhance Hyperion, or for those who wish to implement their own DSM subsystem.

4.1.1 Memory Management

The overwhelming majority of issues that a programmer will encounter when porting their code to the Hyperion system are related to memory management. These issues are discussed here.

4.1.1.1 Ownership

In terms of lessons learned with respect to memory management issues, one lesson was reinforced in all of the benchmarks discussed in this thesis: in programming there are several different notions of the concept of "ownership," and Hyperion's DSM rules add yet another slightly orthogonal notion to this collection. For example, in the context of programming, if object A "owns" object B, this might imply any of the following:

• Object B lies within the memory segment that as a whole comprises Object A.

• Object A has a pointer/reference to Object B and wields absolute control over this object. Object A doesn't let any other object in the system access Object B except under very controlled situations.

• Object A has a pointer/reference to Object B and has agreed to deallocate/destroy Object B when it is no longer needed in the program.

• (Hyperion DSM definition) Object A is a Java thread and Object B was created by this thread. Object B is contained within the memory associated with the processor upon which Object (thread) A is running. Other threads running on other processors in the cluster won't have local and efficient access to Object B.

All of these notions are very similar, but each has its differences as well. The last notion of ownership will be new to programmers who are approaching a DSM system such as Hyperion's for the first time. A programmer can end up with an inefficient parallel program if he or she doesn't keep these concepts clearly in mind when porting Java code to Hyperion. The particular details of how Hyperion implements memory ownership might also trip up the programmer. These issues are examined in detail in the next few pages.
To illustrate how a programmer might end up with an inefficient Hyperion program, please refer to Figure 4-1. The code presented in this figure is a very reasonable approach to coding a multithreaded application in Java. In this example, the programmer has defined a class that implements Java's Runnable interface. Each instance of ParCompSolver is associated with a newly created Java thread, and each instance of ParCompSolver has a number of member variables associated with it that it uses during its computation.

     1  public class ParCompSolver implements Runnable
     2  {
     3      static final int NUM_THREADS = 16;
     4
     5      static ParCompSolver pcsArr[];       // !!!
     6      static Thread threadArr[];
     7
     8      int m_whoami, m_x[];
     9      int m_y[] = new int[20];             // !!!
    10      SomeClass m_sc;
    11
    12
    13      ParCompSolver(int whoami)
    14      {
    15          m_whoami = whoami;
    16          m_x = new int[1024];             // !!!
    17          m_sc = new SomeClass();          // !!!
    18      }
    19
    20      public void run()
    21      {
    22          m_x[0] = 6;                      // !!!
    23
    24          do_parallel_computation();
    25      }
    26
    27      public static void main(String argv[])
    28      {
    29          pcsArr = new ParCompSolver[NUM_THREADS];
    30          threadArr = new Thread[NUM_THREADS];
    31
    32          for (int i = 1; i < NUM_THREADS; i++) {
    33              pcsArr[i] = new ParCompSolver(i);
    34              threadArr[i] = new Thread(pcsArr[i]);
    35          }
    36          ...
    37      }
    38  }

The example program presented in Figure 4-1 suffers from a number of memory problems when run in the Hyperion system. First, it must be understood that the main() method that this class defines runs in the main thread, which runs on Processor 0. When the main() routine runs, it starts by creating other ParCompSolver objects (on line 33), and then by creating threads to go along with these objects. However, the creation of ParCompSolver objects is exactly where this class starts to run into trouble with Hyperion. Recall that in Hyperion threads "own" the memory that they allocate. This implies that the newly created threads, when they reach line 22, are actually accessing memory allocated by Processor 0. This inefficiency is pictorially illustrated in Figure 4-2. Since this memory is generally non-local (see footnote 1), any access to the object's own members will be slow. This example helps illustrate how confusion over two different notions of "ownership" can lead to inefficiency in a parallel program.

Hyperion Programming Rule #1: The fact that a class implements Java's Runnable interface does not imply that members of instances of this class can be accessed in a Hyperion-efficient manner. In Hyperion, memory allocated by one thread is local to the processor on which that thread is running. This includes memory allocated for the purpose of containing instances of classes that implement the Runnable interface, or instances of classes that extend the Thread class. Hyperion programmers must ensure that their code accesses memory in an efficient manner.

Footnote 1: In the Hyperion system it is possible to create more threads than there are processors in the cluster. In this case, some of the processors in the cluster will be running more than one thread at runtime. Two threads running on the same physical processor can access each other's heap memory without incurring network communication, and this access is very efficient. None of the benchmarks in this thesis use this feature. This subject is also discussed in Section 3.6.3.

Figure 4-2: Illustration of an Inefficient Hyperion Memory Use Pattern (1).
Another example of a memory problem that Java programmers encounter with Hyperion is presented on line 9 of Figure 4-1. Java allows programmers to specify how class members are initialized at the place in which these variables are declared (and typographically outside of the class constructor(s)). In the context of Hyperion, such an initialization is equivalent to placing the corresponding code into the object's constructor. Because such code leads to exactly the same situation that was discussed in the previous two paragraphs, the use of this feature in Hyperion can cause the same sort of inefficiency as described there.

Hyperion Programming Rule #2: Java programmers cannot use variable initializers for class variables that allocate memory in classes that extend Java's Thread class or implement Java's Runnable interface. Instead, programmers must find some other way to allocate memory in a Hyperion-efficient manner.

Methods that allow the programmer to continue to use Java/Hyperion in both an object-oriented and efficient manner are discussed later in this chapter.

4.1.1.2 Object Caching

The code contained in Figure 4-1 has some other qualities that inhibit the program from running optimally in the Hyperion environment. On line 5 a static array of ParCompSolver object references is declared. An array like this allows all of the ParCompSolver objects participating in the parallel computation to reference each other. For example, if one ParCompSolver object wanted to send data to another ParCompSolver object, this array would be the starting point for one object to find the other object. The problem with a static array like this in the Hyperion environment lies in the fact that the memory for such an array resides on a single processor (see footnote 2). During runtime, a static array like this is likely to get used often, and frequent access to such an array can result in a bottleneck. In fact, frequent access to any static variable can be a bottleneck. This results in a few more guidelines for optimizing programs that are ported to Hyperion:

Hyperion Programming Rule #3: If possible, all static variables should also be final. This effectively makes these variables constant, and the compiler will optimize the access to these values.

Hyperion Programming Rule #4: If it is impossible to comply with the preceding rule, then, if possible, a local cache of the static variables should be established by each thread. The purpose of these cached variables is to ensure efficient access to these variables. This technique is illustrated in Figure 4-3.

Footnote 2: Static variables get distributed in a round-robin order on processors in the cluster.

     1  public class ParCompSolver implements Runnable
     2  {
     3      static int NUM_THREADS = 16;
     4      static ParCompSolver pcsArr[];
     5      static Thread threadArr[];
     6      static Barrier barrier;
     7
     8      int local_NUM_THREADS;
     9      ParCompSolver local_pcsArr[];
    10
    11      ParCompSolver(int whoami) { ... }
    12
    13      public void run()
    14      {
    15          local_NUM_THREADS = NUM_THREADS;   // cache
    16          local_pcsArr = new ParCompSolver[local_NUM_THREADS];
    17
    18          // needed to prevent race condition -- don't access
    19          // pcsArr[] before main thread is done creating all threads
    20          barrier.barrier();
    21
    22          for (int i = 0; i < local_NUM_THREADS; i++)
    23              local_pcsArr[i] = pcsArr[i];   // cache the object references locally
    24
    25          do_parallel_computation();
    26      }
    27
    28      public static void main(String argv[])
    29      {
    30          pcsArr = new ParCompSolver[NUM_THREADS];
    31          threadArr = new Thread[NUM_THREADS];
    32          barrier = new Barrier(NUM_THREADS);
    33
    34          for (int i = 1; i < NUM_THREADS; i++) {
    35              pcsArr[i] = new ParCompSolver(i);
    36              threadArr[i] = new Thread(pcsArr[i]);
    37          }
    38          ...
    39      }
    40  }

4.1.1.3 Problems with Object References

The program presented in Figure 4-3 is not completely optimized. On line 6 a static Barrier object is declared. The reference to this object (not the object itself) is not locally cached, which means that every time a given thread wishes to access this Barrier object, it will first have to obtain the actual reference to this object, and then use this reference to navigate to the actual object. For most of the threads in the cluster, both the object reference and the object itself will exist on a remote processor. If this object reference had been cached, then each thread would have been able to access the object reference in its own local memory (which is efficient) and then navigate directly to the underlying static object. Developing an object-reference cache for static objects isn't likely to be a huge optimization, but sections of time-critical code that frequently access static variables in this manner might benefit from it.
The larger point in the preceding example is that Hyperion and its DSM subsystem will silently allow programs to access memory inefficiently. Currently, the best way to detect subtle problems like this is by auditing the code and by turning on a feature of Hyperion called "instrumentation." Instrumentation output can give a programmer a high-level overview of how the running program is using local and non-local memory, as well as of the program's use of locks. With some detective work, the programmer can often find the source of a bottleneck by examining the instrumentation output. For example, a programmer who is looking through the instrumentation output might notice that for every access to a static variable, two remote memory reads are performed. This might be a hint to the programmer that a subtle problem like the one described in the preceding paragraph is present in the code. Languages like Split-C and environments like MPI have some protections that help prevent this sort of inadvertent access — in Split-C a programmer must explicitly mark a variable as being "global," and in MPI there is no distributed sharing. Java does have one very important feature that can be exploited to minimize this type of inadvertent sharing — Java is an object-oriented language. Recall that in object-oriented languages, access to each object's members can be controlled rather precisely. This means that if the programmer is programming in an object-oriented style, it will never be necessary for one object to have deep references into another object in order to manipulate that object — the object can be forced into using the other object's methods (class functions). In particular, it is easy to look at a section of code that reads:

    local_pcsArr[whoami + 1].someMethod(x, y, z);

...and conclude that some sort of network communication is taking place. On the other hand, suppose for the sake of argument that in the following statement, x[] resides on a different processor than the one that is executing the statement:

    x[28] = 496;

Such a statement would be much more difficult for a programmer to spot by merely reading the code.

Hyperion Programming Rule #5: The fact that Java is an object-oriented language can be exploited to help tune programs to run more efficiently in Hyperion. Java programmers should use the object-oriented features of the language to disallow classes from having a deep knowledge of the internal structure of other classes. Structuring code in this manner will aid Hyperion programmers in the event that they need to closely examine how and where network communications are taking place.

While working on the benchmarks for this thesis, this author proposed that the Hyperion runtime should be changed such that any memory allocated for the purpose of containing
ThecodeinFigure4-3canbeoptimizedevenfurtherfortheHyperionenvironment.Online16theprogrammerhasallocatedmemoryfromtheobject’srun()method.Sincethisisthefirstmethodthateachnewthreadruns,anymemoryallocatedinthiscontextisallocatedinthisthread’slocalmemorysegment.Sinceitistheintentoftheprogrammertousethismemoryentirelylocallytoeachthread,thismemoryisallocatedinanefficientmanner.However,thesubtleprobleminthiscodeisthatalthoughthememorypointedtobythereferencelocal_pcsArrisallocatedinaHyperion-efficientmanner,thereferenceitself(local_pcsArr)isnot.ThisinefficiencyispictoriallyillustratedinFigure4-4.Infact,thememorythatcontainsthereferencelocal_pcsArrisactuallycontainedinasingleprocessorsomewhereinthecluster(recallHyperion’scardinalruleofmemorymanagement:memoryisownedbythethreadthatallocatesthatmemory).Thisimpliesthataccesstothereferencelocal_pcsArrissubjecttoallofthepropertiesofHyperion’sDSMsubsystem:thismemoryislikelytobenon-local,cachingofthismemoryisperformedbutsometimesthecachegetsdirty,etc.Dependingontheprogram’scircumstances,thismightbeabottleneckintheprogram’sperformance. 62 Processor 0Thread 0Processor 1Thread 1Use of\"member\" objectNetworkObject ReferenceObjectFigure4-4:IllustrationofanInefficientHyperionMemoryUsePattern(2). Solutionstothepotentialmemorybottleneckproblempresentedbytheprecedingpara-grapharecomplicatedbyanumberoffactors.First,anyprogramfromwhichadequateparallelperformanceissoughtwilllikelybecomprisedofmorethanonesubroutine.Second,Javaisanobject-orientedlanguageandprogrammersarelikelytouseitinthismanner—theprogrammerisapttocreatelogicalclassesandmethodsasabstractionswithwhichtomanipulatetheirdata.Thethirdcomplicatingfactoristhefactthattherearenopointerstosimplevariables(thosewithtypesint,float,double,etc.)inJava. ConsiderthecodepresentedinFigure4-5:inthisexample,theprogrammerhasstartedtheprocessoftryingtorestructuresomeexistingcodeinordertoeliminatethebottle-neckthattheprogramexperiencesasitaccessesvar1.TheprogrammerhasmodifiedtheParCompSolverclasstonolongerhaveanyclassmembers.Nowthatthemembersofthisclassareallocatedonthethread’sstack,thereisnopossibilityofabottleneckonProcessor0.Unfortunately,thistechniquedoesnotworkinthegeneralcase,becausethedesignofJavadoesnotallowthemethodscomputation_part1()andcomputation_part2()tomodifyvar1.InmanyRealWorldimplementationsofparallelalgorithms,havingaccesstosuchside-effectsisrequired.Figure4-6presentsapossiblestrategyforaddressingthisproblem; 63 thisentailsturningordinaryscalarvariablesintosingleelementarraysandpassinginevery“member”variableintoeverymethoddefinedbytheclass.Whilethismethodmightworkforasimpleclasswithfewmembersandmethods,itshouldbeobviousthatsuchaschemeiserror-prone,tedious,andhard-to-maintain. 
The solution that the author of this thesis implemented in order to address this particular object reference problem can be described as follows: a new class called ParallelHarness was created (see Figure 4-7 for an overview of this class). The sole purpose of this class is to create threads, which in turn create objects like ParCompSolver. Using the ParallelHarness class entirely solves this particular object reference problem, because each reference and the object that the reference points to (such as a ParCompSolver) is properly allocated by the ParallelHarness class on the stack of the thread in which each object runs. Classes like ParCompSolver no longer have to be burdened with the code that actually creates threads. In addition, very little modification is required to objects like ParCompSolver — certainly nothing like the scheme outlined in Figure 4-6. The Hyperion-efficient memory layout that the ParallelHarness class makes available in a convenient manner is illustrated pictorially in Figure 4-8.

Hyperion Programming Rule #6: While Java makes it possible for a class like ParCompSolver to define a static main() routine and for ParCompSolver objects to run as threads, a level of indirection in the style of ParallelHarness can provide similar efficiency for a small amount of effort. The ParallelHarness class creates threads and then allocates memory for "worker" objects (classes like ParCompSolver) in a Hyperion-efficient manner. The programmer doesn't have to modify all of their "worker" classes to allocate their own memory on their thread stacks and local heap.

    public class ParCompSolver implements Runnable
    {
        // no members of this class, since references to these
        // variables will get allocated/bottlenecked at
        // Processor 0

        public void run()
        {
            // var1 is used for our parallel computation;
            // routines like computation_part1 and
            // computation_part2 use and manipulate this variable
            // during the computation process.
            // var1 and workArray are allocated on the stack in order
            // to prevent a bottleneck from this variable being
            // allocated in Processor 0.
            int var1;
            float workArray[][] = new float[...][...];

            // Unfortunately, the following two calls to
            // computation_part# don't work because there is no
            // way in Java for these routines to modify var1!
            computation_part1(var1, workArray);   // side-effects!
            computation_part2(var1, workArray);   // side-effects!
        }

        void computation_part1(int v, float w[][])
        {
            ...
            v = 1;
            // Note: since workArray is no longer a member of
            // ParCompSolver there is no way to access it here
            // unless it is passed in as an argument.
            ...
        }

        void computation_part2(int v, float w[][])
        {
            ...
            // similar body to computation_part1()
            ...
        }
        ...
    }

Figure 4-5: A Start at Moving a Thread's Variables to the Stack.

    public class ParCompSolver implements Runnable
    {
        public void run()
        {
            int var1[] = new int[1];   // used to be just 'int var1'
            float workArray[][] = new float[...][...];

            computation_part1(var1, workArray);   // side-effects!
            computation_part2(var1, workArray);   // side-effects!
        }

        void computation_part1(int v[], float w[][])
        {
            ...
            v[0] = 1;
            ...
        }

        void computation_part2(int v[], float w[][])
        {
            ...
            // similar body to computation_part1()
            ...
        }
        ...
    }

Figure 4-6: A Sub-Optimal Way To Move Variables to a Thread's Stack.

    public class ParallelHarness implements Runnable
    {
        static int numProcessors;
        static Thread threadArray[];
        static ParallelHarness phArr[];
        int whoami;

        public static void main(String argv[])
        {
            numProcessors = hyperion.Runtime.numberOfNodes();

            // this allows ParCompSolver to initialize
            // any required static structures
            ParCompSolver.static_ctor(numProcessors);

            threadArray = new Thread[numProcessors];
            phArr = new ParallelHarness[numProcessors];

            for (int i = 0; i < numProcessors; i++) {
                phArr[i] = new ParallelHarness(i);
                threadArray[i] = new Thread(phArr[i]);
            }
            for (int i = 1; i < numProcessors; i++)
                threadArray[i].start();

            phArr[0].run();   // the main thread participates here too

            for (int i = 1; i < numProcessors; i++) {
                try { threadArray[i].join(); } catch (InterruptedException e) { }
            }
        }

        public ParallelHarness(int w) { whoami = w; }

        public void run()
        {
            // notice that pcs and the object that pcs points to are
            // allocated on this thread's stack.
            // This is Hyperion-efficient!
            ParCompSolver pcs = new ParCompSolver(whoami, numProcessors);
            pcs.run();
        }
    }

Figure 4-7: A Simplified ParallelHarness Class.

Figure 4-8: Illustration of an Efficient Hyperion Memory Use Pattern.

4.1.1.4 Other Memory Management Issues

At least some of Java's popularity is due to the rich, well-designed class libraries that nearly always accompany the base language. For example, programmers appreciate the fact that they don't have to provide their own Hashtable class — the standard class library already provides a generic one. This can greatly speed development.

Parallel programmers who are implementing their code in Java (with the goal of running their code via Hyperion) might be inclined to use the standard class libraries that accompany Java. For example, the author of this thesis initially made use of the java.util.Vector class in order to pass objects from one thread to another. This was convenient, because the standard class libraries that come with Java are thread-safe. There was no need to cobble together a thread-safe object-passing mechanism because the system already came with one, complete with extensive documentation.

Unfortunately, none of the Java class libraries were originally implemented with Hyperion's memory properties in mind. Worse, in typical installations the source code to these libraries isn't even available, so it is very difficult to audit this code for Hyperion-efficiency.

For example, vector classes are typically implemented internally as arrays. Java's Vector class is expandable, so when the vector gets filled up, some special processing needs to occur. Typically, what happens when a vector fills up its internal array is that the vector allocates a new, larger array, copies all of the entries from the old array to the new array, and then discards the old array. Overall, vectors are efficient because they eliminate the need to preallocate enough memory to hold the largest number of objects that the program needs to store, there is no need for the overhead of a linked list, and new memory allocations occur only rarely. Unfortunately, because vector classes typically allocate memory at certain points in their lifetimes, this makes them unsuitable for use in the Hyperion environment. For example, if a thread running on Processor 3 inserts an object into a Vector that resides on Processor 2 and this action causes a memory allocation, the new memory will be allocated from the Processor 3 pool. Without warning, the Vector residing on Processor 2 will suddenly (and inefficiently) start using memory on Processor 3! It is not hard to envision that this type of behavior could soon become a significant and hard-to-find source of inefficiency.

Hyperion Programming Rule #7: Java programmers who are porting their code to Hyperion must not use any class whose memory use and allocation patterns are not completely understood and in line with Hyperion's memory properties. This includes any class from the standard Java class libraries. Instead, only classes which use memory in a Hyperion-efficient manner should be used.

4.1.2 Other

In the course of this project the author spent a considerable amount of time debugging and optimizing code so that it would run efficiently on the UNH Star Cluster. At various points in the development of the C/MPI benchmarks for this thesis, the following development tools were used:

• gdb/DDD — debugger and graphical debugger.
• gprof/Quantify — execution profiling tools.
• Purify/Valgrind/Mpatrol — memory debugging tools.
• ifconfig/tcpdump/Ethereal — network diagnostic tools.

These tools weren't used very often, but occasionally they proved to be extremely handy.
In contrast, the only debugging and diagnostic facilities that were available for use with Hyperion were print statements, "ifconfig," and a feature of Hyperion called "instrumentation." Instrumentation output can be post-processed to tell the programmer how many times non-local memory was used, how many times locks were acquired/released, and the average/maximum times for each of these to occur. It is certainly possible to debug and tune programs with these tools. However, it might seem strange that these are the only tools available, since it is known that Hyperion compiles Java code to Java bytecode, translates the bytecode to C code, and then compiles that C code into executable programs. If the long list of aforementioned tools is available for C/MPI, why are these tools unavailable for use in the Hyperion environment?

The answer to this question has to do with the fact that Hyperion's runtime doesn't use the host's native threading libraries (in this case, POSIX threads). Instead, Hyperion's runtime uses a custom thread library from the PM2 package. The rationale for this is that, in terms of performance, this custom threading library is more efficient than the native (POSIX) threading library. Programs linked against the custom PM2 threading library cannot be easily used in conjunction with any of the aforementioned debugging/profiling tools, since most of these tools depend upon the program's stack being arranged in a standard manner.

The author understands that PM2 can be configured to use the host system's native thread libraries, but that this configuration is only supported when the system as a whole is run on a single node — not in a cluster environment. The author feels that updating PM2 to have the capability to use the host system's native threading library while operating in a cluster environment would be a very useful feature to add. This single addition to PM2 would go a long way towards making these useful debugging/profiling tools available for systems like Hyperion.

Chapter 5

Conclusions

The goal of this thesis is to evaluate the performance of Hyperion by comparing parallel Java programs run in the Hyperion environment to parallel C programs that utilize the MPI API. This thesis details the results of running five benchmark programs in each of these environments, and includes descriptions of what changes were made to each benchmark implementation in order to improve its performance. From all of these observations, we can make a number of conclusions.

In some cases, the performance of the benchmarks when run in each environment is comparable. This can be seen in both the Gaussian elimination and matrix multiplication benchmarks. In other cases the Hyperion benchmarks clearly lag behind their C/MPI counterparts. This can be seen in both the shallow-water and minimal cost graph coloring benchmarks.

Programmers who wish for their parallel Java programs to run efficiently in Hyperion must structure their programs very carefully. Both of the preceding two chapters discuss the many ways in which a Java program written outside of the Hyperion environment might have to be restructured in order to run efficiently under Hyperion. Java touts itself as being a "write once, run anywhere" language, but it seems safe to conclude that nearly any Real World parallel Java application would have to be restructured in some way if it is to be run
Manyresearchgroupshaveassertedthatwritingparallelprogramsinashared-memoryenvironmentiseasierthanwritingprogramsinamessage-passingenvironment[16][8].Inthisauthor’sopinionthisassertionhassomemerit:thisauthorfounddealingwithallofthedata-dependenciesinthemessage-passingversionoftheshallow-watercodetobeverycomplicated.However,thisassertionseemstocarrylessweightinaDSMenvironment:inordertoachievereasonableperformance,efficientDSMprogramstendtostartlookingverysimilartotheirmessage-passingcounterparts.AlloftheHyperionbenchmarksdis-cussedinthisthesishavethisproperty.Inbothmessage-passingandinDSMenvironmentsprogrammershavetopaycloseattentiontohoweachprocessoruseslocalandnon-localmemory. Hyperionwascreatedforuseoncommodityclustersofcomputers.Someofthebench-marksdiscussedinthisthesissufferedduetherelativeslownessofthenetworkascomparedtotheperformanceoftheCPUsintheStarCluster.ThereaderisremindedthatinReno’sthesis[16]itwasshownthatthemovetotheParaskiClusterwithitsdedicated2Gb/sMyrinetnetworkhadalargeeffectontheperformanceofsomeofthebenchmarksdis-cussedinthatthesis.Someofthebenchmarksdiscussedinthisthesis,mostnotably,theHyperionversionoftheminimalcostgraphcoloringbenchmark,verylikelywouldhavebenefitedwithaccesstoahigher-performancenetwork.Suchatestwasneverperformed.However,whenrunonthesameclusterhardware,theC/MPIversionsofthesebench-marksperformedreasonably.Whatisthereasonforthis?Thisisadifficultquestiontoanswer,butonestrikingdifferencebetweenthesetwosystemsisthefactthatinterprocessorcommunicationinC/MPIsimplyinvolvesplacingbytesonTCP’sreliablestream,whereasinterprocessorcommunicationinHyperionfrequentlyinvolvessynchronizedblocks,mem- 73 oryconsistancyupkeep,andverificationthatdatathatwassentarrivedatitsdestinationsafely.Surelythisdifferencemustbetheunderlyingcauseforasignificantamountoftheperformancedisparitybetweenthesetwosystems. Inthecourseofdevelopingthebenchmarksforthisthesis,thisauthordevelopedmanydifferentimplementationsofsendrecv().ThefirstimplementationofthiscommunicationspatternwasimplementedusingthenativeJavafunctionalityofsynchronized,wait(),andnotifyAll().Benchmarkimplementationsthatutilizedthisversionofsendrecv()(shallow-water,matrixmultiplication)sufferedfromextremelypoorperformanceandgen-eratedahugeamountofnetworktraffic.Theperformanceofbenchmarksthatrelieduponsendrecv()improvedgreatlyafterthiscommunicationsoperationwasrewrittentoutilizeHyperion-specificsynchronizationmethods(e.g.Reduction.sum()).TheauthorneverwasabletodeterminetherootcauseoftheproblemswiththeuseofHyperion’simplementa-tionsofsynchronized,wait(),andnotifyAll()inthesebenchmarks.However,becauseoneofthedesigngoalsofHyperionistoefficientlysupportnativeJavaapplications,andbecausetheauthorbelievesthattheoriginalsendrecv()implementationutilizedthesesynchronizationmethodsinafairlycommonmanner(describedin[9],forexample),theauthorfeelsitnecessarytomentionthatsomedifficultieswereencounteredinthisarea.Itisthehopeofthisauthorthatfutureresearcherscandeterminetherootcauseoftheproblemhere. 
Efficient DSM programs tend to evolve towards being similar in structure to message-passing programs. During this evolution one of the weaknesses of DSM systems becomes apparent: DSM systems tend not to have rich communications facilities for passing messages. This only makes sense, since DSM systems are not message-passing systems. In the process of optimizing the Hyperion benchmarks for this thesis, this author frequently had the task of cobbling together a communications pattern for use in a DSM environment, a pattern which was already implemented in MPI. MPI supports a wide variety of blocking/nonblocking, point-to-point and collective-communication operations that operate on any datatype that the user wishes to use. The point here is not whether these operations are more efficient than anything that could be implemented in a "pure" DSM environment; rather, the point is that it is rather handy for a programmer that these routines are already implemented. This fact only becomes more apparent when there is call for some sort of "collective" communication operation. If an operation like those listed in Figure 2-4 were required in a Hyperion program, the operation itself would have to be designed and implemented in addition to the benchmark being optimized for Hyperion. Efficiently implementing some of these operations would entail a lot of work.

Hyperion uses the DSM-PM2 library for its thread and shared-memory functionality. In particular, DSM-PM2 eschews POSIX threads because of their perceived inefficiency with signal handling. Eliminating this inefficiency is certainly an admirable design goal; however, it also has a side-effect: the use of a non-standard thread package basically eliminates the possibility of using any standard debugging tool in conjunction with Hyperion. Writing parallel programs is difficult, and occasionally problems do crop up. It is not sufficient to be able to run debugging tools in a more feature-rich development environment (running on a single workstation) and then subsequently run the program in question via Hyperion — sometimes the availability of such tools on the parallel machine itself is critical. This author found that being able to use such tools in the C/MPI environment was, on occasion, extremely handy. It would be very useful for Hyperion/DSM-PM2 users to be able to use such tools. If the DSM-PM2 library were enhanced so that it could be configured to use POSIX threads in a cluster environment, this would be a boon to programmers who use these systems.

All of these points aside, Java is a deservedly popular programming language, and one of the conceivable uses for this language is parallel programming. For programmers who wish to use Java to run their parallel computations on clusters of commodity computers, Hyperion is a good JVM to use. It is a credit to the Java community, and especially to the creators of Hyperion, that this JVM's performance can rival that of much older, highly tuned message-passing systems.

Bibliography

[1] Gabriel Antoniu, Luc Bougé, Philip J. Hatcher, Mark MacBeth, Keith McGuigan, and Raymond Namyst. The Hyperion System: Compiling Multithreaded Java Bytecode for Distributed Execution. Parallel Computing, 27(10):1279–1297, 2001.

[2] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Introduction to Split-C. Technical report, University of California, Berkeley, 1993.

[3] D. Culler, A. Dusseau, R. Martin, and K. Schauser. Portability and Performance for Parallel Processing, chapter 4: Fast Parallel Sorting under LogP: from Theory to Practice, pages 71–98. John Wiley & Sons Ltd., 1994.

[4] Edsger W. Dijkstra, W. H. J. Feijen, and A. J. M. van Gasteren. Derivation of a termination detection algorithm for distributed computations. Inf. Proc. Letters, 16:217–219, 1983.

[5] J. Gosling, W. Joy, and G. Steele Jr. The Java Language Specification. Addison-Wesley, Reading, Massachusetts, 1996.

[6] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum.
A high-performance, portable implementation of the MPI message passing interface standard. Technical report, Argonne National Laboratory, 1996.

[7] G. Hoffman, P. Swarztrauber, and R. Sweet. Aspects of using multiprocessors for meteorological modelling. In Proceedings, Workshop on Multiprocessing in Meteorological Modelling, pages 125–196, 1988.

[8] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the Winter 1994 USENIX Conference, pages 115–131, January 1994.

[9] Doug Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading, Massachusetts, 1996.

[10] Honghui Lu, Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. Message passing versus distributed shared memory on networks of workstations. 1995.

[11] Message Passing Interface Forum. Document for a standard message-passing interface. Technical report, Oak Ridge National Laboratory, 1993.

[12] R. Namyst and J. F. Mehaut. PM2: Parallel Multithreaded Machine: A computing environment for distributed architectures. In ParCo '95 (Parallel Computing), pages 279–285. Elsevier Science Publishers, September 1995.

[13] S. Oaks and H. Wong. Advanced Synchronization in Java Threads. Available via http://www.onjava.com/pub/a/onjava/excerpt/jthreads3_ch6/index1.html, 2004.

[14] M. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, 1994.

[15] Michael J. Quinn. Analysis and implementation of branch-and-bound algorithms on a hypercube multicomputer. IEEE Transactions on Computers, 39(3), March 1990.

[16] M. Reno. Comparing the Performance of Distributed Shared Memory and Message Passing Programs Using the Hyperion Java Virtual Machine on Clusters. Master's thesis, University of New Hampshire, 2003.

Appendix A

Benchmark Data

This appendix presents the data that was used in this document. All graphs cited in these tables use the "average" values for data points on the graphs.

  Environment     Processors   Observed Runtimes            Average    Standard Deviation
  C/MPI                1       167.120, 167.820, 167.440    167.460    0.350
  C/MPI                2        92.020,  91.620,  91.980     91.873    0.220
  C/MPI                4        53.570,  53.500,  53.630     53.567    0.065
  C/MPI                8        44.380,  44.270,  44.140     44.263    0.120
  C/MPI               16        78.290,  77.650,  78.100     78.013    0.329
  Java/Hyperion        1       217.392, 215.549, 216.291    216.411    0.927
  Java/Hyperion        2       119.411, 118.251, 119.623    119.095    0.739
  Java/Hyperion        4        76.487,  77.176,  78.507     77.390    1.027
  Java/Hyperion        8        77.851,  77.721,  77.139     77.570    0.379
  Java/Hyperion       16       115.546, 114.980, 115.463    115.330    0.306

See Figure 3-2 and Figure 3-3 for corresponding graphs.

Table A.1: Data for the Gaussian Elimination Benchmark.

  Environment     Processors   Observed Runtimes            Average    Standard Deviation
  C/MPI                1        35.340,  35.190,  35.170     35.233    0.093
  C/MPI                4        16.810,  16.850,  16.760     16.807    0.045
  C/MPI                8         8.660,   8.690,   8.750      8.700    0.046
  C/MPI               16         4.950,   4.920,   5.130      5.000    0.114
  Java/Hyperion        1        76.412,  76.212,  76.350     76.325    0.102
  Java/Hyperion        4        26.910,  26.765,  26.664     26.780    0.124
  Java/Hyperion        8        21.489,  22.229,  21.416     21.711    0.450
  Java/Hyperion       16       351.615, 352.894, 351.321    351.943    0.836

See Figure 3-4 and Figure 3-5 for corresponding graphs.

Table A.2: Data for the Shallow-Water Benchmark.

  Environment     Processors   Observed Runtimes            Average    Standard Deviation
  C/MPI                1        56.800,  56.800,  56.790     56.797    0.006
  C/MPI                2        29.190,  29.200,  29.200     29.197    0.006
  C/MPI                4        15.560,  15.570,  15.570     15.567    0.006
  C/MPI                8         8.930,   8.760,   8.850      8.847    0.085
  C/MPI               16         5.710,   5.890,   5.700      5.767    0.107
  Java/Hyperion        1        96.986,  98.864,  98.104     97.985    0.945
  Java/Hyperion        2        62.449,  62.488,  62.570     62.502    0.062
  Java/Hyperion        4        32.681,  32.796,  32.664     32.714    0.072
  Java/Hyperion        8        18.440,  18.450,  18.448     18.446    0.005
  Java/Hyperion       16        13.397,  13.407,  13.447     13.417    0.026

See Figure 3-10 and Figure 3-11 for corresponding graphs.

Table A.3: Data for the Matrix Multiplication Benchmark.

  Environment     Processors   Observed Runtimes            Average    Standard Deviation
  C/MPI                1        23.0, 23.0, 24.0             23.333    0.577
  C/MPI                2        36.0, 38.0, 37.0             37.0      1.0
  C/MPI                4        23.0, 23.0, 23.0             23.0      0.0
  C/MPI                8        14.0, 15.0, 15.0             14.667    0.577
  C/MPI               16         9.0,  9.0, 10.0              9.333    0.577
  Java/Hyperion        1       Please see text.
  Java/Hyperion        2       Please see text.
  Java/Hyperion        4       Please see text.
  Java/Hyperion        8       Please see text.
  Java/Hyperion       16       Please see text.

See Figure 3-13 and Figure 3-14 for corresponding graphs.

Table A.4: Data for the Radix Sort Benchmark.
  Environment     Processors   Observed Runtimes            Average    Standard Deviation
  C/MPI                1       115.090, 117.390, 117.480    116.653     1.355
  C/MPI                2        36.660,  36.670,  36.660     36.663     0.006
  C/MPI                4        20.400,  19.880,  20.480     20.253     0.326
  C/MPI                8        18.180,  17.090,  17.870     17.713     0.562
  C/MPI               16         8.990,  12.880,   9.050     10.270     2.262
  Java/Hyperion        1       241.005, 239.581, 239.980    240.189     0.735
  Java/Hyperion        2       238.860, 239.226, 239.582    239.223     0.361
  Java/Hyperion        4       189.410, 202.443, 190.713    194.189     7.178
  Java/Hyperion        8       224.190, 222.661, 200.738    215.863    13.121
  Java/Hyperion       16       450.930, 440.520, 500.440    463.963    32.016

See Figure 3-17 and Figure 3-18 for corresponding graphs.

Table A.5: Data for the Minimal Cost Graph Coloring Benchmark.