EVALUATING THE PERFORMANCE OF HYPERION, A DISTRIBUTED SHARED MEMORY IMPLEMENTATION OF JAVA

BY

Kevin D. Clark

B.S., University of New Hampshire (1993)

THESIS

Submitted to the University of New Hampshire
in partial fulfillment of
the requirements for the degree of

Master of Science

in

Computer Science

May 2005

This thesis has been examined and approved.

Thesis director, Philip J. Hatcher, Professor of Computer Science

R. Daniel Bergeron, Professor of Computer Science

Robert D. Russell, Associate Professor of Computer Science

Date

Dedication

To Sharon,

thanks for all of the love and, of course,

thanks for teaching me to never trust around pizza.

Acknowledgments

This thesis has been a long but worthwhile effort. I would like to recognize some of the people who have helped with this project over the past few years.

First, I would like to thank my wife, Sharon Kunz, who has endured many nights and weekends of her husband disappearing into our office to work on this project. I never would have gotten through this project without her love and support. Second, I would like to thank my thesis director, Professor Philip Hatcher, who has always been very helpful and patient in guiding me through this project. Professor Hatcher offered me countless helpful suggestions during the course of this project, as well as answered numerous questions regarding the internals of Hyperion. Next, I would like to thank the other members of my thesis committee for reading over my thesis drafts and for their suggestions as to how to make the draft better. These were greatly appreciated.

I would like to thank Emily Poworoznek of the UNH Engineering, Mathematics and Computer Science Library for her assistance in tracking down some of the sources used in this document. I would like to thank the UNH Computer Science Department for offering night classes—without these, given my busy career, I never would have been able to work towards my master's degree. I would like to thank the various folks in the computer science department who were kind enough to fix the Star Cluster after events like lightning strikes and blown air-conditioners. Last, but not least, I would like to thank Brian Johnson, who was my first computer science teacher when I was an undergraduate—I credit the solid foundation that I got in computer science to him.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction

2 Background and Related Work
  2.1 The Star Cluster
  2.2 Hyperion
    2.2.1 Motivations
    2.2.2 Hyperion's Compiler
    2.2.3 Hyperion's Runtime Subsystem
      2.2.3.1 Threads
      2.2.3.2 Monitors
      2.2.3.3 Other Synchronization/Communication Mechanisms
      2.2.3.4 Distributed Shared Memory Subsystem
      2.2.3.5 Low-level Communications
      2.2.3.6 Features of Java Not Present in Hyperion
  2.3 MPI
    2.3.1 Communications Operations
  2.4 The Benchmarks
    2.4.1 Overview
    2.4.2 Notes Regarding System Configuration and Data Collection
  2.5 Related Work
    2.5.1 Mathew Reno's Masters Thesis
    2.5.2 TreadMarks
    2.5.3 Split-C

3 Benchmark Comparison
  3.1 Introduction
  3.2 Gaussian Elimination
    3.2.1 Java Implementation
    3.2.2 C/MPI Implementation
    3.2.3 Results and Performance
  3.3 Shallow Water
    3.3.1 Java Implementation
    3.3.2 C/MPI Implementation
    3.3.3 Results and Performance
  3.4 Matrix Multiplication
    3.4.1 Java Implementation
    3.4.2 C/MPI Implementation
    3.4.3 Results and Performance
  3.5 Radix Sort
    3.5.1 Overview
    3.5.2 Split-C
    3.5.3 Java Implementation
    3.5.4 C/MPI Implementation
    3.5.5 Results and Performance
  3.6 Minimal Cost Graph Coloring
    3.6.1 Java Implementation
    3.6.2 C/MPI Implementation
    3.6.3 Results and Performance

4 Hyperion Assessment
  4.1 Lessons learned from porting code to Hyperion
    4.1.1 Memory Management
      4.1.1.1 Ownership
      4.1.1.2 Object Caching
      4.1.1.3 Problems with Object References
      4.1.1.4 Other Memory Management Issues
    4.1.2 Other

5 Conclusions

Bibliography

A Appendix A — Benchmark Data

List of Tables

A.1 Data for the Gaussian Elimination Benchmark
A.2 Data for the Shallow-Water Benchmark
A.3 Data for the Matrix Multiplication Benchmark
A.4 Data for the Radix Sort Benchmark
A.5 Data for the Minimal Cost Graph Coloring Benchmark

List of Figures

2-1 Example of How To Create a Thread in Java
2-2 An Example of Java's Monitors
2-3 An Example MPI Program
2-4 An Example of MPI's Collective Communications Operations
2-5 Collective Communication Operations in MPI
3-1 Global Data Layout for the Gaussian Elimination Benchmark
3-2 Average Runtimes for Gaussian Elimination Benchmark
3-3 Measured Speedup for Gaussian Elimination Benchmark
3-4 Average Runtimes for the Shallow Water Benchmark
3-5 Measured Speedup for the Shallow Water Benchmark
3-6 General Computation/Communications Pattern for the Shallow Water Benchmark
3-7 Simplified Description of a Simple sendrecv() Implementation for Hyperion
3-8 Optimized Computation/Communications Pattern for the Shallow Water Benchmark
3-9 Global Data Layout for the Matrix Multiplication Benchmark
3-10 Average Runtimes for the Matrix Multiplication Benchmark
3-11 Measured Speedup for the Matrix Multiplication Benchmark
3-12 Example Declaration of a "spread" Array in Split-C
3-13 Average Runtimes for the Radix Sort Benchmark
3-14 Measured Speedup for the Radix Sort Benchmark
3-15 Sort Time per Key for the Radix Sort Benchmark
3-16 Description of Dijkstra's Distributed Termination Detection Algorithm
3-17 Average Runtimes for the Minimal Cost Graph Coloring Benchmark
3-18 Measured Speedup for the Minimal Cost Graph Coloring Benchmark
4-1 Example of Clash Between Object-Oriented Style and Hyperion's DSM
4-2 Illustration of an Inefficient Hyperion Memory Use Pattern (1)
4-3 Example of Caching Objects Residing on Other Processors
4-4 Illustration of an Inefficient Hyperion Memory Use Pattern (2)
4-5 A Start at Moving a Thread's Variables to the Stack
4-6 A Sub-Optimal Way To Move Variables to a Thread's Stack
4-7 A Simplified Parallel Harness Class
4-8 Illustration of an Efficient Hyperion Memory Use Pattern

ABSTRACT

EVALUATING THE PERFORMANCE OF HYPERION, A DISTRIBUTED SHARED MEMORY IMPLEMENTATION OF JAVA

by Kevin D. Clark

University of New Hampshire, May, 2005

In this thesis we examine the performance of Hyperion, a high-performance implementation of Java created at the University of New Hampshire (UNH). The intent of Hyperion is to allow programmers to develop their parallel applications in an environment that is common and familiar: multithreaded Java running on a personal workstation equipped with powerful development tools. After the program is developed, it can be moved without modification to a Hyperion system to be run on a cluster of computers. A representative set of parallel benchmarks has been developed in order to examine the performance of Hyperion, written once in pure Java and written again in C using the Message Passing Interface (MPI). The relative performance of each benchmark is compared. If any changes were made to a benchmark in order to obtain adequate performance, these changes are explained. In general, this research shows that it is possible to obtain reasonable performance when executing parallel Java code in Hyperion, but writing Java code that adheres to Hyperion's Distributed Shared Memory (DSM) features requires a great deal of attention to detail. In addition, the rich features in a message-passing library like MPI appear to make it easier to develop applications that perform well on a cluster of computers.

Chapter 1

Introduction

Parallel processing technology is important for solving various large problems. Some problems are too large to reasonably solve with today's processors without resorting to some kind of parallel processing technique. Such problems include simulations of large systems, genetic sequencing problems, fast Fourier transformations run on large data sets, large sorting problems, and real-world optimization problems.

The question of how best to program parallel computers in general has been the subject of much research over the years. Two of the paradigms that have emerged from this research are message-passing and shared-memory systems. Message-passing systems allow programmers to explicitly pass messages between processors acting in concert. Shared-memory systems allow processors acting in concert to access a globally shared memory area. Message-passing systems give the programmer the opportunity to control precisely how and when messages are sent at runtime, but this degree of control does come with added responsibility: the programmer has to precisely insert in his or her code calls to the message-passing Application Programming Interface (API) at every point communication is required. In contrast, shared-memory systems take care of all of the communication details for the programmer.

In recent years there has been much interest in constructing parallel processing systems in terms of clusters of commodity equipment: off-the-shelf PCs running standard operating systems interconnected by commercially available networking equipment. Such systems are relatively inexpensive compared to dedicated parallel machines, relatively easy to assemble, and there exist programming tools and libraries to assist in the use of such clusters.

To this end, a cluster of workstations has been constructed at the University of New Hampshire for the purpose of facilitating research into parallel computation. This cluster (called the "Star Cluster") supports the two different parallel programming paradigms mentioned above—a message passing system and a shared memory system. The message passing system is an implementation of the open Message Passing Interface (MPI) [11] standard called MPICH [6]. The shared memory system is a version of Java developed at the University of New Hampshire called Hyperion. Hyperion provides a Java Virtual Machine (JVM) that conforms to the Java Language Specification (JLS) [5].

Java is an attractive language from a parallel programming perspective for many reasons. Java is a portable, well-defined, widely-available and flexible language that currently enjoys popularity because of its rich feature set. This feature set includes support for threads, synchronization mechanisms, and a thread-safe class library. The language and programming environment are fairly mature at this point, with many excellent programming tools available (in particular, optimizing compilers, debuggers, and a clean standard class library). A parallel application that is developed using the Java programming language has a very good chance of being able to run on tomorrow's computing platforms.

This thesis evaluates the parallel processing performance of Hyperion by comparing it to the performance of a fairly widespread message passing implementation (MPICH). A set of benchmarks has been developed for this purpose. Each benchmark has been implemented twice—once utilizing a threads paradigm (using Java's threads) and once utilizing a message-passing paradigm (using C and MPI). All tests have been run on the same hardware. The MPI-based benchmarks were fairly straightforward to implement, but the Java threads-based benchmarks benefited from additional optimizations due to the way that threads and distributed shared-memory are implemented in Hyperion. All such modifications and optimizations are noted and discussed in this thesis.

In Chapter 2 we discuss Hyperion, MPI, and the benchmarks that have been written in order to evaluate these two parallel programming approaches. In Chapter 3 we discuss the performance results of the benchmarks themselves, including specific modifications that were made to the benchmarks in order to improve their performance. In Chapter 4 a detailed examination of the lessons learned while porting Java code to the Hyperion environment is presented. Finally, conclusions are presented in Chapter 5.

Chapter 2

Background and Related Work

2.1 The Star Cluster

The Star Cluster is a collection of commodity PCs connected together using a private, high-speed network switch. Acting in concert, this cluster of equipment is a parallel computer. The types of parallel programming environments that this cluster supports include Hyperion and MPICH (both described in detail below).

The hardware for this cluster consists of 16 PCs connected via a 100 Mb/s Ethernet switch. Each PC is a 662 MHz Intel Pentium III (Coppermine), configured with 512 MB RAM and SCSI hard disks, running Linux, kernel version 2.4.4.

2.2 Hyperion

2.2.1 Motivations

Hyperion is an implementation of Java version 1.1 that has been created at the University of New Hampshire. The creation of Hyperion addresses several motivations: First, Hyperion provides an execution environment that can harness the power of a cluster of workstations. Second, Hyperion implements a common programming environment: Java. Third, this programming environment is standard—for the most part, there are no language extensions to adjust to, no new APIs to learn, and no required source code changes that would prevent the programmer from using some other Java implementation.

2.2.2 Hyperion's Compiler

Generally speaking, Hyperion is comprised of two parts: a compiler and a runtime environment. Hyperion's compiler does not just translate Java source code into Java bytecode—it also translates this code into raw executable machine code. The compiler does this by taking the output of a standard Java compiler (Java bytecode), translating this into C code, and then running this code through an optimizing C compiler (GCC). This point bears restating: at runtime, the Hyperion system does not interpret Java bytecode; instead the generated machine code is executed. Hyperion's compiler is described in detail in [1].

2.2.3 Hyperion's Runtime Subsystem

Hyperion's runtime system provides all of the functionality that a standard multithreaded Java program would need in order to run (generally unmodified) on a cluster of workstations. A large subset of the Java class library has been implemented and is available for any programmer to use. More importantly, the Hyperion runtime system provides support for Java's thread creation, synchronization, and shared-memory subsystems. These features are central to the work in this thesis; therefore these will be explained in detail here.

2.2.3.1 Threads

In the Java programming environment, the standard way to create a new thread is to use methods in the java.lang.Thread class. For an example of how a new thread is created, please refer to Figure 2-1.

public class SomeClass implements Runnable {

    public static void main(String argv[]) {
        SomeClass sc = new SomeClass();
        Thread thr = new Thread(sc);
        thr.start();

        System.out.println("original thread is running here");

        // wait for newly created thread "thr" to finish
        try { thr.join(); } catch (InterruptedException e) {}
    }

    public void run() {
        System.out.println("newly spawned thread is running here");
    }
}

Figure 2-1: Example of How To Create a Thread in Java.

Hyperion's runtime implements all of the methods in this class, so standard multithreaded/parallel Java applications do not need to be modified in order to run on the Star Cluster. Hyperion's runtime creates threads in the following manner: each thread that is created is distributed in a round-robin manner to a CPU that is participating in the cluster. It is possible to create more threads than processors in the cluster. In this situation, one or more of the processors run two or more threads simultaneously. None of the benchmarks detailed by this thesis use this feature, but occasionally this feature can be useful (an example of this will be presented in Section 3.6.3).
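To make the round-robin placement concrete, the following sketch (not taken from the thesis; the worker class and the node count are illustrative assumptions) spawns one thread per cluster node and then waits for all of them, which matches the one-thread-per-processor structure used by the benchmarks in this thesis:

    public class SpawnPerNode {
        // Hypothetical worker: under Hyperion, each successive Thread created
        // here is placed on the next cluster node in round-robin order.
        static class Worker implements Runnable {
            private final int id;
            Worker(int id) { this.id = id; }
            public void run() {
                System.out.println("worker " + id + " is running on some node");
            }
        }

        public static void main(String argv[]) throws InterruptedException {
            final int numNodes = 16;              // illustrative: the Star Cluster has 16 PCs
            Thread workers[] = new Thread[numNodes];
            for (int i = 0; i < numNodes; i++) {
                workers[i] = new Thread(new Worker(i));
                workers[i].start();               // thread i lands on node i
            }
            for (int i = 0; i < numNodes; i++) {
                workers[i].join();                // wait for every worker to finish
            }
        }
    }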

2.2.3.2 Monitors

Critical sections of code (sections of code that should be executed by only one thread at a time) can be protected with Java's monitors. Hyperion fully supports Java's monitors.

Monitors are used in Java via:

• the synchronized keyword

• the java.lang.Object.wait() method

• the java.lang.Object.notify() method

• the java.lang.Object.notifyAll() method

An example of the use of Java's monitors can be found in Figure 2-2. This example presents a class called FixedIntArrayQueue, which can be used as a queue between reading and writing threads.

The JVM implements monitors in terms of locks. The process of acquiring and releasing locks has certain side-effects on the memory subsystem. These will be discussed in the next section.

2.2.3.3 Other Synchronization/Communication Mechanisms

Java was designed to facilitate general concurrent programming—not parallel computation. Until recently ([13]), there has been no direct support for parallel computation in Java. As such, with the version of Java used with Hyperion (Java 1.1), there are no native primitives for barrier synchronization or collective communications. The fact that this version of Java does not specify any of these commonly used features is not serious. Hyperion's runtime does provide some support for these operations. Currently, Hyperion's runtime provides support for the following operations:

• Array broadcast of simple scalars (floats, doubles, etc.).

• Simple reduction (all threads can find the global maximum, minimum, and sum of simple scalars).

public class FixedIntArrayQueue {

    int m_queueSize, m_numElements = 0,
        m_head, m_tail, m_intArr[];

    public FixedIntArrayQueue(int queueSize) {
        m_queueSize = queueSize;
        m_intArr = new int[m_queueSize];
        m_numElements = m_head = m_tail = 0;
    }

    // current number of queued elements (only called while holding the lock)
    private int numElements() { return m_numElements; }

    public synchronized void insertAtBack(int n) {
        while (numElements() == m_queueSize) {
            try { wait(); } catch (InterruptedException e) {}
        }
        m_intArr[m_tail] = n;
        m_tail = (m_tail + 1) % m_queueSize;
        if (m_numElements++ == 0) notifyAll();
    }

    public synchronized int removeFromFront() {
        int result;
        while (numElements() == 0) {
            try { wait(); } catch (InterruptedException e) {}
        }
        result = m_intArr[m_head];
        m_head = (m_head + 1) % m_queueSize;
        if (m_numElements-- == m_queueSize) notifyAll();
        return result;
    }
}

Figure 2-2: An Example of Java's Monitors.
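A minimal usage sketch for this queue (not part of the original figure) with one producer thread and one consumer thread might look like the following:

    public class QueueDemo {
        public static void main(String argv[]) throws InterruptedException {
            final FixedIntArrayQueue q = new FixedIntArrayQueue(8);

            Thread producer = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 100; i++) q.insertAtBack(i);   // blocks while full
                }
            });
            Thread consumer = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 100; i++) {
                        int n = q.removeFromFront();                   // blocks while empty
                        System.out.println("got " + n);
                    }
                }
            });

            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }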

• Custom reduction—as part of this thesis Hyperion's Reduction class was extended to allow for all processors to collectively determine a global maximum and the array index of that maximum. More details regarding this reduction are presented in the discussion of the Gaussian elimination benchmark in Section 3.2.1.

These operations have been optimized to be efficient in the cluster environment that Hyperion runs in. For example, the barrier synchronization operation uses a minimum of low-level message exchanges (it uses a log P algorithm). (Hyperion's runtime doesn't offer a standard barrier synchronization mechanism—the reality is that Hyperion's reduction operations are the most efficient communications operations available in the system. Developers who need a standard barrier synchronization operation only have to call hyperion.Reduction.sum(0) and ignore the subsequent return value.)
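For example, a barrier can be layered directly on top of the reduction support in exactly the way just described (a minimal sketch; the static hyperion.Reduction.sum(int) call is assumed to behave as described above):

    // A trivial barrier built from Hyperion's reduction support: every thread
    // contributes 0 to a global sum and the result is simply discarded.  No
    // thread returns from barrier() until all threads have entered it.
    public class Barrier {
        public static void barrier() {
            hyperion.Reduction.sum(0);   // collective operation; return value ignored
        }
    }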

2.2.3.4 Distributed Shared Memory Subsystem

Hyperion's Distributed Shared Memory (DSM) subsystem complies with the rules set forth in the Java Language Specification (JLS). A Java programmer who deals with threads must completely understand the memory consistency rules codified in the JLS. In addition to understanding these rules, a Java programmer who wishes to use the Hyperion system efficiently must also understand certain straightforward (but far-reaching) properties of Hyperion's DSM subsystem. All of these issues are detailed in this subsection.

Each thread that Hyperion creates on behalf of the user's program is running in the same Java Virtual Machine (JVM). In turn, this JVM is running on a cluster of workstations (the Star Cluster). The JLS specifies that all of these threads run in a shared memory area (called main memory). Hyperion's Distributed Shared Memory (DSM) subsystem provides main memory that all of the threads running in the distributed JVM can access. Main memory is where the JVM's heap resides (objects that are allocated dynamically use heap memory). (Note that a thread running in a JVM cannot access another thread's stack. This is notably different from typical threads packages that are available for the C programming language. Such access is nearly always an error; the fact that Java disallows such access is much more useful than not.)

As threads in the JVM use regions of main memory, they cache regions of main memory into what is called working memory. Each thread maintains its own region of working memory. The JLS requires that each thread maintain a coherent copy of its own working memory but does not dictate that a thread's working memory be identical to main memory or indeed to some other thread's working memory either.

The JLS requires that threads running in a JVM resolve their differences between their own working memory and main memory when locks are acquired and when locks are released. Locks are one of the components of Java's monitors (described above). This design implies that at a given instant the threads running in a JVM can have a greatly different view of what main memory truly looks like. While this might seem alarming, this design is flexible and still allows the programmer to write programs that are correct. This design is flexible because JVM implementors are not required to ensure that a thread's working memory is always consistent with main memory (this would get more and more inefficient as the number of threads increased). This relaxed design allows the programmer to write correct programs because the programmer can use synchronization mechanisms at strategic points in the program to ensure that every thread has a consistent view of memory when this consistency matters. Without using synchronization at strategic points in a multithreaded/parallel program, the program will likely suffer from race conditions.
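In practical terms, shared data must be read and written under a common lock if one thread's updates are to become visible to another thread at a known point. The following sketch (illustrative only, not taken from the thesis code) shows the pattern:

    public class SharedResult {
        private double value;
        private boolean ready = false;

        // Writer thread: releasing the lock at the end of this method forces
        // the writer's working memory back to main memory.
        public synchronized void publish(double v) {
            value = v;
            ready = true;
        }

        // Reader thread: acquiring the same lock refreshes this thread's
        // working memory from main memory, so the writer's updates are observed.
        public synchronized boolean tryRead(double out[]) {
            if (!ready) return false;
            out[0] = value;
            return true;
        }
    }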

Hyperion's DSM subsystem adheres to all of the memory consistency rules specified by the JLS. However, it also has the following additional property: memory access is non-uniform. In the Hyperion system, memory that is allocated by a thread running on a given processor is local to that processor. Accessing local memory is very efficient. Accessing non-local memory is supported by Hyperion but incurs a performance penalty. The logistics of how Hyperion's DSM subsystem works are explained in detail in [1]. Generally speaking, main memory in Hyperion's JVM consists of a distributed collection of memory that is allocated locally on each processor participating in the cluster. Programmers who wish to use Hyperion's DSM subsystem efficiently must strive to minimize the amount of non-local memory access that each thread performs.
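In practice this means that each thread should allocate and operate on its own slice of the data rather than indexing into one large array allocated by a single thread. A sketch of the preferred pattern (the names are illustrative, not from the thesis code):

    public class LocalSliceWorker implements Runnable {
        private final int sliceSize;
        private double slice[];   // allocated by this thread, hence local to its processor

        LocalSliceWorker(int sliceSize) { this.sliceSize = sliceSize; }

        public void run() {
            // Allocating inside run() means the memory belongs to the processor
            // this thread was placed on, so the loop below touches only local memory.
            slice = new double[sliceSize];
            for (int i = 0; i < sliceSize; i++) {
                slice[i] = compute(i);
            }
        }

        private double compute(int i) { return i * 0.5; }   // stand-in for real work
    }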

2.2.3.5 Low-level Communications

Much of Hyperion's runtime is implemented in terms of the Distributed Shared Memory on a Parallel Multithreaded (DSM-PM2) library [12]. In particular, this library is used for its efficient thread and distributed shared-memory mechanisms. The thread functionality that DSM-PM2 provides is more efficient than the native thread library in this cluster's environment (POSIX threads running on Linux). This is especially true in the area of signal-handling. DSM-PM2 uses signal-handling to implement page-faults—efficiency in this area is crucial to the overall performance of Hyperion's DSM subsystem.

On the Star Cluster, DSM-PM2 has been configured to use Transmission Control Protocol (TCP) as its low-level transport protocol. TCP is a reliable, stream-based protocol that implements various congestion control mechanisms (slow-start, detection of lost/duplicated segments, etc.). It could be argued that TCP is not the optimal choice for the Star Cluster because in this cluster a private switched network is being used. The latency that TCP is generally understood to incur might not be appropriate for high-performance parallel computation, where low message delivery latency is usually prized. Nevertheless, TCP's performance is certainly adequate, and DSM-PM2 uses it for this thesis work.

2.2.3.6 Features of Java Not Present in Hyperion

Java specifies several features that tend to make programming more safe and convenient. Two of these features are garbage-collection and runtime checking of array bounds.

Having a robust garbage-collection system can make dealing with dynamic memory more convenient and help eliminate memory leaks. Hyperion does not currently implement a garbage-collector. Hyperion's DSM architecture complicates the design of any potential garbage-collection system, but the implementation of such a system isn't impossible either. With the absence of a garbage-collector, memory dynamically allocated in a process running via Hyperion is never returned to the free memory pool. Obviously, this means that some Java programs might have to be adjusted to be more frugal with their memory usage when run via Hyperion.

Performing runtime checks on array boundaries helps flag an extremely common programming error. In Java, when a program goes beyond an array's bounds, an exception is thrown. Technically speaking, Hyperion does implement array bounds checking, as dictated by the JLS. However, these checks introduce a significant performance penalty. Array bounds shouldn't be exceeded in properly running programs, so, for the purposes of this thesis, Hyperion has been configured to not perform any array bounds checking. All of the benchmarks that this thesis discusses are scientific programs with fairly regular array bounds. Modern compilers commonly optimize array bounds checking out of such programs. Hyperion's compiler does not happen to implement this optimization, but it no doubt will someday. So, again, turning off array bounds checking seems very reasonable.

2.3 MPI

"Message Passing" refers to an approach to parallel programming whereby processors comprising a parallel computer send each other messages to facilitate the overall parallel computation. Message Passing Interface (MPI) is an industry-standard message passing library [11]. This library has bindings to C, Fortran, and Java. The design of MPI emphasizes correctness, portability, and convenience. In addition, MPI was designed to allow for efficient and scalable implementations (for example, it allows for zero-copy communications operations). MPI itself is an open standard, and quite a few implementations of this standard exist. The implementation that is being used for this thesis is called MPICH [6]. Please refer to Figure 2-3 for an example MPI program.

2.3.1 Communications Operations

MPI provides a convenient API for developers to program with. With this API, a developer no longer has to code to whatever API the vendor of their target parallel machine has provided. This facilitates portability. The set of communications operations offered by MPI is feature-rich—sometimes more feature-rich than the set of operations offered by the vendor's API. Sometimes this can be very useful. A simple example of a useful communications operation is shown in Figure 2-4.

MPI offers a rich variety of communications operations:

• Barrier synchronization.

• Blocking and non-blocking send/receive operations.

• Operations that combine sending and receiving.

• Operations that work on scalars and arrays.

#include <stdlib.h>
#include "mpi.h"

int whoami;          /* which node number am I? */
int numProcessors;   /* total number of processors running this program */

enum { SAMPLE_DATA1, SAMPLE_DATA2, ETC };    /* message types */

void do_work(double data[], int size);       /* arbitrary calculation, defined elsewhere */

int main(int argc, char *argv[]) {
    enum { DATA_SIZE = 200 };
    double data[DATA_SIZE], data2[DATA_SIZE];
    MPI_Status status;

    MPI_Init(&argc, &argv);

    /* get number of processes in the process group */
    MPI_Comm_size(MPI_COMM_WORLD, &numProcessors);

    /* get this process's rank (ID) within the process group */
    MPI_Comm_rank(MPI_COMM_WORLD, &whoami);

    MPI_Barrier(MPI_COMM_WORLD);     /* barrier synchronization */

    do_work(data, DATA_SIZE);        /* perform arbitrary calculation */

    if (whoami == 0) {
        /* send data[] to processor 1 */
        MPI_Send(data, DATA_SIZE, MPI_DOUBLE, 1,
                 SAMPLE_DATA1, MPI_COMM_WORLD);
    }
    else {
        /* blocking receive; wait for SAMPLE_DATA1 data from processor 0 */
        MPI_Recv(data2, DATA_SIZE, MPI_DOUBLE, MPI_ANY_SOURCE,
                 SAMPLE_DATA1, MPI_COMM_WORLD, &status);
    }

    MPI_Barrier(MPI_COMM_WORLD);     /* barrier synchronization */

    MPI_Finalize();
    return EXIT_SUCCESS;
}

Figure 2-3: An Example MPI Program.

/* every processor executes this code... */
int local_sum, global_sum;

local_sum = calculate_local_sum();

MPI_Allreduce(&local_sum, &global_sum, 1,
              MPI_INT, MPI_SUM, MPI_COMM_WORLD);

/* at this point global_sum contains the sum of all
   of the local_sum values in the process group */

Figure 2-4: An Example of MPI's Collective Communications Operations.

• Operations that allow the developer to specify a custom datatype and then use this datatype just like any other MPI datatype.

• A rich variety of "collective communications operations," including:

  – Reduction to a single processor.

  – Reduction to all processors.

  – Broadcast to all processors. This operation is illustrated in Figure 2-5.

  – Scatter data from one processor to all other processors. This operation is illustrated in Figure 2-5.

  – Gather data from all processors to a single processor. This operation is illustrated in Figure 2-5.

  – Gather data from all processors and collect the result on all processors (this is called "allgather"). This operation is illustrated in Figure 2-5.

  – A combination scatter/gather operation (sometimes called a "full-exchange"). In MPI terminology this is called "alltoall." This operation is illustrated in Figure 2-5.

  – Prefix scan operations.

Figure 2-5: Collective Communication Operations in MPI. (Diagram illustrating how broadcast, scatter, gather, allgather, and alltoall move data among processes.)

The MPI specification does not dictate that implementations of these communications operations are efficient. Of course, implementors of the MPI specification are encouraged to make their MPI implementations (such as MPICH) as efficient as possible.

The MPI specification does not dictate which low-level transport protocols should be used in implementations of MPI. MPICH is flexible in this regard; it can be configured to use various other parallel communications systems as its backend (like Intel NX, MPL, Ncube's communication library, etc.). For this particular thesis, MPICH's "ch_p4" backend has been employed. This backend uses TCP as its transport protocol. Everything that was said regarding Hyperion/DSM-PM2's use of TCP (above) also applies to MPICH's use of TCP.

2.4 The Benchmarks

2.4.1 Overview

The benchmarks that this thesis discusses have been chosen because it was thought that these were a fairly representative set of parallel applications. It is thought that the performance of these benchmarks will provide a good insight into the true performance of Hyperion. These are the benchmarks:

• Gaussian elimination to solve a system of linear equations.

• A shallow water simulation.

• Matrix multiplication.

• A radix sort, as described in [3].

• A minimal cost graph-coloring program.

Each benchmark has been implemented twice: once in Java and once in C/MPI. In the initial development of most of the Java versions of these benchmarks no attention was paid to Hyperion's runtime characteristics (especially the properties of its DSM subsystem). The performance of the initial versions of these Java benchmarks in Hyperion was generally abysmal and these results are not worth discussing here. However, what this thesis does present are the steps that had to be performed in order for the Java versions of the benchmarks to perform adequately with Hyperion. In each benchmark section, the performance of the C/MPI and Hyperion versions are compared.

2.4.2 Notes Regarding System Configuration and Data Collection

Hyperion is an implementation of Java 1.1. It uses the Sun Java 1.1 Java Development Kit compiler to generate bytecode. Next, the Hyperion compiler generates C code from this Java bytecode. After this, the GCC 3.0 compiler is used to compile this C code into executable code. Hyperion invokes gcc with the "-O6" flag for optimization. At the time of execution, Hyperion's runtime invokes many services from the DSM-PM2 package. As configured, Hyperion doesn't have a garbage-collector nor does it perform any array bounds checks (both of these are discussed above).

The C/MPI environment uses GCC 3.0 and MPICH [6] version 1.2.2. All of the C/MPI code in these benchmarks was compiled with gcc's "-O6" optimization flag.

All of the data presented in the benchmark sections in the next chapter were generated as the result of running each benchmark three times and then generating an average of these results. Using averages in this case is justified because, for the most part, all of the runtime data collected in these experiments is relatively stable. Most of the calculated standard deviation values for the collected data are less than 1.2. The largest standard deviation value for any of the collected data is 32.016 (for the minimal graph-coloring benchmark). This large calculated standard deviation value does not detract from the overall analysis of the minimal graph-coloring benchmark, since the corresponding C/MPI program consistently runs in hundreds of seconds less than the Hyperion version of this benchmark (over thirteen times the value of the standard deviation in this case). The use of averages in this thesis clearly conveys the performance of each version of each benchmark. Interested readers are invited to look at the appendices of this document if they are interested in the original data.

2.5 Related Work

2.5.1 Mathew Reno's Masters Thesis

Reno's thesis [16] examines the performance of parallel benchmarks obtained from the Java Grande Forum (JGF). Two versions of these benchmarks are available: the first is implemented using a message-passing style and the second is written in a shared-memory style. The message-passing versions of these benchmarks use a variant of MPI called javaMPI. The shared-memory versions of these benchmarks are written using standard Java. After porting javaMPI to the Hyperion JVM, both versions of these benchmarks were run with Hyperion on two different clusters: the Star Cluster and the Paraski Cluster. The Star Cluster is the same cluster as used in this thesis, while the Paraski Cluster is another cluster of workstations located in Rennes, France. The Paraski Cluster consists of 16 Pentium III 1 GHz PCs interconnected by a 2 Gb/s Myrinet network. The Paraski Cluster's network has more bandwidth and has a lower latency than the Star Cluster's network. Reno's thesis showed that it was possible to get the shared-memory versions of these benchmarks to run in times comparable to those of the message-passing versions of these benchmarks. However, the original shared-memory versions of the benchmarks from the JGF assumed a uniform-access shared-memory environment. This assumption initially led to unacceptable performance figures from the shared-memory versions of these benchmarks. After these benchmarks were modified to no longer assume a uniform-access shared-memory environment, their performance figures became comparable with the message-passing versions of these benchmarks. In particular, when the shared-memory benchmarks were run on the Paraski Cluster, with its lower latency network, the performance of these benchmarks became much more comparable. The modifications that were required for this port generally entailed transforming these shared-memory benchmarks into benchmarks that resembled their message-passing counterparts.

2.5.2 TreadMarks

TreadMarks [8] is a parallel computing environment with a distributed shared-memory subsystem (implemented in software). TreadMarks's creators have benchmarked this system against PVM (another message-passing system, somewhat similar to MPI) [10]. For this comparison, TreadMarks's creators used eight benchmarks from the SPLASH benchmark suite. The TreadMarks system provides a C API and is portable to a number of Unix systems. Like Hyperion, global memory in TreadMarks consists of the sum total of each participating processor's local memory.

TreadMarks's memory consistency scheme is somewhat similar to the scheme that the JLS dictates (threads need to use locks in order for remote memory changes to be seen by other processors). However, one difference between these systems is that TreadMarks implements a "lazy" release consistency scheme whereas Java implements an "eager" release consistency scheme. The JLS dictates that changes to a thread's working memory are flushed back to main memory when a lock is relinquished, whereas in the TreadMarks system changes are propagated when locks are acquired. In the TreadMarks environment, access to local memory is efficient and access to non-local memory incurs a performance penalty.

The researchers who benchmarked TreadMarks against PVM used two different platforms to gather data from; the first platform consisted of eight Sparc-20 model 61 workstations connected by a 155 Mb/s ATM network and the second consisted of an eight-processor IBM SP/2.

The results of the comparison between TreadMarks and PVM were as follows:

• Each TreadMarks version of each benchmark issued more low-level communications messages than the corresponding PVM versions of each benchmark.

• In comparison to the CPU speeds of the processors in each of these parallel computers, the speed of network communication is very slow.

• The observed TreadMarks speedup for six out of the eight benchmarks was within 15% of the observed PVM speedup; for the seventh application the difference was between 15–30%, and for the last the difference was almost 50%.

Many programmers prefer coding parallel programs using a shared-memory style, and a 15% discrepancy between a shared-memory implementation and a message-passing implementation might be small enough for them to accept.

2.5.3 Split-C

The radix sort benchmark that this thesis discusses was derived from one of the benchmarks for Split-C [2]. Split-C is a variant of the C programming language with extensions for parallel processing. These extensions include a mix of functionality that allows a programmer to use shared-memory, message-passing, and even data-parallel techniques. Split-C offers the following features:

• A Single Program, Multiple Data (SPMD) environment.

• A distributed, shared memory subsystem. Global memory consists of the sum total of all of the local memory segments of the processors involved in the parallel computation (just like Hyperion). Any processor may access any of the global memory, but accessing local memory is efficient and accessing non-local memory is more expensive (just like Hyperion). Unlike Hyperion, in Split-C access to non-local memory must be coded explicitly, typically by using global pointers and arrays. In Hyperion, access to non-local memory happens without any special code and the runtime system performs this process automatically (it will be shown in Section 4.1.1.3 that this feature isn't always desirable).

• "Spread arrays"—Split-C allows the programmer to define global arrays whose indices include the processor number. Such arrays are spread across all of the processors participating in the parallel computation. The programmer can operate on such arrays in a data-parallel manner if they so wish.

• "Split-phase" memory usage operators—like Hyperion, Split-C's DSM subsystem enforces a relaxed memory consistency scheme. Split-C's split-phase operators allow a programmer to pack together memory modifications so that they can be performed in bulk or overlapped with computation.

• Split-C provides the programmer with efficient bulk data transfer operations, which allows a programmer using Split-C to gain some of the advantages of message-passing. Hyperion offers a similar feature through its java.lang.System.arraycopy() implementation.
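For instance, rather than assigning remote array elements one at a time (each assignment potentially touching non-local memory), a thread can move a whole block in a single call; a small sketch, where remoteBuffer is assumed to be an array that was allocated by a thread on another processor:

    // Copy an entire block of local data into an array owned by another
    // processor in one bulk operation instead of element by element.
    static void bulkCopy(float localBlock[], float remoteBuffer[], int count) {
        System.arraycopy(localBlock, 0, remoteBuffer, 0, count);
    }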

The designers of Split-C developed language extensions to C in order to facilitate an environment for developing parallel programs. This differs from the approach taken by the designers of Hyperion in one very important manner: the designers of Hyperion purposely decided that developing extensions to a well-known language was not desirable. The designers of Hyperion felt that language extensions made potential users of a system apprehensive, not wanting to get trapped into using a non-standard language that might not be around for the life of their parallel application.

Chapter 3

Benchmark Comparison

3.1 Introduction

This chapter describes the benchmarks used in this thesis. In each section, a comparison is made between the observed performance of the Java/Hyperion and the C/MPI versions of each benchmark. The Java/Hyperion versions of these benchmarks had to be modified from their initial implementations in order for their performance to be comparable to the C/MPI versions of these benchmarks. In each section, these modifications are discussed.

As a reminder, all of the data in this chapter was collected in the following manner: each benchmark program was run three times. An average was calculated from these three samples. In the graphs presented in this chapter, the values that are presented are based on these calculated averages. All of the raw benchmark data appears in the appendices of this document.

3.2 Gaussian Elimination

This benchmark involves performing Gaussian elimination on a system of linear equations, represented by a 2048 × 2048 matrix of floats. Two programs were developed for this comparison, the first in Java and the second in C/MPI. The initial implementation of these programs was based on code written by Professor Philip Hatcher.

Figure 3-1: Global Data Layout for the Gaussian Elimination Benchmark. (In the four-processor illustration, the rows of A and B are divided into contiguous blocks: Processor 0 owns rows 0–3, Processor 1 rows 4–7, Processor 2 rows 8–11, and Processor 3 rows 12–15.)

3.2.1 Java Implementation

The Java version of this benchmark is fairly straightforward, and yet has been heavily modified from the original shared-memory version. The original version of this benchmark used a static array in which all of the processors worked; in the updated benchmark each processor maintains its own portion of the global array. Figure 3-1 illustrates this memory arrangement. By making this change, memory contention and network communication are greatly reduced. This single change transformed this benchmark from being unusable in a Hyperion environment to being a benchmark that performs reasonably.

At each step during the process of Gaussian elimination a process of finding the best pivot row in the global matrix needs to be performed (this is known as "partial pivoting" [14]). This step involves all of the processors collectively agreeing on which row in the linear system contains the variable with the largest absolute value. It is important to note that this process involves two pieces of data:

• the variable's absolute value

• the row in which the variable occurs

In the original version of this benchmark, partial pivoting was easy to implement since all of the data was assumed to be easily at hand. With the changes described above, this process was no longer straightforward. Hyperion provides an efficient Reduction class, but this class does not provide the ability to reduce these two variables atomically. To complicate matters, the back-end of Hyperion's efficient Reduction class isn't written in Java—it is implemented in C. The task of modifying Hyperion's Reduction class for this benchmark's purposes was accomplished with some assistance by extending some of the low-level Reduction code that is part of the Hyperion runtime system. At this time, if a parallel programmer needs to use any sort of customized Reduction operation with Hyperion, they need to resort to implementing this operation in C, much in the same way that was done for this benchmark. After the location of the pivot row is collectively determined by the threads performing Gaussian elimination, the contents of the pivot row are distributed from the row's owner to all of the other threads serially.
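The per-thread half of this step is ordinary Java: each thread scans the rows it owns for the largest absolute value in the current column, and the resulting (value, row) pair is what the extended reduction then combines across all threads. A sketch (the variable names are illustrative, not taken from the benchmark code):

    // Find this thread's best pivot candidate in column col: the largest
    // absolute value among the locally owned rows, returned as a global row
    // index.  The (value, row) pair is then combined across all threads by
    // the extended Reduction operation described above.
    static int localPivotRow(float localA[][], int col, int firstGlobalRow) {
        double bestValue = -1.0;
        int bestRow = -1;
        for (int r = 0; r < localA.length; r++) {
            double v = Math.abs(localA[r][col]);
            if (v > bestValue) {
                bestValue = v;
                bestRow = firstGlobalRow + r;   // translate local row to global row number
            }
        }
        return bestRow;
    }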

3.2.2 C/MPI Implementation

The updated Java implementation of this benchmark is very similar to the C/MPI version of this benchmark. The partial pivoting code in the MPI benchmark uses MPI's MPI_Allreduce() function. This function provides an efficient and easy-to-extend reduction operation. After the pivot row is determined, this row is distributed from the processor that owns the row to all of the other processors serially. MPI offers facilities to "broadcast" data like this from one processor to many others (see Figure 2-5), but this feature wasn't used in this particular benchmark.

Figure 3-2: Average Runtimes for Gaussian Elimination Benchmark.

3.2.3 Results and Performance

Figure 3-2 presents the raw runtimes obtained in this benchmark. The C/MPI implementation of this benchmark is quite a bit faster than the Java/Hyperion implementation of this benchmark. This is a general issue that we do not address in this thesis. The rationale for this decision is that this report concerns the subject of parallel computation and not compilation and Java Virtual Machine (JVM) performance. The rest of this report focuses on speedup calculations and not on raw running times, which will insulate this discussion from these details. (It should be further noted that these results in no way contradict Reno's thesis [16]. This thesis compares benchmarks written in C/MPI to benchmarks written in Java/Hyperion; Reno's thesis compares benchmarks written in javaMPI (ported to Hyperion) to shared-memory benchmarks written in Java (also ported to Hyperion).)

Figure 3-3: Measured Speedup for Gaussian Elimination Benchmark.

Figure 3-3 presents the calculated speedup for these benchmarks. This figure shows that the speedup obtained from running these benchmarks in parallel is comparable between the Hyperion and C/MPI versions of these benchmarks. Gaussian elimination is a problem that does not parallelize linearly as the number of processors increases—at some point there isn't enough work for each processor to do. For this particular benchmark, this fact becomes especially apparent when the number of processors increases from 8 to 16 processors.
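For reference, the speedup values plotted in this chapter are assumed to follow the standard definition, computed from the averaged runtimes described in Section 2.4.2:

    S(p) = T_1 / T_p

where T_1 is the average runtime on one processor and T_p is the average runtime on p processors.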

3.3 Shallow Water

The shallow water benchmark is a two-dimensional simulation of a non-compressible mass of particles as they rotate on a hypothetical flat earth (represented by a 256 × 256 array of floats). This simulation uses the physical properties of mass, inertia, gravity, etc. as components of this simulation. Although this simulation is only performed in two dimensions, in many real-world systems of fluids (water, air, etc.) the horizontal component of each particle's motion greatly exceeds the vertical component of the particle's motion. So, while this simulation only considers motion in two dimensions, it is still a useful model for approximating the motion of particles in such a fluid system.

The particular algorithm used in this benchmark is described in [7]. The code that this benchmark is based upon comes from implementations written by Professor Philip Hatcher and Kenneth Cain. A comment in the inherited code reads:

This program solves the system of shallow water equations on a rectangular grid using a finite differences method. It assumes periodic boundary conditions and uses the leapfrog time differencing scheme with a time filter.

This benchmark involves a fair amount of computation at each step, in addition to the usual cooperation between processors. This benchmark runs 1200 time steps, and at every time step each processor communicates with its neighbors at each compass point on a logical grid.

3.3.1 Java Implementation

The original Java implementation of this benchmark developed for this thesis was based on a shared-memory program originally written by Professor Philip Hatcher (written for C/POSIX threads). This code was not oriented toward Hyperion's distributed shared-memory architecture—all of the processors did their work in a static array allocated by Processor 0. The initial implementation of this benchmark performed extremely poorly as a result of the ensuing memory bottleneck. An attempt to remedy this deficiency was undertaken. However, the data-dependencies within this program made this task very difficult. The task of splitting up all of the arrays that this benchmark uses in the shared-memory version of this benchmark into a distributed set of arrays (along with all of the associated code changes) proved to be time-consuming and complicated. The C/MPI version of this benchmark had been recently completed (see below); therefore, it was decided that the most prudent thing to do would be to convert the C/MPI version over to Java/Hyperion. Using the C/MPI version of this benchmark as a starting point made the development of the Hyperion version much easier, because the C/MPI version of the code was already written with the assumption that each processor would allocate and use its own memory. The communication pattern used by the C/MPI version of this benchmark is very regular and it was straightforward to come up with a Hyperion implementation of the one communication operator that this program uses over and over—sendrecv(). Generally speaking, this operation is used to transport processor grid edges to neighboring processors. Virtually all of the communication performed by this benchmark is implemented in terms of this operation. Except for having its own optimized Java implementation of sendrecv(), the Java/Hyperion implementation of this benchmark is nearly identical to the C/MPI implementation.

3.3.2 C/MPI Implementation

The C/MPI version of this benchmark is based on code originally written by Kenneth Cain. Cain's version appears to have been derived from a C* version of this code. In fact, Cain's version appears like it was derived from the output of the backend of a C* compiler. It was fortuitous that the author of this thesis came across Cain's code, because converting the original C* code to C/MPI was time-consuming and complicated.

3.3.3 Results and Performance

Currently, both parallel implementations of this benchmark are limited to running on either 4, 8, or 16 nodes. Other processor arrays could be used in the future with a small amount of work.

Figure 3-4: Average Runtimes for the Shallow Water Benchmark.

The performance of these benchmarks is summarized in Figure 3-4 and Figure 3-5. These two benchmark implementations exhibit virtually identical speedup for 4 and 8 processor clusters, but at 16 processors the Java version of this benchmark exhibits notably poor speedup. The reasons for this are discussed here.

The shallow-water benchmark has a large number of computation and communications steps. The general computation-communication pattern for the shallow-water benchmark is presented in Figure 3-6.

The shallow-water benchmark makes extensive use of a sendrecv() operation. This is a common operation in parallel computation. In essence, when Processor x performs a sendrecv() operation, it sends a message to some Processor y and receives a message from some Processor z. The sendrecv() does not return to the caller until both the send and the receive operation are complete. In the most general case, x, y, and z can all be the same or different.

Figure 3-5: Measured Speedup for the Shallow Water Benchmark.

    do 1200 times
        computation
        Five (5) calls to sendrecv()
        computation
        Six (6) calls to sendrecv()
        computation
    done

Figure 3-6: General Computation/Communications Pattern for the Shallow Water Benchmark.

Hyperion's runtime does not provide a sendrecv() operation, so one was implemented for this thesis (other benchmarks in this thesis use this operation too). In order for this sendrecv() implementation to be simple and fast, this implementation of sendrecv() does not handle all of the things that a more general sendrecv() might. In the shallow-water benchmark, when sendrecv() is used, every processor is performing an operation that can be generally described like "send a chunk of data to the north and receive a chunk of data from the south"—every processor acts in lockstep for these operations. The version of sendrecv() that was written for this benchmark is designed to handle these communications patterns. This version of sendrecv() is listed in Figure 3-7.

Ideally, a simple sendrecv() operation such as is required for this benchmark would only involve communication and synchronization between two processors. Experimentation has shown, however, that sendrecv() implemented in this manner in terms of Java's native synchronization methods (synchronized, wait(), notify(), etc.) is extremely slow when run in Hyperion. At 16 processors, the runtimes observed for the shallow-water benchmark with this straightforward sendrecv() implementation were at least an order of magnitude higher than what is reported here in the appendices. In addition, the amount of generated network traffic transmitted during runtime increased tremendously. The root-cause of this problem remains unknown at this time.

It was necessary to find another way to perform this sendrecv() operation because the shallow-water benchmark performed poorly in Hyperion. Hyperion's System.arraycopy() was not suspected of being the culprit (this was already highly optimized for Reno's thesis [16]), and so this left reconsidering which synchronization operation to use. (Although it does need to be pointed out that System.arraycopy() differs from an operation like MPI's MPI_Send() in one very important respect: when copying data to a remote processor with this method, System.arraycopy() does not return to the caller until it is confirmed that the data has been properly copied to the destination. MPI_Send(), in the case of the Star Cluster, utilizes TCP. MPICH's MPI_Send() depends on TCP's property of reliable transport, and this function simply returns to the caller when the data to be sent has been copied to the outgoing TCP buffer. System.arraycopy()'s confirmation that the data has safely arrived at a remote processor is an extra step that isn't performed in MPICH. MPICH is more efficient here. This line of reasoning also applies to implementations of sendrecv().)

class ShallowMP implements Runnable {

    // this class implements the shallow-water
    // benchmark, in parallel

    /* shallowArray[] is a static array of "ShallowMP" instances.
       Each ShallowMP is associated with a single processor. */
    static ShallowMP shallowArray[];

    FloatQueue sendrecv_fq[];   // indexed by message tag value

    // ....

    void sendrecv(float sendbuf[], int sendcount, int dest,
                  int sendtag, float recvbuf[], int recvcount,
                  int source, int recvtag)
    {
        // hand the outgoing data to the destination processor's queue
        FloatQueue fq = shallowArray[dest].sendrecv_fq[sendtag];
        fq.insertAtBack(sendbuf, sendcount);

        // block while waiting for an incoming message
        // of type "recvtag"
        float fqarr[] = sendrecv_fq[recvtag].removeFromFront();
        int smaller = (recvcount < fqarr.length) ? recvcount : fqarr.length;
        System.arraycopy(fqarr, 0, recvbuf, 0, smaller);
    }
}

Figure 3-7: Simplified description of a simple sendrecv() implementation for Hyperion. The FloatQueue class is extremely similar to the code presented in Figure 2-2, except that this class handles arrays of floats instead of single integers.
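A typical call in the shallow-water code exchanges one grid edge with a neighbor in lockstep; a hedged usage sketch against the interface in Figure 3-7 (the rank variables and tag value are illustrative, not thesis code):

    // Inside ShallowMP: send this processor's top row to the neighbor to the
    // north and receive the southern neighbor's edge into ghostRow.
    void exchangeNorthSouth(float topRow[], float ghostRow[],
                            int northRank, int southRank) {
        final int EDGE_TAG = 3;   // illustrative tag; sendrecv_fq[] is indexed by tag
        sendrecv(topRow, topRow.length, northRank, EDGE_TAG,
                 ghostRow, ghostRow.length, southRank, EDGE_TAG);
    }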


In this case, the fastest synchronization primitive that the Hyperion system offers is a reduction operation (hyperion.Reduction.sum(0)—this acts as a barrier synchronization). This operation is a global operation, so when sendrecv() is implemented in terms of this operation it can no longer be said that sendrecv() merely involves synchronization between two processors—now it involves every processor.

Finally, we arrive at a theory as to why the Hyperion version of this benchmark performs poorly with 16 processors:

1. Experimentation shows that using hyperion.Reduction.sum(0) is the fastest method of synchronization for use in a sendrecv() implementation.

2. It can be argued that a global reduction isn't really the right operation for this task. Every time such an operation is used, every processor must arrive at a given point and wait for all other processors to reach this point as well. This can result in processors stalling. The greater the number of processors participating in the calculation, the larger the chance becomes that some of the processors will have to wait around for a slightly slower processor.

3. This global reduction is called many times in a loop that executes 1200 times. To give the reader a clearer picture, each sendrecv() in Figure 3-6 translates into two calls to hyperion.Reduction.sum(0). So, the algorithm in Figure 3-6 calls hyperion.Reduction.sum(0) 1200 × ((5 × 2) + (6 × 2)) = 1200 × 22 = 26,400 times.

4. The results presented in Figure 3-5 would be even worse except for the fact that further optimizations to sendrecv() are possible in this benchmark. These optimizations depend on the highly regular communication pattern utilized by this benchmark, i.e. if at every iteration every processor sends data to the processor to the north, then each processor can allocate a fixed buffer to receive incoming data from the south (and repeat for every cardinal and inter-cardinal compass direction). Such optimizations are not possible in a more general sendrecv() implementation (the MPI implementation of this operation cannot be optimized in this manner). By optimizing and restructuring the code to be like what is presented in Figure 3-8, a large number of calls to hyperion.Reduction.sum(0) were avoided. Instead of calling hyperion.Reduction.sum(0) 22 times per iteration (as described in the previous paragraph), this synchronization operation was only called 4 times per iteration.

    do 1200 times
        computation
        Five (5) calls to modified sendrecv() -- NO SYNCHRONIZATION
        hyperion.Reduction.sum(0)
        data copying related to previous sendrecv() calls
        hyperion.Reduction.sum(0)
        computation
        Six (6) calls to modified sendrecv() -- NO SYNCHRONIZATION
        hyperion.Reduction.sum(0)
        data copying related to previous sendrecv() calls
        hyperion.Reduction.sum(0)
        computation
    done

Figure 3-8: Optimized Computation/Communications Pattern for the Shallow Water Benchmark.

When the number of processors involved in the shallow-water benchmark is increased to 16 processors, the performance of this benchmark seems to suffer greatly for all of the aforementioned reasons. Unfortunately, at this time it is not possible to run programs executed in the Hyperion environment in conjunction with an execution profiling tool, so it is difficult to determine the exact cause of the slowdown. This subject is discussed further in Section 4.1.2.

With 16 processors running this benchmark on a 256 × 256 matrix, the performance of this benchmark suffers from another problem as well: the amount of computation that needs to be performed per processor is relatively small. This is discussed further in Section 3.4.3.

3.4 Matrix Multiplication

This benchmark involves a row-oriented matrix multiplication (A × B = C). Each matrix is a 2D matrix consisting of 1024 × 1024 floats. Both the C/MPI and Java/Hyperion versions of this benchmark were derived from serial code obtained from Professor Philip Hatcher.

3.4.1 Java Implementation

Being cognizant of the properties of Hyperion's DSM subsystem, this implementation was from its inception written such that processors maximized use of their own localized memory and minimized use of remote memory. The resulting program looks very similar in structure to a message-passing program.

3.4.2 C/MPI Implementation

The C/MPI implementation of this benchmark is a straightforward derivation of the original serial C code. Each processor owns a "chunk" of the matrices being worked with. This data layout is illustrated in Figure 3-9. After Processor x gets done working with the chunk of matrix B that it currently has in its possession, it passes this chunk off to Processor (x + 1) mod numprocessors. After each processor has worked with all of the chunks from the matrix B, each processor owns a chunk of the end-result matrix C. Matrix C is checked for correctness at the end of the calculation. MPI's sendrecv() is the main communications workhorse in this benchmark.

Figure 3-9: Global Data Layout for the Matrix Multiplication Benchmark (A × B = C, with each matrix divided into per-processor chunks).
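The chunk-rotation loop just described has a simple per-thread structure, which the Java/Hyperion port mirrors; a sketch of that structure (the helper methods are stand-ins, not the actual benchmark code):

    // Sketch of the multiply-and-rotate loop.  Each of the p processors owns a
    // horizontal chunk of A and C and starts with one chunk of B; after every
    // multiply step the B chunk is passed to processor (x + 1) mod p, so after
    // p steps every A chunk has been combined with every B chunk.
    abstract class RotatingMultiply {
        abstract void multiplyChunk(int bChunkIndex);           // C_local += A_local * B_chunk
        abstract void exchangeChunks(int sendTo, int recvFrom); // pass the B chunk along the ring

        void multiplyOwnedRows(int whoami, int numProcessors) {
            int bChunk = whoami;   // index of the B chunk currently held
            for (int step = 0; step < numProcessors; step++) {
                multiplyChunk(bChunk);
                exchangeChunks((whoami + 1) % numProcessors,
                               (whoami - 1 + numProcessors) % numProcessors);
                bChunk = (bChunk - 1 + numProcessors) % numProcessors;
            }
        }
    }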

3.4.3 Results and Performance

The performance of these benchmarks is summarized in Figure 3-10 and Figure 3-11. These two benchmark implementations exhibit virtually identical speedup for 2, 4 and 8 processor clusters. When a 16 processor cluster is used, the Hyperion version of this benchmark lags behind the MPI version of this benchmark in terms of speedup. The performance of the MPI version of this benchmark is similar to the results reported by [14].

Figure 3-10: Average Runtimes for the Matrix Multiplication Benchmark.

Figure 3-11: Measured Speedup for the Matrix Multiplication Benchmark.

The reason for this benchmark's sub-optimal performance on the Java/Hyperion 16-node cluster is presently unknown. The code for the C/MPI and Java/Hyperion programs is very similar. A small difference between the two implementations is that the C/MPI program enjoys a small advantage over the Hyperion implementation: both of these programs use 2D arrays, and the fact that the memory for a 2D array is contiguous in C makes copying matrix data from one processor to another slightly more efficient. The Java/Hyperion version of this benchmark attempts to minimize this difference by using a 1D staging array to copy data to and from each processor. Copying data to and from this staging area adds additional memory-copying to every interprocessor-communication operation, but overall, this should be a relatively inexpensive operation (the additional memory copies are all local). The Java/Hyperion version of this program is extremely similar to the C/MPI version of the program—in fact, a sendrecv() implementation is also used in this program as well. The Java/Hyperion version of this benchmark may in fact be suffering from the same problem as the Java/Hyperion shallow-water benchmark. In particular, the shallow-water benchmark employs an O(n²) algorithm and operates on a relatively small problem, whereas the matrix multiplication benchmark employs an O(n³) algorithm and operates on a relatively large problem. These two benchmarks could be suffering from the same problem, but in fact the shallow-water benchmark might be displaying this problem much more prominently because, compared to the matrix multiplication benchmark, it is dominated by communication instead of computation.
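The staging pattern mentioned above reduces to flattening a block of rows into a 1D array before the transfer and unpacking it the same way on the receiving side; a sketch (the names are illustrative, not from the benchmark code):

    // Flatten a block of rows of a 2D matrix into a 1D staging array before a
    // bulk transfer.  The extra copies are purely local and therefore cheap
    // compared with the remote copy itself.
    static void packRows(float matrix[][], int firstRow, int numRows, float staging[]) {
        int cols = matrix[0].length;
        for (int r = 0; r < numRows; r++) {
            System.arraycopy(matrix[firstRow + r], 0, staging, r * cols, cols);
        }
    }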

3.5

3.5.1

RadixSort

Overview

Forthisbenchmark,aparallelradixsortbenchmark(asdescribedin[3])wasportedtobothJavaandC/MPI.Thisbenchmarkisaparallel“straightradixsort”—integerkeysaresortedfromleast-significant-bittomost-significant-bit.Moreprecisely,thissortworksasfollows:Theitemstobesortedareb-bitkeys(inthiscase,b=32).Thesekeysaresortedusinganr-bitradix(inthiscase,r=8).Ther-bitradixspecifiesanr-bitdigitwithineachb-digitkey(muchlike“3”isadigitinthebase-10number“1234”).Thissortingprocessusesb/rpassestocompletethesortingprocess.Ateachpassinthesorteachkeyispartially


sorted according to the r-bit digit associated with the pass. This step is accomplished by forming a global histogram of the keys/digits being sorted in the given pass and by then redistributing the partially sorted keys amongst the participating processors. At the end of the b/r passes the integer keys have been redistributed to the processors such that all of the keys residing on Processor k are in sorted order and all of these keys are less than or equal to all of the keys residing on Processor k+1.
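For reference, a single pass of this kind of sort can be sketched sequentially as follows (a hedged illustration, not the benchmark code; the parallel version additionally merges the per-processor histograms into a global histogram and redistributes keys across the cluster):

class RadixPass {
    // One pass of a straight (least-significant-digit-first) radix sort:
    // keys are reordered stably on the r-bit digit selected by 'shift'.
    static int[] pass(int[] keys, int r, int shift) {
        int buckets = 1 << r;
        int mask = buckets - 1;
        int[] count = new int[buckets];
        for (int i = 0; i < keys.length; i++) {        // histogram of this digit
            count[(keys[i] >>> shift) & mask]++;
        }
        int[] start = new int[buckets];                // prefix sums give offsets
        for (int d = 1; d < buckets; d++) {
            start[d] = start[d - 1] + count[d - 1];
        }
        int[] out = new int[keys.length];              // stable redistribution
        for (int i = 0; i < keys.length; i++) {
            out[start[(keys[i] >>> shift) & mask]++] = keys[i];
        }
        return out;
    }
}

The full sort applies this routine b/r times, with shift values of 0, r, 2r, and so on.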

The implementation of this benchmark places a great deal of stress on the communications facilities of a parallel-programming environment because there is both a lot of broadcasting and a lot of irregular communication (communication patterns that can't be known a priori).

3.5.2 Split-C

The radix-sort benchmark is interesting because it was taken from the Split-C benchmarks. In some ways Split-C resembles a distributed shared-memory system (like Hyperion) and in other ways it resembles a message-passing system (like MPI). For example, Split-C comes with a distributed shared memory subsystem (like Hyperion). However, a programmer must explicitly denote a data structure as being distributed across all of the processors (a "spread array", in Split-C terminology), and must use special operators to access this type of memory (somewhat like MPI³). See [3] for a complete description of Split-C.

Porting a benchmark written in Split-C to Java and C/MPI entailed converting every use of a Split-C feature to a corresponding Java and C/MPI feature. These conversions are presented in this section.

Figure 3-12 presents an example of a "spread" array in Split-C. PROCS is a language

³ Actually, this might also remind the reader of C∗.


#define NP_MAX (1024*1024)

unsigned g_keys0[PROCS]::[NP_MAX];

Figure 3-12: Example Declaration of a "spread" Array in Split-C.

feature of Split-C — this is an integer that is set to the total number of processors running the program. This spread-array declaration declares an array of unsigned ints that is spread across all of the processors, with each processor in possession of a local segment of this array that is NP_MAX in size.

The obvious way to port code like this over to Java and C/MPI is to simply declare arrays of size NP_MAX on all of the participating processors and then convert any code that utilizes these arrays over to use whatever communication constructs are available.
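A minimal sketch of this porting pattern on the Java side (hypothetical class and field names; this is not the actual port): each thread allocates its own NP_MAX-sized segment inside run(), so that under Hyperion's ownership rule the segment is local to the processor that uses it, while a shared table of references lets other threads locate remote segments when communication is required.

public class KeysHolder implements Runnable {
    static final int NP_MAX = 1024 * 1024;
    static int[][] segmentOf;          // indexed by thread id (hypothetical)

    final int whoami;
    KeysHolder(int whoami) { this.whoami = whoami; }

    public void run() {
        int[] keys = new int[NP_MAX];  // allocated by this thread: local memory
        segmentOf[whoami] = keys;      // publish so other threads can find it
        // ... the sorting passes operate mostly on 'keys' ...
    }
}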

Split-C offers two special operators: := (split-phase assignment) and :- (signaling-store). The split-phase assignment operation allows the programmer to signal to the Split-C runtime that an assignment needs to be performed, but that the confirmation that this operation is complete is not needed (yet). The program can call sync() to confirm that all of the split-phase assignments have been completed. By utilizing this operation, a programmer can overlap communication with computation.

The signaling-store operation causes the Split-C runtime to store a value on a remote processor and then signal the remote processor that the store operation has been performed. A processor may wait for these notification messages by calling splitc_poll().

Both of these operations allow the programmer to control computation and communication in a more precise (and possibly more efficient) manner. Java doesn't directly allow this type of control, but strategic calls to synchronized blocks might allow a programmer to get this type of control. There was some opportunity in the radix sort port to experiment with


this in Hyperion, but due to other issues in the code another technique was used to express this computation/communication pattern. MPI does allow for this type of functionality through its extensive set of non-blocking operations. However, only blocking operations were used for the porting involved in this part of this thesis.

3.5.3 Java Implementation

The Java implementation of this benchmark was from the start structured in a way that was (for the most part) amenable to Hyperion's DSM characteristics. This was a fairly natural process. For example, data structures that were contained in "spread arrays" in the original Split-C code were natural candidates for distributing to each processor in the ported Java code. All of the special Split-C operators that were used on the original parallel data structures were ported in a fairly straightforward manner as well. It was actually handy to have clear indications in the code where the porting work needed to be performed. Hyperion (and other systems like it) offers a DSM subsystem, but non-local access is always expensive. Sometimes this non-local access isn't easy to spot. Split-C's method of handling memory does appear to have some merit — non-local access is allowed, but it needs to be declared explicitly.

3.5.4 C/MPI Implementation

The C/MPI implementation of this benchmark is similar to the Java version — "spread" data structures have been replaced by a distributed set of data structures that are local to each processor. The specialized Split-C communications operations that existed in the original Split-C code have been replaced with MPI communications operations.


Figure 3-13: Average Runtimes for the Radix Sort Benchmark (512 × 65535 = 33,553,920 keys).

3.5.5 Results and Performance

The results presented in this section were obtained from benchmark runs in which 33,553,920 (512 × 65535) random 32-bit keys were sorted. The radix that was used for these results was 8 bits in size. Initially a 16-bit radix was desired, but it was determined that the Hyperion system had insufficient memory for this scenario. Configuring this benchmark to use an 8-bit radix (as opposed to a 16-bit radix) doubles the number of passes that this benchmark needs to perform in order for the keys to become sorted. More passes translates to more network communication.

Currently, the Java version of this benchmark doesn't run correctly on the Star Cluster, due to a bug in Hyperion. This implementation does run correctly when run with other JVMs (for example, JVMs produced by Sun Microsystems and IBM), even in multi-CPU environments. A reproducible test case that exhibits this bug has been developed. The Java version of this benchmark remains unusable because the author of this thesis does not


Figure 3-14: Measured Speedup for the Radix Sort Benchmark (512 × 65535 = 33,553,920 keys).

possess a detailed understanding of the internals of Hyperion and PM2.

The C/MPI version of this benchmark runs correctly and displays some degree of speedup. Results from this benchmark are displayed in Figure 3-14. These results do appear to be similar to those reported in [3]. The developers of the original Split-C radix sort never presented their performance numbers in terms of speedup; instead, they presented their data in terms of execution time per key. The C/MPI version of this benchmark generally matches the performance reported in [3] — for larger problems, as the number of processors increases, the execution time per key decreases. This is shown in Figure 3-15.

3.6 Minimal Cost Graph Coloring

Given an arbitrary connected graph consisting of nodes with arbitrary sizes, and given a set of colors that have associated with them a set of given costs, this benchmark attempts to determine the minimal cost coloring of the graph. Or, less formally, suppose we have a map


Figure 3-15: Sort Time per Key for the Radix Sort Benchmark (512 × 65535 = 33,553,920 keys).

of the contiguous 48 United States that we wish to paint. Suppose white paint costs $10 a can, green paint costs $20 a can, yellow paint costs $30 a can, and purple paint costs $40 a can. Obviously, two adjacent states should not be painted the same color. We wish to paint the map as cheaply as possible. So, we might hope to use some color other than purple to paint the state of Texas. The minimal graph coloring problem in this case is the problem of how to paint the map as cheaply as possible. This benchmark is a type of branch-and-bound problem, where the processors search through a combinatoric state-space in search of an optimal answer to a problem.

Both implementations of this benchmark presented here draw upon several sources:

• Both implementations of this benchmark have as their direct ancestor Professor Philip Hatcher's minimal cost graph coloring program.

• The original serial program was parallelized using Quinn's paper [15] as a guide. In particular, "asynchronous algorithm 4" was used for these implementations. In this


approach to solving the minimal graph coloring problem, there is no centralized queue to store partial solutions to the minimal cost graph coloring problem, nor is there any synchronization between the processors while the optimal coloring is being searched for (and this is the sense in which this algorithm/implementation can be said to be "asynchronous"). Generally speaking, in this algorithm, as each processor generates sub-problems, it places these sub-problems in a local priority queue and then removes the second-best sub-problem from the priority queue and sends this problem to another processor in the cluster. While this algorithm has the potential to perform very efficiently in a parallel environment, the author of this thesis has concluded that in practice, programs that use this algorithm are more difficult to debug than typical parallel algorithms (because of the lack of synchronization). In addition, the asynchronous nature of this algorithm places a great deal of stress on the system runtime; in the course of developing this benchmark several bugs were found (and fixed) in the Hyperion runtime.

• Because there is no centralized priority queue in this implementation to store partial solutions to the minimal cost graph coloring problem, and because there is absolutely no synchronization between the processors involved in the search, "termination detection" became an issue. The solution to this problem is Dijkstra's elegant distributed termination algorithm [4]. By utilizing this algorithm, the processors involved in the search for the optimal graph coloring can efficiently and collectively conclude that the search for the optimal solution is complete. Dijkstra's algorithm is described in Figure 3-16.

• Hatcher's original minimal cost graph coloring implementation was entirely based on


1. When thread 0 becomes idle it initiates the termination detection algorithm.

2. Each thread maintains a counter count. In normal processing each thread can send messages to other threads in order to give those threads additional work to be done. For every message that a thread sends to some other thread, count is incremented by one. For every message that a thread receives from some other thread, count is decremented by one. Counter count is initialized to zero. Every thread also maintains a variable called color; this variable can take on the value black or white. All threads start with a color value of white.

3. When thread 0 initiates the termination detection algorithm, it transmits a "termination token" to thread N−1. This token has its own count and color values, which are initialized with thread 0's corresponding values.

4. When a thread receives a message from another thread (thus giving the thread more work to do), the thread's color transitions to black. When a thread becomes idle again, the thread's color transitions to white.

5. Upon receipt of the termination token, if a thread's color value is black, the color of the token is changed to black; otherwise the color of the token is left unchanged. Each thread keeps the termination token until it becomes idle. Before thread i transmits the termination token to thread i−1 it adds its count value to the termination token's count value. After a thread forwards the termination token this thread's color value becomes white.

6. Upon receipt of the termination token, thread 0 can conclude that all of the threads are idle if thread 0 is still idle, the termination token's color is white, and the termination token's count is zero. Otherwise, at least one of the threads in the system is still active and the termination detection algorithm will need to be run again.

Figure 3-16: Description of Dijkstra's Distributed Termination Detection Algorithm.


breadth-first search. For larger problems, this sole reliance on breadth-first search did not scale, because the queues that hold partial solutions to the coloring problem grew exponentially large. These benchmarks were augmented with a depth-first search at Professor Hatcher's suggestion. This depth-first search gets invoked in the following two situations:

1. if a given partial solution is more than 60% solved

2. if the local queue that holds partial solutions is more than 70% full.

Adjusting these two parameters might lead to enhanced performance. This could be the subject of further study.
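Returning to the termination-detection issue above, the per-thread bookkeeping that Figure 3-16 describes can be sketched as follows (a hedged sketch with hypothetical names; message transport and the actual passing of the token between threads are assumed to be handled elsewhere):

class TerminationState {
    static final int WHITE = 0, BLACK = 1;
    int count = 0;        // messages sent minus messages received (step 2)
    int color = WHITE;

    void onWorkMessageSent()     { count++; }
    void onWorkMessageReceived() { count--; color = BLACK; }   // step 4

    // Called when this idle thread holds the token and forwards it on (step 5).
    // token[0] is the token's count, token[1] is the token's color.
    void forwardToken(int[] token) {
        if (color == BLACK) token[1] = BLACK;
        token[0] += count;
        color = WHITE;
    }

    // Thread 0 applies this test when the token comes back around (step 6).
    static boolean allIdle(boolean thread0Idle, int[] token) {
        return thread0Idle && token[1] == WHITE && token[0] == 0;
    }
}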

3.6.1 Java Implementation

The Java implementation of this benchmark resembles a message-passing program. It uses java.util.Vector as a message-passing mechanism — when one thread wants to pass a message to another thread, it simply constructs this message and places it on the receiving thread's message Vector. The minimal cost graph coloring code itself doesn't perform any synchronization in this message-passing process — if any synchronization is needed, it is up to the Vector class to provide it.
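A minimal sketch of this message-passing arrangement (hypothetical names; the benchmark's actual message objects and the deep-copying discussed below are omitted):

import java.util.Vector;

class Mailbox {
    private final Vector inbox = new Vector();

    void post(Object message) { inbox.addElement(message); }  // Vector is synchronized

    Object poll() {                      // returns null if nothing is waiting
        synchronized (inbox) {           // Vector locks on itself, so this
            if (inbox.isEmpty()) {       // check-then-remove pair is atomic
                return null;
            }
            Object m = inbox.elementAt(0);
            inbox.removeElementAt(0);
            return m;
        }
    }
}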

Two issues were encountered in the development of the Hyperion version of this benchmark that significantly complicated its implementation: the problem of efficiently passing a complicated data structure from one processor's local memory to another, and Hyperion's lack of a garbage collector. In the former case, care had to be taken to ensure that data passed (by reference, not by value⁴) to another processor's Vector object was deep-copied

⁴ Objects are passed to methods by reference in the Java programming language.


upon removal from the Vector (via java.lang.Object.clone()). In the latter case, because minimal cost graph coloring is a dynamic problem and memory has to be allocated constantly to hold partial solutions to problems, a small memory-management system had to be developed and deployed in order for this benchmark to be usable with larger problems. This memory management system essentially consists of pre-allocating a pool of the structures that represent partial solutions in this implementation and then ensuring that all of these structures are allocated and deallocated from this pool instead of the general heap. Because each thread allocates its own pool and never uses structures from another thread's pool (except to copy a structure), this scheme is efficient in the Hyperion environment.
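The pooling scheme can be sketched as follows (hypothetical names; PartialSolution stands in for the benchmark's partial-solution structure):

class PartialSolution { /* partial coloring state (hypothetical) */ }

class SolutionPool {
    private final PartialSolution[] free;
    private int top;

    // The owning thread preallocates every structure it will ever need,
    // so later "allocation" never touches the general heap (Hyperion has
    // no garbage collector).
    SolutionPool(int capacity) {
        free = new PartialSolution[capacity];
        for (int i = 0; i < capacity; i++) {
            free[i] = new PartialSolution();   // allocated by this thread
        }
        top = capacity;
    }

    PartialSolution allocate() {
        if (top == 0) {
            throw new RuntimeException("pool exhausted");
        }
        return free[--top];
    }

    void release(PartialSolution s) { free[top++] = s; }
}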

3.6.2 C/MPI Implementation

The C/MPI implementation of this program uses messages to pass state information around as the program is running. The partial solution object that this benchmark passes around is itself a complex data structure, but MPI has support for declaring and passing around such objects. This is a useful feature which might not be available in other message-passing libraries. This program is very similar to the Hyperion implementation — partial solutions and best-so-far solutions are propagated via messages in an asynchronous manner.

3.6.3 Results and Performance

Both implementations of this benchmark have been greatly improved from their initial implementations. Older versions of these benchmarks could only work with very small problems. The addition of a depth-first search to these benchmarks has made it possible for these benchmarks to work with very large problems. The introduction of memory-management code to the Hyperion version of this benchmark helps with this as well.


Figure 3-17: Average Runtimes for the Minimal Cost Graph Coloring Benchmark (Graph: 35 Contiguous States in the Continental United States, Starting in Maine).

Figure 3-18: Measured Speedup for the Minimal Cost Graph Coloring Benchmark (Graph: 35 Contiguous States in the Continental United States, Starting in Maine).


The performance results of these benchmarks when run against a particular graph coloring problem are presented in Figure 3-18. It should be noted that this is a combinatoric search problem; therefore, superlinear speedup is possible. The C/MPI version of this benchmark displays speedup — some of it is even superlinear.

The Hyperion version of this benchmark clearly lags behind the C/MPI version. The reason for this is unclear, but we can make some educated guesses. Quinn's algorithm [15] does require that each processor send and receive a lot of messages. For example, every time a new best partial solution is found, this partial solution is broadcast to all processors. This leads to a lot of "message storms," and sending and receiving messages in Hyperion is known to be expensive. This benchmark works all of the processors particularly hard — each processor spins in a loop that attempts to solve partial solutions from the local priority queue while dealing with sending and receiving messages to and from other threads in the cluster. Sending, receiving, and even checking for the availability of incoming messages all require calling synchronized methods, and calling these operations in a loop certainly must be expensive. It seems likely that Quinn's algorithm, developed on a parallel computer with relatively slow CPUs and a relatively fast, low-latency network, does not translate well to the Hyperion system, with its fast CPUs but high-latency message-passing. In summary, Quinn's algorithm does not seem to translate well to the Hyperion environment — too much communication and not enough computation.

It is not impossible to write an efficient minimal cost graph coloring implementation for the Hyperion system: for example, Hatcher has an algorithm that creates a large number of worker threads (64 — more than the 16 physical processors available in the Star Cluster). In this algorithm, the best known solution is kept in a central location on Processor 0. Whenever a processor finds a solution that is better than the best currently known solution,


it updates the best known solution on Processor 0. Each processor keeps its own copy of the best known solution so far, but instead of sending messages every time a new best solution is found, each thread just updates its local copy of the best solution so far every 1000 state-space expansions. The resulting implementation undoubtedly generates far fewer messages than the implementation of Quinn's algorithm presented here. Hatcher's Hyperion implementation is competitive with the C/MPI version of Quinn's algorithm presented here.


Chapter 4

Hyperion Assessment

4.1 Lessons learned from porting code to Hyperion

The previous chapter of this thesis presents benchmarks and results that illustrate the performance of the Hyperion system. The aim of this chapter is to further illustrate the issues that a programmer might encounter when porting their parallel Java code to the Hyperion system. In addition, some recommendations are presented for implementors who wish to enhance Hyperion, or for those who wish to implement their own DSM subsystem.

4.1.1 Memory Management

The overwhelming majority of issues that a programmer will encounter when porting their code to the Hyperion system are related to memory management. These issues are discussed here.

4.1.1.1 Ownership

In terms of lessons learned with respect to memory management issues, one lesson was reinforced in all of the benchmarks discussed in this thesis: in programming there are several different notions of the concept of "ownership," and Hyperion's DSM rules add yet another


slightly orthogonal notion to this collection. For example, in the context of programming, if object A "owns" object B, this might imply any of the following:

• Object B lies within the memory segment that as a whole comprises Object A.

• Object A has a pointer/reference to Object B and wields absolute control over this object. Object A doesn't let any other object in the system access Object B except under very controlled situations.

• Object A has a pointer/reference to Object B and has agreed to deallocate/destroy Object B when it is no longer needed in the program.

• (Hyperion DSM definition) Object A is a Java thread and Object B was created by this thread. Object B is contained within the memory associated with the processor upon which Object (thread) A is running. Other threads running on other processors in the cluster won't have local and efficient access to Object B.

All of these notions are very similar, but each has its differences as well. The last notion of ownership will be new to programmers who are approaching a DSM system such as Hyperion's for the first time. A programmer can end up with an inefficient parallel program if he or she doesn't keep these concepts clearly in mind when porting Java code to Hyperion. The particular details of how Hyperion implements memory ownership might also trip up the programmer. These issues are examined in detail in the next few pages.

To illustrate how a programmer might end up with an inefficient Hyperion program, please refer to Figure 4-1. The code presented in this figure is a very reasonable approach to coding a multithreaded application in Java. In this example, the programmer has defined a class that implements Java's Runnable interface. Each instance of ParCompSolver is


 1  public class ParCompSolver implements Runnable
 2  {
 3      static final int NUM_THREADS = 16;
 4
 5      static ParCompSolver pcsArr[];              // !!!
 6      static Thread threadArr[];
 7
 8      int m_whoami, m_x[];
 9      int m_y[] = new int[20];                    // !!!
10      SomeClass m_sc;
11
12
13      ParCompSolver(int whoami)
14      {
15          m_whoami = whoami;
16          m_x = new int[1024];                    // !!!
17          m_sc = new SomeClass();                 // !!!
18      }
19
20      public void run()
21      {
22          m_x[0] = 6;                             // !!!
23
24          do_parallel_computation();
25      }
26
27      public static void main(String argv[])
28      {
29          pcsArr = new ParCompSolver[NUM_THREADS];
30          threadArr = new Thread[NUM_THREADS];
31
32          for (int i = 1; i < NUM_THREADS; i++) {
33              pcsArr[i] = new ParCompSolver(i);
34              threadArr[i] = new Thread(pcsArr[i]);
35          }
36          // ... main() then starts the worker threads and waits for
37          //     them to finish ...
38      }
39  }

Figure 4-1: Example of Clash Between Object-Oriented Style and Hyperion's DSM.


associated with a newly created Java thread, and each instance of ParCompSolver has a number of member variables associated with it that it uses during its computation.

The example program presented in Figure 4-1 suffers from a number of memory problems when run in the Hyperion system. First, it must be understood that the main() method that this class defines runs in the main thread, which runs on Processor 0. When the main() routine runs, it starts by creating other ParCompSolver objects (on line 33), and then by creating threads to go along with these objects. However, the creation of ParCompSolver objects is exactly where this class starts to run into trouble with Hyperion. Recall that in Hyperion threads "own" the memory that they allocate. This implies that the newly created threads, when they later access their member variables, are actually accessing memory allocated by Processor 0. This inefficiency is pictorially illustrated in Figure 4-2. Since this memory is generally non-local¹, any access to the object's own members will be slow. This example helps illustrate how confusion over two different notions of "ownership" can lead to inefficiency in a parallel program.

Hyperion Programming Rule #1 The fact that a class implements Java's Runnable interface does not imply that members of instances of this class can be accessed in a Hyperion-efficient manner. In Hyperion, memory allocated by one thread is local to the processor on which that thread is running. This includes memory allocated for the purpose of containing instances of classes that implement the Runnable interface, or instances of classes that extend the Thread class. Hyperion programmers must ensure that their code accesses memory in an efficient manner.

¹ In the Hyperion system it is possible to create more threads than there are processors in the cluster. In this case, some of the processors in the cluster will be running more than one thread at runtime. Two threads running on the same physical processor can access each other's heap memory without incurring network communication, and this access is very efficient. None of the benchmarks in this thesis use this feature. This subject is also discussed in Section 3.6.3.


Figure 4-2: Illustration of an Inefficient Hyperion Memory Use Pattern (1).

Another example of a memory problem that Java programmers encounter with Hyperion is presented on line 9 of Figure 4-1. Java allows programmers to specify how class members are initialized at the place in which these variables are declared (and typographically outside of the class constructor(s)). In the context of Hyperion, such an initialization is equivalent to placing the equivalent code into the object's constructor. Because such code leads to exactly the same situation that was discussed in the previous two paragraphs, the use of this feature in Hyperion can cause the same sort of inefficiency as described there.

Hyperion Programming Rule #2 Java programmers cannot use variable initializers that allocate memory for class variables in classes that extend Java's Thread class or implement Java's Runnable interface. Instead, programmers must find some other way to allocate memory in a Hyperion-efficient manner.
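One hedged way to satisfy this rule is to move the allocation into the thread's own run() method, as in the following sketch (hypothetical names). Note that the enclosing object, and hence the reference itself, may still live on Processor 0; that separate problem is taken up later in this chapter.

public class Worker implements Runnable {
    int[] m_x;                   // deliberately left unallocated here

    public void run() {
        m_x = new int[1024];     // allocated by this thread: local memory
        m_x[0] = 6;              // fast, local access
        // ... parallel computation using m_x ...
    }
}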

Methods that allow the programmer to continue to use Java/Hyperion in both an object-oriented and efficient manner are discussed later in this chapter.


4.1.1.2 Object Caching

The code contained in Figure 4-1 has some other qualities that inhibit the program from running optimally in the Hyperion environment. On line 5 a static array of ParCompSolver object references is declared. An array like this allows all of the ParCompSolver objects participating in the parallel computation to reference each other. For example, if one ParCompSolver object wanted to send data to another ParCompSolver object, this array would be the starting point for one object to find the other object. The problem with a static array like this in the Hyperion environment lies in the fact that the memory for such an array resides on a single processor.² During runtime, a static array like this is likely to get used often, and frequent access to such an array can result in a bottleneck. In fact, frequent access to any static variable can be a bottleneck. This results in a few more guidelines for optimizing programs that are ported to Hyperion:

Hyperion Programming Rule #3 If possible, all static variables should also be final. This effectively makes these variables constant, and the compiler will optimize the access to these values.

Hyperion Programming Rule #4 If it is impossible to comply with the preceding rule, then, if possible, a local cache of the static variables should be established by each thread. The purpose of these cached variables is to ensure efficient access to these variables. This technique is illustrated in Figure 4-3.

² Static variables get distributed in a round-robin order on processors in the cluster.


 1  public class ParCompSolver implements Runnable
 2  {
 3      static int NUM_THREADS = 16;
 4      static ParCompSolver pcsArr[];
 5      static Thread threadArr[];
 6      static Barrier barrier;
 7
 8      int local_NUM_THREADS;
 9      ParCompSolver local_pcsArr[];
10
11      ParCompSolver(int whoami) { ... }
12
13      public void run()
14      {
15          local_NUM_THREADS = NUM_THREADS;    // cache
16          local_pcsArr = new ParCompSolver[local_NUM_THREADS];
17
18          // needed to prevent race condition -- don't access
19          // pcsArr[] before main thread is done creating all threads
20          barrier.barrier();
21
22          for (int i = 0; i < local_NUM_THREADS; i++)
23              local_pcsArr[i] = pcsArr[i];    // cache the object references
24          ...
25      }
26
27      // ... a main() much like the one in Figure 4-1 follows, in which
28      // each ParCompSolver object (line 35) and its companion Thread
29      // (line 36) are created ...
30  }

Figure 4-3: Example of Caching Objects Residing on Other Processors.


4.1.1.3 Problems with Object References

The program presented in Figure 4-3 is not completely optimized. On line 6 a static Barrier class is declared. The reference to this object (not the object itself) is not locally cached, which means that every time a given thread wishes to access this Barrier object, it will first have to obtain the actual reference to this object, and then use this reference to navigate to the actual object. For most of the threads in the cluster, both the object reference and the object itself will exist on a remote processor. If this object reference had been cached, then each thread would have been able to access the object reference in its own local memory (which is efficient) and then navigate directly to the underlying static object. Developing an object-reference cache for static objects isn't likely to be a huge optimization, but sections of time-critical code that frequently access static variables in this manner might benefit from this.
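As a hedged sketch of this kind of caching (the Barrier and NUM_THREADS names are borrowed from Figure 4-3; the loop itself is hypothetical, and the Barrier class is assumed to provide a barrier() method as in that figure), both a static scalar and the static object reference are read once into locals before a time-critical loop:

public class CachingWorker implements Runnable {
    static int NUM_THREADS = 16;
    static Barrier barrier;

    public void run() {
        int localN = NUM_THREADS;          // cache the static scalar once
        Barrier localBarrier = barrier;    // cache the static reference once
        for (int iter = 0; iter < localN; iter++) {
            // ... one step of the computation ...
            localBarrier.barrier();        // no per-iteration static lookups
        }
    }
}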

The larger point in the preceding example is that Hyperion and its DSM subsystem will silently allow programs to access memory inefficiently. Currently, the best way to detect subtle problems like this is through auditing the code and by turning on a feature of Hyperion called "instrumentation." Instrumentation output can give a programmer a high-level overview of how the running program is using local and non-local memory, as well as the program's use of locks. With some detective work, the programmer can often find the source of the bottleneck by examining the instrumentation output. For example, a programmer who is looking through the instrumentation output might notice that for every access to a static variable, two remote memory reads are performed. This might be a hint to the programmer that a subtle problem like what was described in the preceding paragraph is present in the code. Languages like Split-C and environments like MPI have some protections that help prevent this sort of inadvertent access — in Split-C a programmer


must explicitly mark a variable as being "global" and in MPI there is no distributed sharing. Java does have one very important feature that can be exploited to minimize this type of inadvertent sharing — Java is an object-oriented language. Recall that in object-oriented languages, access to each object's members can be controlled rather precisely. This means that if the programmer is programming using an object-oriented style, it will never be necessary for one object to have deep references into another object in order to manipulate that object — the object can be forced into using the other object's methods (class functions). In particular, it is easy to look at a section of code that reads:

local_pcsArr[whoami+1].someMethod(x,y,z);

...and conclude that some sort of network communication is taking place. On the other hand, suppose for the sake of argument that in the following statement, x[] resides on a different processor than the one that is executing the statement:

x[28]=496;

Such a statement would be much more difficult for a programmer to spot by merely reading the code.

Hyperion Programming Rule #5 The fact that Java is an object-oriented language can be exploited to help tune programs to run more efficiently in Hyperion. Java programmers should use the object-oriented features of the language to disallow classes from having a deep knowledge of the internal structure of other classes. Structuring code in this manner will aid Hyperion programmers in the event that they need to closely examine how and where network communications are taking place.

While working on the benchmarks for this thesis, this author proposed that the Hyperion runtime should be changed such that any memory allocated for the purpose of containing


anobjectthatimplement’sJava’sRunnableinterface(oranyobjectownedbythisobject)shouldbeallocatedsuchthatitisassociatedwiththeThreadthatisassociatedwiththisobject.ThiscouldbeaveryusefulfeaturetoaddtoHyperion.Unfortunately,itappearsthatitwouldbedifficulttoimplement,sinceisisverylikelythatthethreadthatwilleventuallyrunthisRunnableclasswon’thavebeencreatedyet.Foranillustrationofthis,notehowinFigure4-3,online35aRunnableobjectParCompSolveriscreated(andhence,memoryisallocated).Subsequentlyonline36theThreadthatexecutesthisobject’srun()methodiscreated.AnybodywhowishedtotakeituponthemselvestoaddthisproposedusefulfeaturetoHyperionwouldhavetocontendwithahard-to-solveissuelikethis.

ThecodeinFigure4-3canbeoptimizedevenfurtherfortheHyperionenvironment.Online16theprogrammerhasallocatedmemoryfromtheobject’srun()method.Sincethisisthefirstmethodthateachnewthreadruns,anymemoryallocatedinthiscontextisallocatedinthisthread’slocalmemorysegment.Sinceitistheintentoftheprogrammertousethismemoryentirelylocallytoeachthread,thismemoryisallocatedinanefficientmanner.However,thesubtleprobleminthiscodeisthatalthoughthememorypointedtobythereferencelocal_pcsArrisallocatedinaHyperion-efficientmanner,thereferenceitself(local_pcsArr)isnot.ThisinefficiencyispictoriallyillustratedinFigure4-4.Infact,thememorythatcontainsthereferencelocal_pcsArrisactuallycontainedinasingleprocessorsomewhereinthecluster(recallHyperion’scardinalruleofmemorymanagement:memoryisownedbythethreadthatallocatesthatmemory).Thisimpliesthataccesstothereferencelocal_pcsArrissubjecttoallofthepropertiesofHyperion’sDSMsubsystem:thismemoryislikelytobenon-local,cachingofthismemoryisperformedbutsometimesthecachegetsdirty,etc.Dependingontheprogram’scircumstances,thismightbeabottleneckintheprogram’sperformance.


Figure 4-4: Illustration of an Inefficient Hyperion Memory Use Pattern (2).

Solutions to the potential memory bottleneck problem presented by the preceding paragraph are complicated by a number of factors. First, any program from which adequate parallel performance is sought will likely be comprised of more than one subroutine. Second, Java is an object-oriented language and programmers are likely to use it in this manner — the programmer is apt to create logical classes and methods as abstractions with which to manipulate their data. The third complicating factor is the fact that there are no pointers to simple variables (those with types int, float, double, etc.) in Java.

Consider the code presented in Figure 4-5: in this example, the programmer has started the process of trying to restructure some existing code in order to eliminate the bottleneck that the program experiences as it accesses var1. The programmer has modified the ParCompSolver class to no longer have any class members. Now that the members of this class are allocated on the thread's stack, there is no possibility of a bottleneck on Processor 0. Unfortunately, this technique does not work in the general case, because the design of Java does not allow the methods computation_part1() and computation_part2() to modify var1. In many Real World implementations of parallel algorithms, having access to such side-effects is required. Figure 4-6 presents a possible strategy for addressing this problem;


this entails turning ordinary scalar variables into single-element arrays and passing every "member" variable into every method defined by the class. While this method might work for a simple class with few members and methods, it should be obvious that such a scheme is error-prone, tedious, and hard to maintain.

The solution that the author of this thesis implemented in order to address this particular object reference problem can be described as follows: a new class called ParallelHarness was created (see Figure 4-7 for an overview of this class). The sole purpose of this class is to create threads, which in turn create objects like ParCompSolver. Using the ParallelHarness class entirely solves this particular object reference problem, because each reference and the object that the reference points to (such as ParCompSolver) is properly allocated by the ParallelHarness class on the stack of the thread in which each object runs. Classes like ParCompSolver no longer have to be burdened with the code that actually creates threads. In addition, very little modification is required to objects like ParCompSolver, certainly nothing like the scheme outlined in Figure 4-6. The Hyperion-efficient memory layout that the ParallelHarness class makes available in a convenient manner is illustrated pictorially in Figure 4-8.

Hyperion Programming Rule #6 While Java makes it possible for a class like ParCompSolver to define a static main() routine and for ParCompSolver objects to run as threads, a level of indirection in the style of ParallelHarness can provide similar efficiency for a small amount of effort. The ParallelHarness class creates threads and then allocates memory for "worker" objects (classes like ParCompSolver) in a Hyperion-efficient manner. The programmer doesn't have to modify all of their "worker" classes to allocate their own memory on their thread stacks and local heap.


public class ParCompSolver implements Runnable
{
    // no members of this class, since references to these
    // variables will get allocated/bottlenecked at
    // Processor 0

    public void run()
    {
        // var1 is used for our parallel computation --
        // routines like computation_part1 and
        // computation_part2 use and manipulate this variable
        // during the computation process.
        // var1 and workArray are allocated on the stack in order
        // to prevent a bottleneck from this variable being
        // allocated in Processor 0
        int var1;
        float workArray[][] = new float[...][...];

        // Unfortunately, the following two calls to
        // computation_part# don't work because there is no
        // way in Java for these routines to modify var1!
        computation_part1(var1, workArray);    // side-effects!
        computation_part2(var1, workArray);    // side-effects!
    }

    void computation_part1(int v, float w[][])
    {
        ...
        v = 1;
        // Note: since workArray is no longer a member of
        // ParCompSolver there is no way to access it here
        // unless it is passed in as an argument.
        ...
    }

    void computation_part2(int v, float w[][])
    {
        ...
        // similar body to computation_part1()
        ...
    }
    ...
}

Figure 4-5: A Start at Moving a Thread's Variables to the Stack.


public class ParCompSolver implements Runnable
{
    public void run()
    {
        int var1[] = new int[1];    // used to be just 'int var1'
        float workArray[][] = new float[...][...];

        computation_part1(var1, workArray);    // side-effects!
        computation_part2(var1, workArray);    // side-effects!
    }

    void computation_part1(int v[], float w[][])
    {
        ...
        v[0] = 1;
        ...
    }

    void computation_part2(int v[], float w[][])
    {
        ...
        // similar body to computation_part1()
        ...
    }
    ...
}

Figure 4-6: A Sub-Optimal Way To Move Variables to a Thread's Stack.


public class ParallelHarness implements Runnable
{
    static int numProcessors;
    static Thread threadArray[];
    static ParallelHarness phArr[];
    int whoami;

    public static void main(String argv[])
    {
        numProcessors = hyperion.Runtime.numberOfNodes();
        // this allows ParCompSolver to initialize
        // any required static structures
        ParCompSolver.static_ctor(numProcessors);
        threadArray = new Thread[numProcessors];
        phArr = new ParallelHarness[numProcessors];
        for (int i = 0; i < numProcessors; i++) {
            phArr[i] = new ParallelHarness(i);
            if (i > 0)
                threadArray[i] = new Thread(phArr[i], "" + i);
        }
        for (int i = 1; i < numProcessors; i++)
            threadArray[i].start();
        Thread.currentThread().setName("" + 0);
        phArr[0].run();    // the main thread participates here too
        for (int i = 1; i < numProcessors; i++) {
            try { threadArray[i].join(); }
            catch (InterruptedException e) {}
        }
    }

    public ParallelHarness(int w) { whoami = w; }

    public void run()
    {
        // notice that pcs and the object that pcs points to are
        // allocated on this thread's stack.
        // This is Hyperion-efficient!
        ParCompSolver pcs = new ParCompSolver(whoami,
                                              numProcessors);
        pcs.run();
    }
}

Figure 4-7: A Simplified ParallelHarness Class.


Figure 4-8: Illustration of an Efficient Hyperion Memory Use Pattern.

4.1.1.4 Other Memory Management Issues

AtleastsomeofJava’spopularityisduetotherich,well-designedclasslibrariesthatnearlyalwaysaccompanythebaselanguage.Forexample,programmersappreciatethefactthattheydon’thavetoprovidetheirownHashtableclass—thestandardclasslibraryalreadyprovidesagenericone.Thiscangreatlyspeeddevelopment.

ParallelprogrammerswhoareimplementingtheircodeinJava(withthegoalofrunningtheircodeviaHyperion)mightbeinclinedtousethestandardclasslibrariesthataccompanyJava.Forexample,theauthorofthisthesisinitiallymadeuseofthejava.lang.util.Vectorclassinordertopassobjectsfromonethreadtoanother.Thiswasconvenient,becauseallofthestandardclasslibrariesthatcomewithJavaarethread-safe.Therewasnoneedtocobbletogetherathread-safeobject-passingmechanismbecausethesystemalreadycamewithone,completewithextensivedocumentation.

Unfortunately,noneoftheJavaclasslibrarieswereoriginallyimplementedwithHype-rion’smemorypropertiesinmind.Worse,intypicalinstallationsthesourcecodetotheselibrariesisn’tevenavailable,soitisverydifficulttoauditthiscodeforHyperion-efficiency.


For example, vector classes are typically implemented internally as arrays. Java's Vector class is expandable, so this means that when the vector gets filled up, some special processing needs to occur. Typically, what happens when a vector fills up its internal array is that the vector allocates a new, larger array, copies all of the entries from the old array to the new array, and then discards the old array. Overall, vectors are efficient because they eliminate the need to preallocate enough memory to hold the largest number of objects that the program needs to store, there is no need for the overhead of a linked list, and new memory allocations occur only rarely. Unfortunately, because vector classes typically allocate memory at certain points in their lifetimes, this makes them unsuitable for use in the Hyperion environment. For example, if a thread running on Processor 3 inserts an object into a Vector that resides on Processor 2 and this action causes a memory allocation, the new memory will be allocated from the Processor 3 pool. Without warning, the Vector residing on Processor 2 will suddenly (and inefficiently) start using memory on Processor 3! It is not hard to envision that this type of behavior could soon cause a significant and hard-to-find source of inefficiency.

Hyperion Programming Rule #7 Java programmers who are porting their code to Hyperion must not use any class whose memory use and allocation patterns are not completely understood and in line with Hyperion's memory properties. This includes any class from the standard Java class libraries. Instead, only classes which use memory in a Hyperion-efficient manner should be used.
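One hedged alternative in the spirit of this rule is a fixed-capacity queue whose backing array is allocated once by the owning thread, so no hidden reallocation can ever occur on a remote processor (hypothetical names; this is a sketch, not code from the benchmarks):

class FixedMailbox {
    private final Object[] slots;     // allocated once, by the owning thread
    private int head = 0, count = 0;

    FixedMailbox(int capacity) { slots = new Object[capacity]; }

    synchronized boolean post(Object message) {
        if (count == slots.length) return false;         // full: caller retries
        slots[(head + count) % slots.length] = message;
        count++;
        return true;
    }

    synchronized Object poll() {                          // null if empty
        if (count == 0) return null;
        Object m = slots[head];
        slots[head] = null;
        head = (head + 1) % slots.length;
        count--;
        return m;
    }
}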

4.1.2 Other

In the course of this project the author spent a considerable amount of time debugging and optimizing code so that it would run efficiently on the UNH Star Cluster. At various points


in the development of the C/MPI benchmarks for this thesis, the following development tools were used:

• gdb/DDD — debugger and graphical debugger.

• gprof/Quantify — execution profiling tools.

• Purify/Valgrind/Mpatrol — memory debugging tools.

• ifconfig/tcpdump/Ethereal — network diagnostic tools.

Thesetoolsweren’tusedveryoften,butoccasionallythesetoolsprovedtobeextremelyhandy.

Incontrast,theonlydebugginganddiagnosticfacilitiesthatwereavailableforusewithHyperionwereutilizingprintstatements,“ifconfig,”andturningonafeatureofHyper-ioncalled“instrumentation.”Instrumentationoutputcanbepost-processedtotelltheprogrammerhowmanytimesnon-localmemorywasused,howmanytimeslockswereac-quired/released,andtheaverage/maximumtimesforeachofthesetooccur.Itiscertainlypossibletodebugandtuneprogramswiththesetools.However,itmightseemstrangethatthesearetheonlytoolsavailable,sinceitisknownthatHyperioncompilesJavacodeintoJavabytecodeintoCcodeintoexecutableprograms.IfthelonglistofaforementionedtoolsisavailableforC/MPI,whyarethesetoolsunavailableforuseintheHyperionenvironment?

TheanswertothisquestionhastodowiththefactthatHyperion’sruntimedoesn’tusethehost’snativethreadinglibraries(inthiscase,POSIXthreads).Instead,Hyperion’sruntimeusesacustomthreadlibraryfromthePM2package.Therationaleforthisisthat,intermsofperformance,thiscustomthreadinglibraryismoreefficientthanthenative(POSIX)threadinglibrary.ProgramslinkedagainstthecustomPM2threadinglibrarycannotbeeasilyusedinconjunctionwithanyoftheaforementioneddebugging/profiling


tools, since most of these tools depend upon the program's stack being arranged in a standard manner.

The author understands that PM2 can be configured to use the host system's native thread libraries, but that this configuration is only supported when the system as a whole is run on a single node — not a cluster environment. The author feels that updating PM2 to have the capability to use the host system's native threading library while operating in a cluster environment would be a very useful feature to add. This single addition to PM2 would go a long way towards making these useful debugging/profiling tools available for systems like Hyperion.


Chapter 5

Conclusions

The goal of this thesis is to evaluate the performance of Hyperion by comparing parallel Java programs run in the Hyperion environment to parallel C programs that utilize the MPI API. This thesis details the results of running five benchmark programs in each of these environments, and includes descriptions of what changes were made to each benchmark implementation in order to improve its performance. From all of these observations, we can make a number of conclusions.

In some cases, the performance of the benchmarks when run in each environment is comparable. This can be seen in both the Gaussian elimination and matrix multiplication benchmarks. In other cases the Hyperion benchmarks clearly lag behind their C/MPI counterparts. This can be seen in both the shallow-water and minimal cost graph coloring benchmarks.

Programmers who wish for their parallel Java programs to run efficiently in Hyperion must structure their programs very carefully. Both of the preceding two chapters discuss the many ways in which a Java program written outside of the Hyperion environment might have to be restructured in order to run efficiently under Hyperion. Java touts itself as being a "write once, run anywhere" language, but it seems safe to conclude that nearly any Real World parallel Java application would have to be restructured in some way if it is to be run


efficiently in the Hyperion environment.

Many research groups have asserted that writing parallel programs in a shared-memory environment is easier than writing programs in a message-passing environment [16] [8]. In this author's opinion this assertion has some merit: this author found dealing with all of the data dependencies in the message-passing version of the shallow-water code to be very complicated. However, this assertion seems to carry less weight in a DSM environment: in order to achieve reasonable performance, efficient DSM programs tend to start looking very similar to their message-passing counterparts. All of the Hyperion benchmarks discussed in this thesis have this property. In both message-passing and in DSM environments programmers have to pay close attention to how each processor uses local and non-local memory.

Hyperion was created for use on commodity clusters of computers. Some of the benchmarks discussed in this thesis suffered due to the relative slowness of the network as compared to the performance of the CPUs in the Star Cluster. The reader is reminded that in Reno's thesis [16] it was shown that the move to the Paraski Cluster with its dedicated 2 Gb/s Myrinet network had a large effect on the performance of some of the benchmarks discussed in that thesis. Some of the benchmarks discussed in this thesis, most notably the Hyperion version of the minimal cost graph coloring benchmark, very likely would have benefited from access to a higher-performance network. Such a test was never performed. However, when run on the same cluster hardware, the C/MPI versions of these benchmarks performed reasonably. What is the reason for this? This is a difficult question to answer, but one striking difference between these two systems is the fact that interprocessor communication in C/MPI simply involves placing bytes on TCP's reliable stream, whereas interprocessor communication in Hyperion frequently involves synchronized blocks, memory


consistency upkeep, and verification that data that was sent arrived at its destination safely. Surely this difference must be the underlying cause for a significant amount of the performance disparity between these two systems.

In the course of developing the benchmarks for this thesis, this author developed many different implementations of sendrecv(). The first implementation of this communications pattern was built using the native Java functionality of synchronized, wait(), and notifyAll(). Benchmark implementations that utilized this version of sendrecv() (shallow-water, matrix multiplication) suffered from extremely poor performance and generated a huge amount of network traffic. The performance of benchmarks that relied upon sendrecv() improved greatly after this communications operation was rewritten to utilize Hyperion-specific synchronization methods (e.g. Reduction.sum()). The author never was able to determine the root cause of the problems with the use of Hyperion's implementations of synchronized, wait(), and notifyAll() in these benchmarks. However, because one of the design goals of Hyperion is to efficiently support native Java applications, and because the author believes that the original sendrecv() implementation utilized these synchronization methods in a fairly common manner (described in [9], for example), the author feels it necessary to mention that some difficulties were encountered in this area. It is the hope of this author that future researchers can determine the root cause of the problem here.
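For illustration, the style of building block involved looks roughly like the following single-slot exchange, written with only synchronized, wait(), and notifyAll() (hypothetical names; the benchmarks' actual sendrecv() implementations are not reproduced here):

class ExchangeSlot {
    private Object item;
    private boolean full = false;

    synchronized void put(Object o) throws InterruptedException {
        while (full) wait();            // wait for the previous item to be taken
        item = o;
        full = true;
        notifyAll();
    }

    synchronized Object take() throws InterruptedException {
        while (!full) wait();           // wait for a sender
        Object o = item;
        item = null;
        full = false;
        notifyAll();
        return o;
    }
}

A sendrecv() between two partner threads can then be expressed as a put() into the partner's slot followed by a take() from one's own slot.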

Efficient DSM programs tend to evolve towards being similar in structure to message-passing programs. During this evolution one of the weaknesses of DSM systems becomes apparent: DSM systems tend not to have rich communications facilities for passing messages. This only makes sense, since DSM systems are not message-passing systems. In the process of optimizing the Hyperion benchmarks for this thesis, this author frequently had


the task of cobbling together a communications pattern for use in a DSM environment, a pattern which was already implemented in MPI. MPI supports a wide variety of blocking/nonblocking, point-to-point/collective-communication operations that operate on any datatype that the user wishes to use. The point here is not whether these operations are more efficient than anything that could be implemented in a "pure" DSM environment; rather, the point is that it is rather handy for a programmer that these routines are already implemented. This fact only becomes more apparent when there is call for some sort of "collective" communication operation. If an operation like those listed in Figure 2-4 were required in a Hyperion program, the operation itself would have to be designed and implemented in addition to the benchmark being optimized for Hyperion. Efficiently implementing some of these operations would entail a lot of work.

Hyperion uses the DSM-PM2 library for its thread and shared-memory functionality. In particular, DSM-PM2 eschews POSIX threads because of their perceived inefficiency with signal handling. Eliminating this inefficiency is certainly an admirable design goal; however, this also has a side-effect: the use of a non-standard thread package basically eliminates the possibility of using any standard debugging tool in conjunction with Hyperion. Writing parallel programs is difficult, and occasionally problems do crop up. It is not sufficient to be able to run debugging tools in a more feature-rich development environment (running on a single workstation) and then subsequently run the program in question via Hyperion — sometimes the availability of such tools on the parallel machine itself is critical. This author found that being able to use such tools in the C/MPI environment was, on occasion, extremely handy. It would be very useful for Hyperion/DSM-PM2 users to be able to use such tools. If the DSM-PM2 library were enhanced in order to be capable of being configured to use POSIX threads in a cluster environment, this would be a boon to


programmers who use these systems.

All of these points aside, Java is a deservedly popular programming language, and one of the conceivable uses for this language is parallel programming. For programmers who wish to use Java to run their parallel programs on clusters of commodity computers, Hyperion is a good JVM to use. It is a credit to the Java community, and especially to the creators of Hyperion, that this JVM's performance can rival that of much older, highly tuned message-passing systems.


Bibliography

[1] Gabriel Antoniu, Luc Bougé, Philip J. Hatcher, Mark MacBeth, Keith McGuigan, and Raymond Namyst. The Hyperion System: Compiling Multithreaded Java Bytecode for Distributed Execution. Parallel Computing, 27(10):1279–1297, 2001.

[2] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Introduction to Split-C. Technical report, University of California, Berkeley, 1993.

[3] D. Culler, A. Dusseau, R. Martin, and K. Schauser. Portability and Performance for Parallel Processing, chapter 4: Fast Parallel Sorting under LogP: from Theory to Practice, pages 71–98. John Wiley & Sons Ltd., 1994.

[4] Edsger W. Dijkstra, W. H. J. Feijen, and A. J. M. van Gasteren. Derivation of a termination detection algorithm for distributed computations. Inf. Proc. Letters, 16:217–219, 1983.

[5] J. Gosling, W. Joy, and G. Steele Jr. The Java Language Specification. Addison-Wesley, Reading, Massachusetts, 1996.

[6] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Technical report, Argonne National Laboratory, 1996.

[7] G. Hoffman, P. Swarztrauber, and R. Sweet. Aspects of using multiprocessors for meteorological modelling. In Proceedings, Workshop on Multiprocessing in Meteorological Modelling, pages 125–196, 1988.

[8] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the Winter 1994 USENIX Conference, pages 115–131, January 1994.

[9] Doug Lea. Concurrent Programming in Java – Design Principles and Patterns. Addison-Wesley, Reading, Massachusetts, 1996.

[10] Honghui Lu, Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. Message passing versus distributed shared memory on networks of workstations. 1995.

[11] Message Passing Interface Forum. Document for a standard message-passing interface. Technical report, Oak Ridge National Laboratory, 1993.

[12] R. Namyst and J. F. Mehaut. PM2: Parallel Multithreaded Machine: A computing environment for distributed architectures. In ParCo '95 (Parallel Computing), pages 279–285. Elsevier Science Publishers, September 1995.

[13] S. Oaks and H. Wong. Advanced Synchronization in Java Threads. Available via http://www.onjava.com/pub/a/onjava/excerpt/jthreads3_ch6/index1.html, 2004.

[14] M. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, 1994.

[15] Michael J. Quinn. Analysis and implementation of branch-and-bound algorithms on a hypercube multicomputer. IEEE Transactions on Computers, 39(3), March 1990.

[16] M. Reno. Comparing the Performance of Distributed Shared Memory and Message Passing Programs Using the Hyperion Java Virtual Machine on Clusters. Master's thesis, University of New Hampshire, 2003.

Appendix A

Benchmark Data

This appendix presents the data that was used in this document. All graphs cited in these tables use the "average" values for data points on the graphs.


Environment     Processors   Observed Runtimes              Average    Standard Deviation
C/MPI           1            167.120, 167.820, 167.440      167.460    0.350
                2            92.020, 91.620, 91.980         91.873     0.220
                4            53.570, 53.500, 53.630         53.567     0.065
                8            44.380, 44.270, 44.140         44.263     0.120
                16           78.290, 77.650, 78.100         78.013     0.329
Java/Hyperion   1            217.392, 215.549, 216.291      216.411    0.927
                2            119.411, 118.251, 119.623      119.095    0.739
                4            76.487, 77.176, 78.507         77.390     1.027
                8            77.851, 77.721, 77.139         77.570     0.379
                16           115.546, 114.980, 115.463      115.330    0.306

See Figure 3-2 and Figure 3-3 for corresponding graphs.

Table A.1: Data for the Gaussian Elimination Benchmark.

Environment     Processors   Observed Runtimes              Average    Standard Deviation
C/MPI           1            35.340, 35.190, 35.170         35.233     0.093
                4            16.810, 16.850, 16.760         16.807     0.045
                8            8.660, 8.690, 8.750            8.700      0.046
                16           4.950, 4.920, 5.130            5.000      0.114
Java/Hyperion   1            76.412, 76.212, 76.350         76.325     0.102
                4            26.910, 26.765, 26.664         26.780     0.124
                8            21.489, 22.229, 21.416         21.711     0.450
                16           351.615, 352.894, 351.321      351.943    0.836

See Figure 3-4 and Figure 3-5 for corresponding graphs.

Table A.2: Data for the Shallow-Water Benchmark.

Environment     Processors   Observed Runtimes              Average    Standard Deviation
C/MPI           1            56.800, 56.800, 56.790         56.797     0.006
                2            29.190, 29.200, 29.200         29.197     0.006
                4            15.560, 15.570, 15.570         15.567     0.006
                8            8.930, 8.760, 8.850            8.847      0.085
                16           5.710, 5.890, 5.700            5.767      0.107
Java/Hyperion   1            96.986, 98.864, 98.104         97.985     0.945
                2            62.449, 62.488, 62.570         62.502     0.062
                4            32.681, 32.796, 32.664         32.714     0.072
                8            18.440, 18.450, 18.448         18.446     0.005
                16           13.397, 13.407, 13.447         13.417     0.026

See Figure 3-10 and Figure 3-11 for corresponding graphs.

Table A.3: Data for the Matrix Multiplication Benchmark.


Environment     Processors   Observed Runtimes              Average    Standard Deviation
C/MPI           1            23.0, 23.0, 24.0               23.333     0.577
                2            36.0, 38.0, 37.0               37.0       1.0
                4            23.0, 23.0, 23.0               23.0       0.0
                8            14.0, 15.0, 15.0               14.667     0.577
                16           9.0, 9.0, 10.0                 9.333      0.577
Java/Hyperion   1            Please see text.
                2            Please see text.
                4            Please see text.
                8            Please see text.
                16           Please see text.

See Figure 3-13 and Figure 3-14 for corresponding graphs.

Table A.4: Data for the Radix Sort Benchmark.

Environment     Processors   Observed Runtimes              Average    Standard Deviation
C/MPI           1            115.090, 117.390, 117.480      116.653    1.355
                2            36.660, 36.670, 36.660         36.663     0.006
                4            20.400, 19.880, 20.480         20.253     0.326
                8            18.180, 17.090, 17.870         17.713     0.562
                16           8.990, 12.880, 9.050           10.270     2.262
Java/Hyperion   1            241.005, 239.581, 239.980      240.189    0.735
                2            238.860, 239.226, 239.582      239.223    0.361
                4            189.410, 202.443, 190.713      194.189    7.178
                8            224.190, 222.661, 200.738      215.863    13.121
                16           450.930, 440.520, 500.440      463.963    32.016

See Figure 3-17 and Figure 3-18 for corresponding graphs.

Table A.5: Data for the Minimal Cost Graph Coloring Benchmark.
