From 57977f3ac135d9e349ce283e26c6f6304c089d34 Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 21:02:40 -0400 Subject: [PATCH 1/8] Main algorithm finished --- README.md | 14 ++- img/result1.png | Bin 0 -> 16875 bytes img/result2.png | Bin 0 -> 16588 bytes stream_compaction/common.cu | 61 ++++++------ stream_compaction/cpu.cu | 129 ++++++++++++++++--------- stream_compaction/efficient.cu | 168 ++++++++++++++++++++++++++------- stream_compaction/naive.cu | 109 +++++++++++++++++---- stream_compaction/thrust.cu | 40 ++++---- 8 files changed, 375 insertions(+), 146 deletions(-) create mode 100644 img/result1.png create mode 100644 img/result2.png diff --git a/README.md b/README.md index b71c458..1d179f6 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,15 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* (TODO) Yi Guo +* Tested on: Windows 8.1, Intel(R) Core(TM)i5-4200M CPU @ 2.50GHz 8GB, NVIDIA GeForce 840M (Personal Notebook) + +## ScreenShot +These are the test results of all the method I implemented. +![](./img/result1.jpg); + +![](./img/result2.jpg); + -### (TODO: Your README) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) diff --git a/img/result1.png b/img/result1.png new file mode 100644 index 0000000000000000000000000000000000000000..8b034f5b748b854d3121f113b9492dc9076bfce5 GIT binary patch literal 16875 zcmdsfcUaTs-Zuz}l;x-*yVNQ;5FsE&MntT%P=;EW$`A%HDTWc>K;`@X;5&x*Hft<7b= zQ2atfL`3HM?~b1m5n1<(h=^#`#&uu|-|W6F_$C^3%KVr}L5uPy@Qb*Ysgk>4LT{qdU1bUz_>-??Cw0OJ;;>tE;Q*uMJV zz{VYShQE8dUeosDCAaQdTTSsp-&MfGVshX71BSRxJvxZ#`dFLCWYzDzPuhKZ5_imF z`b^rlgW*d{tT0+Ef67(xX%@p;7;8?_b~SJnjNthb+X@z*2C$Nsh1+UpDw{5IPTRA7 zEJqVLtNVqV8!P8M%|q`!YeHK+7GG5m*w6oP#$?+8HAXEOIv?enJ=`BVu{ZOJq_|L5 z;kMfpRAN)0%YQvs+r@~iA>2NI_q7B^cI@V5CubRum3ab+GyQ0mtjL+}r5$8zG@xs#a@f+PZqypW2LD2vh+{(#`me=!3SNF!3;7?mb=;vvo>Ea;%ZMeH;`1=BG{-*F#6bDS)}M3E2VMEp zHmJfN2~%($L9NGP1%(l~2?w?JtvKHREw3 z%i^Qnu5s)EDmUy*)h@YDf~$|u8OE{ZwGS4!su?1PFKiZ;P$%UHoEko>%Xd+#!DEbp z65QEMu&>Psm@lh2-4*PDo;*Pk4~bxEIgIP92>kIrs#*AW^X4H z;K_)?hZ~wNCs4`wBvjYv$fsFVb9h2o$hr-T1xe+`)8Wv&tgcm!rE6`jUs`}fr*G^<|@r7<}R zv}@hU!-oTFRjc3|4jG1yNw}aFNYBMG58{Kb+$1r|2u$Y5dgj(9ua@P92PhA=5#&_% z$5TUjq&vRc@T;t{MAGGJdpJCjbFtiMm!EuRH58288FbF16aTfRDgSUm?Hh{3+0Qj6 zgr|oWh35Lm1a{QP#g*z(Rj*z99J>_ivaslsE-~g%nSLV8BWsGgmvy4U=76Cqg>O?{ z6L6@RAby3wbj4XkwDk8B86BZ{NpyZRfE5hSZY!9UVreUngtU>31P8+rjYE=o9uZ6_ zL1phTED`VWaZ>DaaV)6fSFs{@qd;-JN?uy>bCN6CN1=pFkcNBt^$Y~pKZscTV5lq& zb}3x-O2d}eOoX6$UjvtRF%!KovqI%bP9_MR8scXtv6H@9ne`7pRa7|+sxo*QNqLx| z@HiIJu#}e+N6IDCeZuwdzGrKo^>27Yu%t4%i3P47CwKmKA7Tw$RcNAw%>R@y%WreL zR8~qi5Ba5dc3>>r-;mgJ>lgZ~A%~APmk6;@l{dWnWhK#p#<%)#du#?dk{-{wyFb7_ zQOwyLvx0jzoUr)ngYTGxZQQF7Hc69?OG{|um67Gq1<_dL0z!mYsa!sa;>r$pK@h3D z?V-yW-esd9kM6R1nYh4EYYcutiY{h={rdJFXERh8lB(vc7+(|g}`Jhu}H(Q z4(lWqN{KM&%V^$1hzh)exf>Y5;M{UmY;5RW^qw2W@=lyH8I(5;35mE*(l%1fqhH~b zJsH-nt~qO>9wC`zPl~{!S?qbxNU-C3wU@^xo$(8I`8fqD1m-KO^xO%GhG3Yzu=He- z4SE*E($y95V?yRS1g()S@xm zF^@rEw__cd{%`)MZ%h;PVY$nz-8TszQqcsD^}R}D2yALnQc{fD2)sT5k8lG`OiM|r zkx*kGk;up3T14Vh@!)K-VD@_F`o5SmK_z=UOV^s;FS%!)RE$`{zuqw@Jd7?(Rhk&dW$=sd{}aN=Qdx^xNmEiBV_jpSgP?k!nws z8Ag45pISaVTr^N+giTB}npGZbR&S8D52`Cac(%&PC+2ESs40bc0S4n7t@p?8Xw_)5ABp0 zg23pwQ1F;#!>a?A5kj04U=8SWFX#?!c;2dyWwS)>WT4mg%zgE8ihuW4U06}v;;$YV zUa`DQeDAcB9^AYQy8B-@TXv> zKY!k<&pcEGHq^*Ym8PJB>+(3`IrpUckNm9d@8GB9KOL1MmM2vjVq`S8StjrK+vnj! zFX|?;SH@#HtyX>#KJir?U;UbHVP*ZC`n79A=&dHdxR;caU)3;z2h>WcwFzIpiv#Q4 zO83tOsyK3uTh*J+n9lDX&Lz_W)Qt9ychl2`)lJPuZQt=0Z>DZ-s$+eHc}}x^*Eepa zrx72;xG@gXNBnGZnsJidYs@;t+gjX)P8;4DGj@`_i6rSc#62t*ns0K!Cc=uRF_qS^ zm#jU~n>}VQ<`E%67WHGJm5fA-O=Uf+tpuC&>L$LzOz)?fW#%ZfisVVOdHL{V64EQ$=M8!|s_IpUy1OASj{_L7P23G5#U;E>BjAut$w? z#^f6_oym}!RMj#eRq{Wl(*ZU7-VJi-$qB;vSKA?wm&8<7_;e%q?6#q_;k%?;|CtZwNrfT ziue)a(3iBF#ggy^Keoo|ErN5#*V+}OE-klu`IuO^R%zhfVF9&nRaJ;**q;uis^uCL z-SGu?`#~};)a~j%liWQ!-KLRfi~0#ufdmdGvhS6WOy15Fi&sh^2!)lfS4yCs*vY-d znXq5nvOzfvM8{nO#YEArVSefSfZAJmslgQ*IkWpvzG}3bpMy#R<@Qy=yG|Br9bKxF zI>!>D$8?&_x5-~hD(@{k^7OO{)Bqm+ubw%%uK|A%#fX&q^-Kd?C1k%i$oA+d*L(=_`J%G4z%Hq*wWe!peAsW1lQt>J)`b z5GqN23f-zn+U6U_!Hp$Mfp!tOJHX#6Mn7LP?-6QyVScf>?2)gxQP z(&VzEuHK}O3H7LJ2E2Yh-CkzGw`#b0AhT^fDf^DS%yJPCzbn2vQh%v? zkT_+IZna_D9BfL@9?feCRCvG)UyxZGOMwvsBeb&1$~!we|8USHB-5L#){6>^x$kv_ zZm63jnVxh778G;eELPYjtdjI`x-JPEg|8ZAjc9>h zd+>R+FyhF5al6NLuBU|kvzsm~$H8I$rsigNWMB!^?n}qfBUMzkx*S}5Ho3dD5Qnxl z`X(S^lgXlIgiQ;k`*x`p3$ZJn?dyxdyum~sX9#axlw-g&%)(PAB@#KYus%{vt>tacJOPiD+Nt4 zb!5;ED+PfMX!#@sCYpT`I{{vPrQ1GaP0I6{9v!^kiLZ72mL@F~m{&PWJD<`tpv>@< z+@+R)C*d(Wn7a_K=BA53C^v0%f|hn3+)X;AguJP^8b%(q29x7LF0V~F@c2fj8KO^D zT2HUDD&uE^1MDqU=c@+N$D&VpQ|ULCoGQ{hB095bOXEgvrzu;uz8LjNt(nbsu_T#m zx#X3xm5a#%UKx7~;a_X+7jKcjq=9;zON-RG(4|{4&>7U8Z%|ZX!1s=LvVR-^vnL7` zr{DcrLt8tomixvKF*>;A!FLZ^YLB?+vuTDN@|g@nP>v0ulGLmy3z zCTF+01ntd@O*K<+ci1qk|JdCd8J^BX6jM$Hwdw@L7&49(c-rO?bqI;ViAisAidBrk zG}@dV{L!2Q7T;~@i%^kip9D8@xcU3$@P!|G8g=Ro#%xct2HM2b>J;%>81{1xx7xi9 z7-vbRE3F}taQ?ckbzt?s-y&sde&J}4lF7$omZMoab+bZWE_8uOzrN3p{47DiKa#R8 zZ$<^SQ^bE)sm#7m_eRm3g{k{|L45$2s!6(6La)xJDo(_2Qoa;GpH3!xS6Yt@0JYdJ zv7Cdjq7F{0hpLxRp)LSc56L`l2=gxfdF7k@vB5|%fv%spywVTgh9>#d#(e+R=6YbV z$mFT8J&i<#6UT6WVsj1vDJo66E-s5@ty68` zE;Qvzcg?+1URJa-%;+$HU4s`Y8Y85`r?dm`LcDg%oppLnUCgU3saxExPj`M$dAT%A zcx9&Sac@r!*vK+F+uV3mOrHZRN)tEqS}!f4vH)x(k)K8`MAi zQd;j+qc88#p`vQs4D|u1Ocw{umc1&N7HN>(Bpmd6i3_Tg1dTjL{1x+}b{B?fC8GHY<3gu)lA|mf+5P?Uve!{Y z5i(QUxX`CKWz6>7;3ZSeRoQyg(6}59zOI3LfOP?&k(`ARg`XV>5-v)*zEpb25^ZWG ze^$nzh*Jj4Z-)cQ1K9N@w2(#(x_xvG0Dv8Z zM&dRi!V+zHn>4Hi5THKq<)KuucW;8dYkI&b4!gq6o{-FcldqnCBe|v|N5EySKH;n2iD0@4N$lBlYmkiO^ydCCc^5++@BU4_;rak zH&d9NqD2mKpq>HK3Y3iHao1n&BK0S4_w?r5*vk)1^j>oBlDDU@Z`L|G?AP|cxip}6 zw&N6Rkz~_m^NT{)UF3M0!T?k|HE+aXOh#$){IAp)Rq+~0^M}~?9HDZmz(zi&Qjj(- z&M0OG2kBQzUGBZ;GHFoKvo_5Y1}Zej<1NOaCqkGPgr}39Ji)j#Gu7B?;M*;iuH8K| z6JGh~^Tmu(!A!Rp92w{DJ3Oo-moG&A;cJL%0~!OCXvu8j*=>_10sao8%QkcG*7rTq z+d(PcdoAZvTI811ZKLq?^{5{LzMp%?4SMMVZOiA>Rk-Un7{5|t(81Cc9{Jl|Cprv7 zcY$7$0h(3NVW~lfj}zRSz*@KPx#6hJg`hxxhjY@+3j4+K24bIee;q;uLaYu6C%XudreX>yoEixUwtu71Sb+@M|k#HnmE$$n5KpQ2B?b z)lh$`w6BAhofUPzo1=G}@|EKo^nA~5i<>49W32fd)e&kRl&=s0Qc@V`d62*?0@#XK5T#BdmXbzb@S*f$4?6fArh51c+47f6SGDn5>ErLPJ8>J9ZhjHiVjNrWyB4>Xg;nOmbuvnvW2w$!`$Wfqx0r=vS0mso|q5m7>B`D8W_iQ7#U zlN%h@=Uvtb}stlS08+FeY5PUnR+?+e$u75WkcG$oKfKxOxy;CM#W=C)eaj78&?eI4O>cj8f4dN}ybgQbPl6E3!bw&yw-@K`y0BH6 zYHbN1dsZ$))j$!y=g5&ZZ8l{5Ff&ToVqw>-MuMj4SWCVT29D|AuZ{eo-S$msAIZ^L zGl7Y5alj{3!D$d{;2f?kZ+s!HVj1zMhod@C@&WzfV@X&Z*_0z4LQ8r1VCVJM$A-|a zf)|3Fd}Mc+A+=8Toio-Tg~6P|M>1)+rt7h(H)2z>^*n#*^<6tHWlKzeK|sdHpeL+o zgUV^D(W{SE9B0hd;ty-^D(U2>=Z@A+gAcr#tYa0v06^{3#1lBwbpd`;OhlG8Gz!eZ zRrDoRQR}GFhBHk<*eaG<7g~@%_QbtAiK{neb@hXCAZVVeNU#RFk)rhCY7KJyLuDBH zw5+GN*7AjMC0O6A5q5oCf+mL6F4hDJ-Kv^YN!NDM0%h{n{ui5|B2YHG^FZ-&Ayw1Z zhq}Aj;auSWGv6s?c*8ZV!ihm0*t#LR^L>T)I zKDY8#zdNr(5ijH2OGe-r*bN{ucjNYy?)E$E+AuQ1)v%@ z=dI>XgxsFqVy|R}`WW3{gv0H%IWQtK7oFvi35wK=5SE26^q${Zc0CpYx|`(#kC^*w zFfQVBnj-*evyHIIAiddaPEo%z^-QG@ml=NIXGmihH@2(MCZ1>Q4AdNHJ^vVBdt0_{ zfr?YaW`JMXG$g?pkN|nEbFuL=swe-B1@;cG7S2sbYAUl#*FSZ&*-pNY`>qnj_L-v?MRB z{-!(p1BBMv@*Js?25r&A-koat;nQsq$WnZbaD9ePR>5xDk3E+*Xr+9OYX_huen~@N zkfdvdx-8Wt#T%%Nn%XQe!xE@2n|C_hozf z=}FwVHQgRoM7RwY5VHKv7KdAo2t5eAm3q2W)nCGP=SZiA?3e*xb45!H<1t|Ow}oe& ztB<49HR8h;ITs^4-xhK=uY7ZB`de3@97j^v&4+K8!UFL!#dS0D@ald1ucNlEFRLkw zqolqH~ygCzJsb; zjiMSuZUk!lE~M{cSw`nKuVepKeAlYd#P#REDUdo$wqb-d`B*zegdBpw!q?Lgw0~@E zvVEsFb#m`C2iOtJF934!(uB)2gW|AzS|2igu7k2lFckh&q{6uMd@;pR?Q>i0ES$&y z@RK>x!s-CGl5+EbGUM<_R06nIgCV#WK&Eype`-%8#<6q+xjgx3M0{NjkytRVgP@I! zc8+F#7n2uUkdpE+IW9C@j$;#YxDtg7wjcwv;AcO)HDFWOl39Pb8`K1Zddy2QVE0ZR zeSRa(V10(4tyNafk^{=YOF2jttgFyh=qlWhpIYl!_p)6TKK_&R#jf(bs(ooG#%~(2 zsg&8Y)XqZf+}Lmc^QIatIF7YZwMIcP-vZ*P2_OAwlit2_s_fO-&#vbC- z0OpMi10%a*Yv1it6Wv}nBizdm?H4V(pNLsK7ywGe50v2g&?-cZ6ktAh#Tl;@S4&L* zq9CITU$f3Oq(V5m`qwldRSFw)_8s3F*z=}eaD+B{Z@9tGB zN@6s3W{KNbp7j&HG?_d*X7&Z3rfFYZx;|a%)p_B8C}_7Vv+d;rg_WP+QQFF5i>^(7SYHRats#Cl$^uM}h_G=B zpRSKSYbMgK*tA}J$KYO1ZIq{G#d~K3-F=WCIk6{Z_VE52OaP$3kQv>0_zU6YhPC0C z_8Je;!~jm92yDQ&Laok_wxo6yl;vX;xgIot5JMph5T?@%-5mmU*|l#vj&^V7vDQE@yP`p2*Wg03ef8zGq>EenrfK^%;w?BP zYl>ywhNTjWvTOh7PtIlzk;;~tSKO0t569Aug)})Vz+|W)YiHi6xt|EY-{vp3Lo(h# z&QVK}sK@(!PoJth`mc^Cq%Vrw!L@a41#;j?Sp1_q--nZ=CQAYh8n{lcH_dEl3F@!9 zqxnO9(K(?De(I{cti!bEkB3Rt*5` zfH*u5;Fw2zWRDl9GHRCr z)79S-HNCzM*`+yn5qbCH-QDK>UqIzWW0I;UQ`mcQLusFv50GI0mxM;3w0O*+ra80I zz--+&uOm1JHIJ#zm1OJ=$p2;fSO*g6n{Nw=_&-~^@|S>R>{1sRkxDsj7u3E`RRq@W zzQY2YxOCeX-AC?mKw}1;aFKn$f`^G&qqjp$dyQv%O+p1rJ!|&&Y$K++P@8b=$dAG7x!Mo(PcIw0Uh)EPZ#hBP zWfp(o9hJqcc-N9eJR8j+XrO)5>7f~P*pB`@sfhs%sQ?D_|BItsC4)PWMFjJtc zwv&oXQHaffqo#udRiSs(gFbC#+nzWong3aER2Hxudc>#ONoKfJz1N8*VDw#vxT1wA)qbY6`f@&tlX9 z9a4K#XMusKcFk=UTUDm;zJ7EhBn__8LaEZNXSY=AgqnB(-H*Jr+l6ybow7-!EGAG? zG^Wg=BFiv!8(Y|-hx!Y{ejYZIg$mtj1$28nj>=xkXZUaw*vb}apiKh|>8vwJx-?CS z4dt1i16E5J9)Lp}K^hNoMH3ByRF5oWCN@-mynDO$_=9eL121^?kG;#EPp|(3 zKRvSY5mZ0L*Xu?d*uP4F9)m|lw=N$^w|185CjIN4eC@&vKOT`re-N4$Z zdx9Xd7Scv4ULH~gB#)Cd)&W_wqmMsPe_Wdv4x>Pf@rXe+0oi8^@MbVE9@6nc zFSdZH7j{(PvqE)dnY%uJJy1LUv1|KvU`5joq2GWsROhearSg~39K}BC0nZ@te+hV1 zj0-&H9I!l_>Vi2xkYJwXV4E0_Jxn>El=>}i&hq2EhJ@5!lgAfeRj)WIjJS^|H30O( zAcyvj1W*Ezj~NvR5)cRGlrmq&&0w#eOTeC5m9@76s$U|gn(Li{ z7x5og_tN2cIRvu#GHI&bcs9AAOyj0aooRoUT*CU#vUOGtO{jLD+y=z{*hB%X=v#_1 zpp{eBn1@Uriw@uG8vFNj@2N)6ThJ^%NZJ75c%yr_ucLH`FeJ3z`DB<1B+iv<;B*T< zEo9^uXO1b2H^aOfM(?feZCcbm=57s?39Xu&=@uzqBWWrK-DJan7BdhgD;7mb>z2h) zS5Qev^ChEWFq4vbZbt4G>CRLGqZ$KF4M-1G(b$WlC#}1eE#4hY;4W;N(DTytqH56XHl^ zMzn3Hv(Ul8Rq$rX-1b`X8qs)Wit$rxkT-x5Uc80Mw)Eh{p0A>7DIheVSUX!N-RleJ z!wFl508(64dkQ&r_*D_GuWR_AXQV3;^Bu2(mtq zz}1uYJDmX>g1mG4s(kaAjtNWdf$mfwM?^BXdiclrv|qxVE$4qw=oOYRJbo<5f;i>}$Me@tl z9;#H6FHJ-xE;~a|35-cpkgslJY^pFrQ5$tt?su#K;6q}RA4TGyVkERox@Il^I*2#9 zU*%$gWd5Ek=6HD_lh%w7tjS+7HHmKn%!7_8fcVw6T#)kt@t<&DC8Qphg~T#o!iQrI zEZQ}K=o&=dA+;g@N@y|&#D-d;)E}RsC1+oNg{x#Pbf6)?1yV@|Fh`5(fD=B@m+L4u z=C2U$DVhUDPnfF1{fD;h&o(S?6QF(sYI-!8)W1IO73|w(R^j~4#ZeOsKhV&u$IkpM zCdE9`BCwxZsImdVALO*DwnI5TZ*a1_9NLc-D$7tmUO&K!g-|5?T$P6QxDya6aG+ZW zUxqJ;>h4{J9R(4-9->PFPL%-#kK&^`b3Yh?yk5x&3uKg>g&{Q{CYhQ;=_;4sl*^Kr za=Rew&6)zdhdQ4M#Ep_S!3*4b!tfqV&T%NXe&P{byw;6#|B|lA39#Vsqg-I&*XOBx zgJ*^ki7{X$sjO;qHQ)S;K=|(|0I*b@gqCo|f3<|glf+TP%zBVh2(t#D~)1w!qJatn8rv2XyiNFh?G4N(ol=uY1_V zIOP~jCKSGa0@3@ezz>khW4j<7{F`q)C8SzPqJQ5vj&wFFx(Qi7?S;|9T3Y|u1_|U# zU)F~x|K46+S+`M-t-%|nF}e?p^mRZk4m0P_=2n2WOTuf?@OK>md^bqEzo+&5T0ZVA%Q!9j^z zEpjoLYyIWS4R)_dthRsPo9B5~6o6-AdqTguGpT3P9&k7BmD;uTt6=Kx4gz^UWfg7$ z3v*g^{eLl!t4M!`Vd2|7f#8;Td-(ZP;7k2of@N&YSL=lmd_v->PSJe=Rym-N3a#QT zL7skxH?;KKgmPP&3SFdwbkOc2Qu$VEm?q|1A#GcE1SKM8KnMpJOki-g8QJ{;>;*Xh zbJy_PLpC=y$IDHp;ZG& zAa>QX@iL@sbp~3c$sazpVGModrr{yuF=Qx7X}~}G;Bg_~VnFxHpZn9#Q9K6-o0YV{ zqb%4&VH5csdaz0K8y$p^NJhXx^fE(hG}B|vjj03C;hqqIqd=3&L*H2CP{hGr2Kcm9 z_?Z=PJ9+9mXE`BCJ4(=nZTaQ9dFjZ|r0#+e6y-Uyf`CLfc_0v5B50Iv+G4HP$- zA)#o%gn9yy4F0+F{}aV|NKd9yrBDvWo2e@^WfAYqWM3KXE~Q8t%ts5UOr@J)Xh)#b z&cZOiKkWtoD&XxE<7*GaXV%XEhk^^Z%Tue>UI|(k2*HI~*F8__#J-2pzF-{@TgnD2 zg5|2e9Wr|C`(AWplYt`OGmBTugXBU?(@Y4D_*;b$)m39-gW>wwnrFi}Whli0#F7;W zRfk)(=3@C5*6@uGH%?rihT7Q=8D#LmG%iEDlw$J3dwmcQy&3Tw6;_!ev}SOQnNV&m z@9~b>ku_c*Y&bvv{sFAg%xrS@B5374Ag!`ZNDM$fzuLPI$!d-f9^FOe@{Y(?-q_U3e)PXUY^q!F-Z{{x zCM-C4aKhKW)`ni^TM0uatNzvLU|^R42{GZ6IT+=Gkg?TKs8{sYeL)`clmtGGfz|)Z zu24!1<$s~N?rCS`&t3yVvf4w>61zc4-IuAy9O`N%Ikm|T~ zyD%Egf8HjNrvgG7;KMp8FLVAUfuh--hHC!e*@*k9e7Aw5$xMPgS?)HaKrQ^em>R@? j;&3Vq!Q!$&4PIfazAN2+5BLunM85yl`gp-Hx9k50O1C*% literal 0 HcmV?d00001 diff --git a/img/result2.png b/img/result2.png new file mode 100644 index 0000000000000000000000000000000000000000..2d0102f6c027629ceaae0f883af35c28bb12711a GIT binary patch literal 16588 zcmeHu2~?A3+IEmCRa?`37DYhP4sErd;!;2bk|Gst8AWA@tf@mO79$>i5_U+yE`V##mC=}}TPj-KN0EJo*4S(oY7r-Z{`jS?{i&p4?T{}?p`-X$? z<`qBNJ+>%RVbW6Np@r~%@zLFGp(xa{I^<8QA?U{EDAY%XKKaRnsR8 zGo{ZT)%e6n#*irsb9pr6_B?v#1%Cs|$ zT~pqyYo7LQK5^%@Di@O1`Eu{d6<#I#+TOHZrf=8xK*wCCvTd@yKiHSk#I&l%qVp<|O@E_(LTUapj3AUxp z1b?uv{Ok{5M+0bPa>M$`YrGkgq3yTtqGq#&Xxq4xhV=(Y{Be49OQ@cYn{2aaqEN88 z5sUp6PI8$v-L&Y&;*2&$yW`mLYS*&cceCo3=xj?EJxCD2vqK5N)<5s)IW6Tb4(7y? ztNT6IH+BhKCn~S54wltS)p8QKUvOz};+sqH%N0+GVI*{v`P}Sn3eoVqM?%VozIEr)HoqOrkVS6P@MT7%;7Cu-Da}YDao4wBBPHUmMRz7UY zXCf=}rALjpkGk!WXVp23vS(Lj$XzGtkKN(54v9f|?9t1fBk9Xa8ILQ=Mm$`MtK9mo z77eFV)c5Q0&vhP>C>M)Y)kx7eMiNf(*~qBf-c3f9_4%CJ0|9Vu$i1d`a4nyvH6l0N z^^L@C|K{kC%U3*t0;b<*ydKM#K;v$qv5MM~%ogJlf|Ba%h98RD*Gs*(4Nq_l{qr`C zMYmViJ$QWZP;W~EjN$#@epj^EU%p3_f|^}Le*cb3@d)Q@TIuyS9_z5@=CdVnW#!|^ z`tBGu96sOi=rkv7amn?!%S!@E_Enu2aDBSoSMF+-Y$dA=4Uzp%g=5EhJGl4cJXqzr zBd>8i{D0J}3%U4ETHU}&``~0x9y8F@CKf-GS)b8F{Gi7bV~U?VggshmA!b{nU+3Lr=$n(g~E2JAIC>K;DabFCc+@3r1s&PaW)goSBcZV6~C*DmRiNE5* z@ao{=^_;gD@-&*Kb9fcMU%MfohnN^Pk~LjCW6j#ix#2+&HZ^W0#%&6(V3hirhq>rY zC5LBLczo39T+&r zVK@6N^$({r(1rzx73QsBr@YDytP7&VrBBCG6-6NfJr?K+#@VKFlZ%1l3%K!Tuh%az zj>o#vH;%y$3U9YeXEd#a)vcF3oRTHCN#*@d4eJ5!oK5O}nD@y2q5K{F*gcMT5gGbC z#Gy)y*6@a|DaxT_WG1Bad|}xJ8wg)aw{$wwA#@bpw)I|RW$wyc8^7sOGt9lBK$Ox) zaqu*)_vR{c-vfva+cY!YKXg_re<9pqr)w*a5Z5f9{MvwVk8|}kh;3=S%{NkA9?877 zN%y-wu6#1_!2d^%w(rqs>KANNpPZz|tM;D0Zdr2ML}oc3f38E(gv2^t=9bN|{I49h zqEM-i{&7X1V&y$l`A@8vMW`U7pDaF-li4R6e}_GLIg_Uo>v)Y7Q54u4J+mR`!|LD)j$Pjm+@R2wwRDlg z=*xK_O;AquuJU~620G&ct<;R{yu!RT)I#^{mwP;~1%`gqU0t>+eN>c4g z&3_!o9eY51x<8`TMb4jfmr;@(%%V4NL%7y|tMpH}CQs@O8ZayHJd_M!yR|ky zzq%t-uO~HJPRoPEsErDLjA>l*u)$?RRm{8b`lf~hR`}2NFSo`B(~{vRkoa7umn7@U zsMBLPPwRdU0elte2S=*uNx>c`H``-0W?(=z+W$0n537kb7k}J{b$746ABpYd7%=2j zG-q`LZjCpTgonuTLr)Nns=oC%`Px2sA*$xMA8ShnPX~o^{IwZ4OH-N!T%fHSJ4saz zI4z4hopG>(;v3<#<`pE=h4qAwwgfw5Gf6>Q#!Y)tapO+kr|TmUn>SUzQ&rbBwY+nS zgfBzmGG7kN$-EM?CO8=#-s|L!entuTLRBury?s7NmT0YPyT7g^xn6I?B*w`>)!Va^DfPq{n2!Q1|W7uUjroFm90 zy4)~yCGW>rufdxYPXEZ;EUi|@79{rnP<7275DRklDx=bE3E0DL;qM`dlB3Q1RP|Wq z8Sb!(p*UD}8$Eu!rzDtB56OP4SXdI~vbEI9$*H508CV4nMZQLE_~B_~r?Fw_K^;dN%@S8uJwBp;pGGzLR$9ssWv)^ zSs`W)BBOLC$S$;)sa%7l6HLvvR>_*qEYc|7bv7t(^K{nG}x&AyD+s>Ef(DJVJz7x{-_`U{VV{xvXo zY8jHeHgn@AuEUTBl+lF$o!b)jlcsj31ZsEfl#VWk+ptB^n7uL?@+g8M?{WH-+U`ps zSX}>f{VA-;nM>l)H!X=5Xw3!#F+MXC&5h$s+Pj-S`wW7KSt-=sd1d6J^SV8pi5K@~hRT;KmS=${MK zk+D?#WS~~8*~n~uOfA2K#%@^pSo}$xnGJIhiq$Mc_ua%baRfc)88hO^VJymXCpGDF zH6rdj47r^a-*||@?+47sOYJrY=yaGz_7?20Y3n}^;!!btVZk?wePwbrAoeIe#bfeZbIMu_%9 zU9GMf09`Tdg5V6TE@KX+=vRVxfvhC6SDTzaH;F^^dI~ks-=B5j@xf_| z7y+FX!4~{-q^708y2-+>)7&&`N{+cpkrj9M> zNx~kCR-SW8Xh;kWSO=$KP!|2Ycq%ZE|9OxJUT@pugUA7zi%UXJ5UcV>(mE_qowsqy zqu9v-oi@sjq*tLRxfIgEq?xUa=XsY6k=Zz$4p1C5pxG$v?C`jxzH&aUb?30NBY+e`m$|( zZgez#Td_-U(FNUQLZAqt690X`y#Zn*ah%S%eUe;FqPw`^5}J z4S}7+N2+NkPA{r6NfG2$S5=ox9pRC*3x&wO0NwKa>PiG4+Iagx?&^~Q86}R2V4k~X z#pJ_tv*Ov?1qtc;rk)q;^sc2Qeg4z}<^3@R2lb74c7Zi%|0+niY77&2A9tk$ptt|c z$2$ffDo+_FCd}jl@fY`4eoa`Id zsbk(Pq!4UFhRhg9&ZgeKi^A$=0=csx=T{a7O5C6leMg;o4OPRLtY3)g(nX$?6j@-- zOXZ#V2;ak7617m1`HGd^>}#p`IaY=Nz*5iW>4*zdQC~bKYVN{B%*OZ(k|9GV-GS{1 ziCc;MU_JSnE+fCdT5)9#QzUx^-q%5OMRaqQp<=R3w+#MteGok ze#S!moRj)@-z`C5Q&kySnKi5ul;j&_>hL8M>+5;K>)##x=f@4Z@=|5nzt0L?^xKz7 z_0_Cc>l@nifsB!pP{DrnAR%~XK(|SmJLVuPpoLE@u6mays!I$2MZV~?V27o=n`o%Q zPXIz(F^!7N6jBfF5+%|P^w}=UUWt?-dq+g{>f$t0=ViuiOepS1PGxlRI!=TOprC|c zhlqeZ_}l{*qql_w9UE!Bi$V=i4e?QU!hl=po`F7-RBz(=6%Sy_5_P|~k2^Fh^?der zd{5z#=?9N*ykSp05;4+L zLwZ&=P}+FVD!7yn%}sWTlSb&WqWDG+ZUwjhO;R!wxJFadq~~n(gl>>{w7&B`v7m_Q zB_;Hio8;xKuqXJSfp#E8U1qpOFlG+>JL}yF{VZaSZ!nj{ZYCCNMrw%oDPPTfoDKR8 z=sVMjrUqXi_t19YLW>2E=Zo1)9uc5wCUXe#*ZT|Z8 zXOR29Al9)>bt8Wa$g1KaX)R5VXFdBi0-*-rKUY1*!zPzy4L(mNvVnc5xvI$6_uaYcQ zdMrg{-lg87PpQrzcl!Te$EdDHGMxYCV^$Eae(NC#mr_l3d!~<+U+8-n;q95;xN8kQ zXY7F`k=+w|f#x(d&?{V!YGg#W!cjN&3!Slw`jMy@Vn~iltkkv29^%n!*I~3(?r7t)U6HHZ*);=BEtY zok+HivMjhuFUh7P!K!U$_*fBhdvelV`i4-M6MZWyGuQCL4AH~7I$gogUxU^AO~0Ah zG47`L4rj%P*;Cp7sVc=E6l!mF=@uxA0g@U$1 zr~p&XzMjGEe`JQ`*Da)Dah}qzw*zaJQz7cTQg>3o-XBy!U*eamJ#9On0porx(^e1N}Pumw00 z$B4ma&&tOOtlT!(s20Y@V@)#qPH%~*8n#{Y5I}-6Hg4H#Wart3PEE{E8Cw*}7O#=V zq)@(K`>pajJ1T$+lz9zGq%yjd8{LY9T)F!1;EGc|WBU#IO~m{)t#&+-8$X?C&;Z}A zKkj2a@p!Xnz;6wGZ?~NRf9p5sz?dFutYw+{rtIy@4(iBg5WJ5}l6xGO3OxpEMza!w zF8>&BQ5|fQ8QnASF80=F;`n3-#GB5};R-Sn7nf8W9If7V4u8S)*XCk%Ppzu(QtJu7 zm<1}!9+o~ThWicGE<*JeaR#em%EmAKq_&JtQ)t7quvJ@ zV^S_HOAa*V58*2>dh!~V9FiJiS}v)V*7N}qf8{*Io6jb!veM8|Jq9Bsx*ul)ulKZn z?~K+9QRB~%R~y=JA@c+PKgy)d<=NY*>Ihd@Xk9PaZ7j6wGcx+=MR`QEqek@smyht$ z`?7Zolnpgil-+nOLoD-X^t}k6ZB>oKV9y$)G7@h;IKTi<9_3r|XqxAVk24D$yf$_e z1cj8LoBNEu{9c}HYW;Vhmu^+CJWdmnAv;-0Q}IZxgAl;!)_ie+G!EgPZdDL`y>{tM z#kQ7*Mx#=#Chy8TfTXG{Y9E{-(rLF%AQjaC!{tt=?TGsRd?%hR0UqwTq%qH-g`LVj z>ES^xdr2KxBuSGwMneQghI<&1(HQ1(ux|;vf1gjBPl;<5H-+PT4uRpH0foP{C4#P| zw;o#pPOh^TN3in49S?Jd0OBLCBnICQV-)IHr_Oj&HED(-NIYzIY>gdsP+9udUHJ?cyevlMY9ozZSyzWt>^GJM)>0erx5#SA{W+=oJiH@AG|Cs z8%>l=FD9}`Te4c;`j`>UIdQ^J>SbnXx?7BvuI}55kRnw-3cnHF;naz7%sq4q$KZ;C z%jgHpbxqdcF6HzxEz{cFqOq@nEks)99p@fG`8Bo}BLtFGrcjUtxnvgdE7rbz4k^88z(> zg%YH0Gm;vwOwLMbL`vKOSkq9sb#-NNJ-F{ni~N8-<}(9l^eQL;z(rHP{S8j8CP&C> z`6Es)s!Agg0tuJVj8NUFtI#7pY(z*z1VBro2frc14uda~j;)FE9fs9-)v$goaM?|z zTG$LqKGf5kgzB*YWZh|qu##=6*wzz*-kZ4xmckia?T?Q6F_GKrzyCCMr#K5}S3a{! z?1iv-`O>2}An{fjB(<)k6FWNP^rAe;>!f^UyLDj^=xD)do+T^KB1CFQVJnn@#<0C* z8j2~N*mO5>rq2Sfl8C)(Rn3oUUipoBs^)}hG&m%z6jHhaKWYvd%;A@NoP`@~tUV(r zv!DIM$&ERt0W9TwI^HMDIXc@20Er`L;y6B2`rtFEB{aUOSb(cYVgag}IXiw*P&{?)K7k21LN@hG zxXY7Ny$p>&5}Xn@dTj$)wcR?B@{^_af(nJ7d{8q^^3dKD_o+W}9Lo^kLrH9_RQY9& z4P$4^fA*ae2R2JRY@bla>)7nhhU%Z1I#tDJ(%b=dj8EUwfCvCx|9~duKYG4xQ{W<) zTNn|Qj^QFOO>=>+Wu`N?tHHzzeBg}T{v1|ZpaEp4w|oh7{Nj5_YycwX@Rm);(Dwyu zw2|gXzQg^8@Qg(3F85hN*eGjg!GEj(hY|~Gsk8YdU8}kn3u>sf-%xwPReuxB;~nY^ z3dSU6u+|(ExiU$0d^&?OMVoa4AVeL~_%LA-sA>7~xa# z^z+y*o6^Xb??)12va3o@fa)Q);&$ZXx~k@j9shAp277!csN;QDx#R47^SCPrlh|I3 zNTRQ>pE_^7ZXTRm+N`xcq6!a?bA@?BW&35Wwxq!)O(Wi16I(;nJbKQ-3=i&vCkKsu zBENrAmKB0V_iwECAy1v}0BsH7V#lqe9>%|~=!ph;-uV~7Pq89^`Yl>o(x~fNMz=Xe zExqs?baB_fqbETwh&1h=Bd;iMTur`i7_TBDg6abvShp!S3Q_3pRklh-!gHakE;jw*7DsJHGOg zCT54JI+hrpxzrVvUm!!`ZQYD+zAod;xSctIecdr3HQaj~#{I>V+8GD62ShiPAtPpT zBBF8UBBsS-M6nL4`U=jYTbMcK@d-Wlb=B#2e<`CdGgJ%x{sRG#82mNS|EYi|AP)>T zbCh+BV35`P2b49prD5|W{h@ZdsbW%)RPB-=d?pXB8b*}0r{gg9qY=|nC;rtc(rxDUH1ul4G0RBr@y@_vftUNM zWb1)l8%&_iZCcx~)}>XgN^XrDm}o6sTs1rgQ*iZkzI)%prp#Vo&VAw#_XuJd*=YY1 zG01GFcu$h6FgwQfy{HyIL*g*~+E95I_$C7AK?M1N#1h`7NN+Lz9EkLwSI7Bb5FZJ`d?#{S<;V&KYD@XaFN`IzNjZdoKS6bOhd+ou3>Bv(Q5=mQ3MHvyrkRCN zK4@#D`t z$}`jVuFx<9zI4hGLD1kYm>b}1mb1lv6Uq-1(W0p(>RllAWu5L0+C^koDXAqqWqgE^gdm~CsG5Q>nI#{%9yDswQX1i=cKVDS2njPaJ@p_d<|9N6fJ(t( zkWAQzxIPfB088~9)*zR}pj&OOYVb0b3ki!D@oMN&Lv>a;JqaD~>04r4AWhE+@zPxu zG6+@bCL@C;pVyK;yoz;pWzo#6zpdyj*o>;7uF!Z(b(YS!B0oVS>oj2dd^c1%8!aOS za$tZ)<3zBVB=gdF2aeICA!+Et`Bn^_Sk;Y_xVWGF>W4Ou|Mvv}Xgi>lN?A8H!YyeVJ9%ZMM`eIn< zOzcPaq~csCsID`4(hC3L6NiUme8e$q#&v=2$Xrcc)bZ7MEmTl1Z}uaNxAPSRX}N~> z_@(mPWiv@oypK2Q)lp48$T1vy%chX&HQt%XZ6JY`1~Cro8J>-A7A@vGi0R$38WSrnpiA)r7+|IE!#UnU)~_BN0h;1hGSVWB*>^ zf{rnov55F%KWN3Rx6O^<>fyoL^AJEV?KW7ANFpaP?4-3eU^+i=Syh3gu5xp@zw^MY zH=w5XvWBC0*>5C?s#XRX{gL0O;?^6ANVt@qeMKIhvRnXJ`*JI5YV8;d(9(R_U~{Vu z<|;OY&$NQBHv6txUPdfO2}Ot+)JR+f>h;t0#~nG;tmqBd!@yhs$Ee(hxlyu&w>AK= z;32R!FmL2+?Kny8%BWq}&`hMm2G&oJhe7rR#u1?#k=-CjgxPp46?YHHq535&`!?DS z_8GgCzAhP+2CRoQ^o2Beml>GQ=Zd1{DVydzm+Sa1*YPj4t+Y_3U8_*o$8OXn;bkNl5E<5z;aPfSrX+~0n5KR@tD_oM zbuJ;0j=qla9vQ`b{=9vKU5uTX!2uN_$5x=BunFCurVR9#&O432PqHaEybH|6#tl_3 z3#`+dAtTmWwpOMhkbt95Ha<7{!5cqLfUcT3@t=MOtmJZQKc{DZ2erA~Lg!Y;oJZVn z+&`3v?ixu*Iy~2n!mCcL6YXUq#^j#TRb&hjt$K`sJ~&9Ayrc&)WTA-6@;hec&P!Oj zn$w}X-+(9vA$JUHW!jwqF+StWBB5``>YmnvEl*^WG%yzDr$oODL2RyNI{l%|xiVex zHDY_K({qsNE;Ag?Bj%Gqs05SO{+0 zE>e$&&{JnNGs~0JchBZ12vv z%PT5LJ>fGQYJ+qMwl|h>ffkg$TzO(TRqeVnMb+%8Py{IuS{03|(P?Q*V=8i#u3)qH z`9mq6BwYLlxZ|z1F3IcHCu)oZV5q1He7*fXc|qsNe5tk)Uyxf6>!I$u%n|x^cc!Y0b(Ru`=9)%)0rG zlWI9VoZm}a4DQz9Z)obCpTIpD*S$KC5j!?S=8h)bnaVKWA7o{AP0%ct`|q(mHsnBd z1B8Moa0sVIt5&;cQlF5!)oE9Qix2xf={zvkn^sOma}Dh=Kp^!B>jR60TAtUaLehqc zYY853EYITSJ5sY<@PTuMX!Ew7k-12hz*dT zz5j@ls3t_d8^N7rf^7Upy)BhrV<|B4{Ot)*(yN=*t|GX@?U$(^1%RuL3+Qp zab}WlGj5itg<|USnJ84{0(g$)7gUeDq$$psUxfARq1Vwxb$bYOCLxSiz+p*BVJY~m ztEg(E%eiwOfy@nn^pF$@C#z7g1RC|33tiHUY`tby*({bo#_n)t$+qT^PA6yxDW0xLl{)BJYuD_&ukdgZI0GAWtfKF@ zpFCnld5Svph_CXvN4SB*u{8 zJr?bFQK+iiD5IdCnUrns;JXB95>W{m@v<+5^6ltfp}GxB({>k;3Jeu+`_4v{Y-8w1 zZjHWq6Ix%g?Bjmm%0r&;29atnsKX*#?1RiV=tV&^b%cX9>_u!@KFSQrUG{iOP$J@1 z(??oUoL{wxAV*n3^c=6EH~`{?1{>|kH_4YZe(^u-X9ytxMfSFIy5CWOkJ2oWAjq z)gA3ZwO?Rv)XlMICrlU88VL=y0xxyrCXy*Z=+s;bbzja)g!~SK5d0%^_vA*I79Ht_ zgdQ=Xueuo)eHnwa@~NfsBhlvIScB?g3JfjCLTk6C{mh?GH{(heLKwaHX;XJo;z~jM z?#uXZtim*JZ+4ShVyU-a?Nt0hN72(kW~&6S;Jv9v!^b1M!IDDM`$a2JS{M>K&8V!kh^ZXyL7 z!D_fR6aA!rqXJqO5D#w}xTU?Mj$g8U85e_Tn%YpqW&?J@(Y!stFB53Vxp=b3fQ(SV z|4inI$a4HrU;x=_PjED=3@j&Y(}DA?G)Cn(E_?ipQ(}2o*JmTS9y)uxphNWPOaPx;>OOAF6kq=xh5^7bRF~eQ+LA;ItWU^Nwu40|DZS_2Tuv%~6 zy`Irl8NnTn>WdU~iJ0d0xkZ|Sc_Irum9X5^1{)YYZzD`}XcV3Rv0k0&d>OAuycIen znh)nXm_T((EJ8>GbrV8;yGYNPZzhHM7!cVXHR$Xa>_IG-|6E0!X4x`T-&q7skXC_I u9j5+NmZ2`pjQRh|(tR{bH6vDOop^G6-1W^3MaTmEWar+G={r6<@&5oN*E@Is literal 0 HcmV?d00001 diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..ae96953 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,39 +1,46 @@ #include "common.h" void checkCUDAErrorFn(const char *msg, const char *file, int line) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess == err) { - return; - } + cudaError_t err = cudaGetLastError(); + if (cudaSuccess == err) { + return; + } - fprintf(stderr, "CUDA error"); - if (file) { - fprintf(stderr, " (%s:%d)", file, line); - } - fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); + fprintf(stderr, "CUDA error"); + if (file) { + fprintf(stderr, " (%s:%d)", file, line); + } + fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); } namespace StreamCompaction { - namespace Common { + namespace Common { - /** - * Maps an array to an array of 0s and 1s for stream compaction. Elements - * which map to 0 will be removed, and elements which map to 1 will be kept. - */ - __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO - } + /** + * Maps an array to an array of 0s and 1s for stream compaction. Elements + * which map to 0 will be removed, and elements which map to 1 will be kept. + */ + __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { + // TODO + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= n) return; + bools[idx] = idata[idx] ? 1 : 0; + } - /** - * Performs scatter on an array. That is, for each element in idata, - * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. - */ - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO - } + /** + * Performs scatter on an array. That is, for each element in idata, + * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. + */ + __global__ void kernScatter(int n, int *odata, + const int *idata, const int *bools, const int *indices) { + // TODO + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= n)return; + if (bools[idx]) + odata[indices[idx]] = idata[idx]; + } - } + } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..4c7a86d 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,50 +1,89 @@ #include #include "cpu.h" - -#include "common.h" +#include +#include "common.h" namespace StreamCompaction { - namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - - /** - * CPU scan (prefix sum). - * For performance analysis, this is supposed to be a simple for loop. - * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. - */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - } - - /** - * CPU stream compaction without using the scan function. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - - /** - * CPU stream compaction using scan and scatter, like the parallel version. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - } + namespace CPU { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + /** + * CPU scan (prefix sum). + * For performance analysis, this is supposed to be a simple for loop. + * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. + */ + void scan(int n, int *odata, const int *idata) { + + if (n <= 0) return; + memcpy(odata, idata, n * sizeof(int)); + int layer = ilog2ceil(n); + int oLength = 1 << layer; + + // Uncomment the timer here if you want to test the efficiency of scan function + timer().startCpuTimer(); + for (int d = 0; d < layer; d++) { + for (int k = 0; k < oLength; k += (1 << (d + 1))) { + + odata[k + (1 << (d + 1)) - 1] += odata[k + (1 << d) - 1]; + } + } + odata[oLength - 1] = 0; + for (int d = layer - 1; d >= 0; d--) { + for (int k = 0; k < oLength; k += (1 << (d + 1))) { + int nodeIdx = k + (1 << d) - 1; + int temp = odata[nodeIdx]; + odata[nodeIdx] = odata[nodeIdx + (1 << d)]; + odata[nodeIdx + (1 << d)] += temp; + } + } + timer().endCpuTimer(); + } + + /** + * CPU stream compaction without using the scan function. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithoutScan(int n, int *odata, const int *idata) { + + // TODO + if (n <= 0) return -1; + int num = 0; + timer().startCpuTimer(); + for (int i = 0; i < n; i++) { + if (idata[i]) + odata[num++] = idata[i]; + } + timer().endCpuTimer(); + return num; + } + + /** + * CPU stream compaction using scan and scatter, like the parallel version. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithScan(int n, int *odata, const int *idata) { + if (n <= 0) return -1; + int num = 0; + // TODO + //timer().startCpuTimer(); + for (int i = 0; i < n; i++) { + odata[i] = idata[i] ? 1 : 0; + } + scan(n, odata, odata); + num = odata[n - 1]; + for (int i = 0; i < n; i++) { + if (idata[i]) + odata[odata[i]] = idata[i]; + } + //timer().endCpuTimer(); + return num; + } + } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..aa23383 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -4,37 +4,139 @@ #include "efficient.h" namespace StreamCompaction { - namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - - /** - * Performs stream compaction on idata, storing the result into odata. - * All zeroes are discarded. - * - * @param n The number of elements in idata. - * @param odata The array into which to store elements. - * @param idata The array of elements to compact. - * @returns The number of elements remaining after compaction. - */ - int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - return -1; - } - } + namespace Efficient { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + int threadPerBlock = 64; + int* dev_Data; + int *dev_Map; + int *dev_Scatter; + int *dev_oData; + int *dev_total; + + __global__ void KernUpSweep(int d, int *idata, int nodeNum) + { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= nodeNum) return; + idata[(idx + 1)*(1 << (d + 1)) - 1] += idata[idx*(1 << (d + 1)) + (1 << d) - 1]; + //idata[(idx + 1) * (1 << (d + 1)) - 1] += idata[(idx + 1) * (1 << (d + 1)) - 1 - (1 << d)]; + } + + __global__ void KernDownSweep(int d, int *idata, int nodeNum) + { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= nodeNum) return; + int nodeIdx = idx*(1 << (d + 1)) + (1 << d) - 1; + int temp = idata[nodeIdx]; + idata[nodeIdx] = idata[nodeIdx + (1 << d)]; + idata[nodeIdx + (1 << d)] += temp; + } + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + void scan(int n, int *odata, const int *idata) { + int layer = ilog2ceil(n); + int oLength = 1 << layer; + cudaMalloc((void**)&dev_Data, oLength * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + cudaMemcpy(dev_Data, idata, sizeof(int) * oLength, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy to device failed!"); + + timer().startGpuTimer(); + for (int d = 0; d < layer; d++) + { + int nodeNum = 1 << (layer - 1 - d); + int blocknum = nodeNum / threadPerBlock + 1; + KernUpSweep << > >(d, dev_Data, nodeNum); + } + cudaMemset(dev_Data + oLength - 1, 0, sizeof(int)); + checkCUDAError("cudaMemset failed!"); + for (int d = layer - 1; d >= 0; d--) + { + int nodeNum = 1 << (layer - 1 - d); + int blocknum = nodeNum / threadPerBlock + 1; + KernDownSweep << > >(d, dev_Data, nodeNum); + } + cudaMemcpy(odata, dev_Data, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy to host failed!"); + // for (int j = 0; j < n; j++) + // printf("%d ", odata[j]); + //printf("\n"); + cudaFree(dev_Data); + timer().endGpuTimer(); + } + + /** + * Performs stream compaction on idata, storing the result into odata. + * All zeroes are discarded. + * + * @param n The number of elements in idata. + * @param odata The array into which to store elements. + * @param idata The array of elements to compact. + * @returns The number of elements remaining after compaction. + */ + int compact(int n, int *odata, const int *idata) { + if (n <= 0) return -1; + int layer = ilog2ceil(n); + int oLength = 1 << layer; + cudaMalloc((void**)&dev_Data, oLength * sizeof(int)); + cudaMalloc((void**)&dev_Scatter, oLength * sizeof(int)); + cudaMalloc((void**)&dev_Map, oLength * sizeof(int)); + cudaMalloc((void**)&dev_oData, n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + cudaMemcpy(dev_Data, idata, oLength * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy to device failed!"); + + timer().startGpuTimer(); + // TODO + int blocknum = oLength / threadPerBlock + 1; + Common::kernMapToBoolean << > >(oLength, dev_Map, dev_Data); + + // Here I reimplement the scan part, because in the main function, scan and compaction are timed seperately, + // and I don't want to allocate memory for data 2 times. + cudaMemcpy(dev_Scatter, dev_Map, oLength * sizeof(int), cudaMemcpyDeviceToDevice); + checkCUDAError("cudaMemcpy device to device failed!"); + + for (int d = 0; d < layer; d++) + { + int nodeNum = 1 << (layer - 1 - d); + blocknum = nodeNum / threadPerBlock + 1; + KernUpSweep << > >(d, dev_Scatter, nodeNum); + } + + cudaMemset(dev_Scatter + oLength - 1, 0, sizeof(int)); + checkCUDAError("cudaMemcpy to device failed!"); + for (int d = layer - 1; d >= 0; d--) + { + int nodeNum = 1 << (layer - 1 - d); + blocknum = nodeNum / threadPerBlock + 1; + KernDownSweep << > >(d, dev_Scatter, nodeNum); + } + + blocknum = n / threadPerBlock + 1; + Common::kernScatter << < blocknum, threadPerBlock >> > (n, dev_oData, dev_Data, dev_Map, dev_Scatter); + cudaMemcpy(odata, dev_oData, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy to host failed!"); + + timer().endGpuTimer(); + + int count, end; + cudaMemcpy(&count, dev_Scatter + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&end, dev_Map + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy device to device failed!"); + cudaFree(dev_Data); + cudaFree(dev_Scatter); + cudaFree(dev_Map); + cudaFree(dev_oData); + + return end ? count + 1 : count; + } + + + + } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..169230a 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -4,22 +4,95 @@ #include "naive.h" namespace StreamCompaction { - namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - // TODO: __global__ - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - } + namespace Naive { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + int threadPerBlock = 512; + int *dev_0, *dev_1; + // TODO: + __global__ void NaiveScan(int d, int *idata, int *odata, int oLength) { + int idx = (blockDim.x*blockIdx.x) + threadIdx.x; + if (idx >= oLength) return; + int flag = 1 << d; + odata[idx] = idx >= flag ? idata[idx] + idata[idx - flag] : idata[idx]; + } + //int threadPerBlock = 1024; + //int BlockNum; + + //int *dev_Data[2]; + + //__global__ void CudaScan(int d, int *in, int *out, int n) + //{ + // int thid = (blockIdx.x * blockDim.x) + threadIdx.x; + // if (thid >= n) + // return; + // int m = 1 << (d - 1); + + // if (thid >= m) + // out[thid] = in[thid] + in[thid - m]; + // else + // out[thid] = in[thid]; + + //} + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + void scan(int n, int *odata, const int *idata) { + int layer = ilog2ceil(n); + int oLength = 1 << layer; + cudaMalloc((void**)&dev_0, oLength * sizeof(int)); + cudaMalloc((void**)&dev_1, oLength * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + cudaMemcpy(dev_0, idata, oLength*sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy to device failed!"); + int blocknum = oLength / threadPerBlock + 1; + + + + /*int nCeilLog = ilog2ceil(n); + int nLength = 1 << nCeilLog; + + cudaMalloc((void**)&dev_Data[0], nLength * sizeof(int)); + cudaMalloc((void**)&dev_Data[1], nLength * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + cudaMemcpy(dev_Data[0], idata, sizeof(int) * nLength, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy to device failed!"); + int nOutputIndex = 0;*/ + timer().startGpuTimer(); + for (int d = 0; d < layer; d++) { + NaiveScan << > >(d, dev_0, dev_1, oLength); + std::swap(dev_0, dev_1); + } + /*for (int i = 1; i <= nCeilLog; i++) + { + nOutputIndex ^= 1; + BlockNum = nLength / threadPerBlock + 1; + CudaScan << > >(i, dev_Data[nOutputIndex ^ 1], dev_Data[nOutputIndex], nLength); + }*/ + + timer().endGpuTimer(); + odata[0] = 0; + cudaMemcpy(odata + 1, dev_0, (n - 1)*sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy to host failed!"); + + cudaFree(dev_0); + cudaFree(dev_1); + + + /*odata[0] = 0; + cudaMemcpy(odata + 1, dev_Data[nOutputIndex], sizeof(int) * (n - 1), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy to host failed!"); + + cudaFree(dev_Data[0]); + cudaFree(dev_Data[1]);*/ + + + } + } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..707b515 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -7,22 +7,26 @@ #include "thrust.h" namespace StreamCompaction { - namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); - timer().endGpuTimer(); - } - } + namespace Thrust { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + void scan(int n, int *odata, const int *idata) { + thrust::device_vector dev_in(idata, idata + n); + thrust::device_vector dev_out(odata, odata + n); + timer().startGpuTimer(); + thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin()); + // TODO use `thrust::exclusive_scan` + // example: for device_vectors dv_in and dv_out: + // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + timer().endGpuTimer(); + thrust::copy(dev_out.begin(), dev_out.end(), odata); + } + } } From f24030528bcee0eb6f80639a001e11ba94c7e037 Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:31:22 -0400 Subject: [PATCH 2/8] Readme Updated --- README.md | 101 ++++++++++++++++++++++++++++++++- img/ScanComparison.png | Bin 0 -> 17585 bytes img/blocksizeComparison.png | Bin 0 -> 8994 bytes src/main.cpp | 2 +- stream_compaction/cpu.cu | 8 +-- stream_compaction/efficient.cu | 17 +++--- stream_compaction/naive.cu | 48 +--------------- 7 files changed, 113 insertions(+), 63 deletions(-) create mode 100644 img/ScanComparison.png create mode 100644 img/blocksizeComparison.png diff --git a/README.md b/README.md index 1d179f6..8da03f5 100644 --- a/README.md +++ b/README.md @@ -6,11 +6,108 @@ CUDA Stream Compaction * (TODO) Yi Guo * Tested on: Windows 8.1, Intel(R) Core(TM)i5-4200M CPU @ 2.50GHz 8GB, NVIDIA GeForce 840M (Personal Notebook) +##Description. +In this project, I implemented the parallel computing algorithm of streaming compaction. For more details, see `INSTRUCTION.md`. + ## ScreenShot These are the test results of all the method I implemented. -![](./img/result1.jpg); +![](./img/result1.png); + +![](./img/result2.png); + +##Performance Analysis +* *Block Size* +I compare the time cost of scan function under different block size value. The result is shown as the graph below. + +![](./img/blocksizeComparison.png); + +It seems that there is no great difference when the block size value is changed. But there is one thing we need to do. That is when we sweep up or sweep down the array, we should change the value of block size for each loop. Since we don't need to deal with all the elements in the array in each loop, we should adjust the block size for each loop to avoid the waste of computation resource.So it should be something like: + + for (int d = 0; d < layer; d++) + { + int nodeNum = 1 << (layer - 1 - d); + blocknum = nodeNum / threadPerBlock + 1; + KernUpSweep << > >(d, dev_Scatter, nodeNum); + } + +instead of: + + blocknum = oLength / threadPerBlock + 1; + for (int d = 0; d < layer; d++) + { + int nodeNum = 1 << (layer - 1 - d); + KernUpSweep << > >(d, dev_Scatter, nodeNum); + } + +* *Efficiency of different scan method* +I compare the efficiency of different scan method and make a plot below. +![](./img/ScanComparison.png); + +As the plot shows, when the size of array is not very huge, `cpu scan` will be a little faster than all the other methods run on GPU. But when the size of array is very huge, `efficient scan` on GPU will be much faster than `cpu scan`. From algorithm perspective, GPU scan should always be much faster than cpu scan. The time complexity of `cpu scan` should be O(n) or more, but on GPU it can be reduced to O(logn). But from architecture perspective, GPU will produce greater latency when we access the data in the global memory(I save the data in the global memory in this project. It can be optimized by using sharing memory). When we want to deal with a huge amount of data using GPU, the massive parallel computing will "hide" the feedback of data access latency. But when we only want to deal with a limited amount of data, GPU has no obvious advantage ,or even less efficient, compared to CPU. + +But there is an another thing I don't quite understand. That is the `naive scan` takes the most time when the size of array is very huge. I think `naive scan` should be more efficient than cpu scan, but I don't know what's going on here. + +* *Thrust scan* +As the plot above shows, `thrust::scan` is more efficient than the scan methods we implemented on GPU. I think there may be 2 reasons. One is that `thrust::scan` function may use the share memory to store the data and access the data from share memory instead of from global memory. In this way, it will produce less latency because it visits the global memory less times. The other is that `thrust::scan` may make some optimizations on the binary search algorithm. The best proof of this is that the time cost of `thrust::scan` will be much less when the size of the array is not the power of 2, which means when the size value is the power of 2, it is probably the worst case for its algorithm. + +* *Test Result* + + **************** + ** SCAN TESTS ** + **************** + [ 34 28 17 4 6 42 43 24 15 44 27 19 13 ... 43 0 ] + ==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 34 62 79 83 89 131 174 198 213 257 284 303 ... 24338 24381 ] + ==== cpu scan, non-power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 34 62 79 83 89 131 174 198 213 257 284 303 ... 24197 24245 ] + passed + ==== naive scan, power-of-two ==== + elapsed time: 0.057184ms (CUDA Measured) + passed + ==== naive scan, non-power-of-two ==== + elapsed time: 0.057216ms (CUDA Measured) + passed + ==== work-efficient scan, power-of-two ==== + elapsed time: 0.157728ms (CUDA Measured) + passed + ==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.153376ms (CUDA Measured) + passed + ==== thrust scan, power-of-two ==== + elapsed time: 0.156192ms (CUDA Measured) + passed + ==== thrust scan, non-power-of-two ==== + elapsed time: 0.023776ms (CUDA Measured) + passed + + ***************************** + ** STREAM COMPACTION TESTS ** + ***************************** + [ 2 0 1 2 2 0 1 2 1 0 3 1 1 ... 3 0 ] + ==== cpu compact without scan, power-of-two ==== + elapsed time: 0.003695ms (std::chrono Measured) + [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 1 3 ] + passed + ==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.004105ms (std::chrono Measured) + [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 2 2 ] + passed + ==== cpu compact with scan ==== + elapsed time: 0.009853ms (std::chrono Measured) + [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 1 3 ] + passed + ==== work-efficient compact, power-of-two ==== + elapsed time: 0.212384ms (CUDA Measured) + passed + ==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.219104ms (CUDA Measured) + passed + + + -![](./img/result2.jpg); diff --git a/img/ScanComparison.png b/img/ScanComparison.png new file mode 100644 index 0000000000000000000000000000000000000000..7f48b46746c4a515a661617f46f8ea03437f2fd3 GIT binary patch literal 17585 zcmb`vby$^O*DkuS2_>ZkX^~izAdPe*A>ECnbc0HF2`u=rkW{2Q79~h`H%NDP@40y2 z_dDPDXYcDfXaAvIES@!E%rWk9k9*9=;P>(pPtl0cAP~q?DM?W!2;^QD1cDOwGoF7$&O-PtpGzjZA!4vEUQvK8@A>W09UY-R! z!TR$Rf^tuvtLAj>Oix!^g{%MR_g@#EcUOlga&2Y`)YRrKH$OU#Ieh{PZO}QzhjjOH zJb%W1A7V;p{cwy0(k-17tj=gkTQE=oJ`-tzM2V}ZsV(%h%X)cvnQ}rh2WMuOtRJG@ zhd|CE{ZJr~D~TpB)D4vZ`1>}R1p@KCl_Z2f8hxlSAdtU;IFBHZJ@Ws<G@ifN$Jzds;b!}8rs^^gM)_BW@Et21@JF}JcB)`{BY660=d zeMm!MW!f!&(vtsAZDfU|qGrACa|(`&%(%F?L<_;Cn~q3NiKfO{spibxpL_?!DV^7r zgjrcx?yfYA^z25?aKGrER2FSTLshI?^iIIt?b_`V{j~Fpf?;mF5M_Ql7NKu(gy-mPN zA|6YUv3{k~fjNwRCd?bQgf^&T`EV-IU9wuWpQei8TY&Cb6=(L&b7C3t`owE{1Z#qo z*1?9~$;S?lxPzh})21{YbkF%N6Sdtc{G#m-S7!C-wHq_y$Y`dHHBacBC-6)rcHUp; zC_9GFykO^#joZ8*+lTc=t3B#Xr?y6wnU4u79A~|v$4JQ<-v1#}j<91Z=Uc})tmI2k zAanH>xNu_=I4#UUm<58$x|m)%y{DkxRA%nEMX*;_R;=)5)zjP`e=58tFVHO>n6OWu zs3->=;SQE`y*7@Qb+7p_&-`vP5{L|?F3AlxuL2u*ZK?A(Gz6*ST@i<9X#2YZUpi~FkN zC-apwUzPElf1g-3f2dKl`X|cANhy_4)LjeO(I za!F*$aiW)!)@)a_k%Nt$9<30g2q5+)B+r7MY z8@N|oKM(v`&P{eupp%~#o?*;2KEsEJ7=!biKzf~L+Oyt(V@tt1kxW{L#b9PI6p(rbvIFBgSA5Fw!ir((cv?i>oAN(1!f9P) z{vA$y&$TdtDDm+})i1t!pLooMcs*h(lj#T_%{sR`C z{LtSq*I_q_#{ZB;vI;>MRt8mj?RYWyi4Xn&N&mhcZXXYWYbVoiF`T^A_aWBy5q~@h@XYG*F4y3T&hO zw{q%JhooH4{h5crp7vGjXbLsm(40_(>0;NlU7jDPGY#u9xrxEiI!h@pe;QxF)z>l+ z8~*&#YSuDa!qYu6v<@aPP35)af;dDrzFok077W;T>|4bjdD`UNHyC~IWC~tR^YAb` z#z)Bw;<)JG%uGkgwKp0fA*@0!A-)~2#~7-YY_a+=zm7CN*2Xfi;!j2732~tpL9SLh z39V)7_OGuxU*vzqh3CcK&!v|YMp&=)d zVW{$`ERtCNWgR@D?{a8LWm(3bE<0IDS#7D^NMwi^2ZPg}vt%s!_t$Le>J3x?kj9BY11}4+@V&YJ@JMqjuVxPGr-<;jR;N;d$h2Zi~xUNI`Av6^$GF6gdt1(WfT za`_n~R`)-5@n3WBNCt;>u@1L&Mh~wljxv)_QPtoR^<7~UKkk(6i}NHLiXZQZby{EP zXl;*J!+)*yuGI`(M7VM^K9gV2x+>Zp7AMSYsF~iTA-(k7!o;mgAkyB-1V^0^?6_yU zS$k4mYRXuPIibQ*T4s8M%{sW^N5|Gf$ds=;HS`_bx3V9aKS^va@5`LDoIF4uhw9{L zEtFWE7J291nV$I3ta!|-JW_CABY z2Og+xG|T93MYcY(4YAXZx5`Q@*}(g&h405^`Sa|Mq0MM&#hhG~rTvb%(4-`xe{PIk zK26h?EkLRc!Fhx)OiX3**gBL}Cy7I8&~IIvDB8J)(~NCwo6TRKe>)i%VN`D}rZea& zmnnlLmPdw+0dqKex*l~XKvt zE|Q7I=07#o=2T*6-!E>HQN^XBDV^)%0_`yhXZ;b)CBPb96`wYr2$09o5J9w z;M(KXu79A5C=h$1{{vV0A5yFRg$2VpCRZb)jQDX9e?a9dQ6Z4Tge*73v zwIFH?qV=`4sHo_l4=94LpwT2);U;CO5jO)Qc=v?A)-@Zu| zMPc6z;zE2MYHgrql$EI%`ei9nonPicAo{vh0soh%yOUE>S&^|m zJ&ui%gs3*ZuxMy#g8L+^d{H2;@D>*rpFe-Tyu3WP3`_R~_TWPmDe;fPBOs6&W)XAP z*xclO1QB{>X>K0nN}=nq{YII?XK)S%hu)_Sd%$JT9@OI!!y??}jw~mJvgMLkczAi` z6z?W#ic(Poba=6)r6oK(oaY%t2wvm5pWYe@jzp5$Q2KtwFmAwuC)cs)Bdjg5d)6Ny zLhr^Kd>UxOva@NG@%Rb=HGkipK45NQl0_RfJFBZq%!j1q{#$>7afzWHKYo-|e1+uf zcqKJ8JaUBA*4Ejt@R0N!ZFFz1s5LYsB&4{QRhbwK`K^bW{b?deEY3UA79mZ-fct-E zX)!E*O2TR0^wwqPANr^0z=3_6hDS!S_S1j;k{%5BuaCyZf1+k&Wwqe>2LZ1ULMf@_ z!4(k^$>tEfo9w-Z2ck9fj{n;s3yD`k^K%>BUy+Zdt?%xRxfVNbPxjk5t-M3d+Hbt9 zYDRYVMQILVol2V&+AuhrOqrPHZk5Z!Im*OPPMc}-kdn{HA>MV2=J)d5Z8IV|8atc- z99Kq0Mm7ftI7~4yF+lK54L2nEGeu#P5Xiyr`R2ek27W%d?2#Z=KzTwfuA-(k<(Ect zP$+*75(S;sN=e~%XpJ*P9b>_n_EY2(11{4}wc3{o7V3R{v1f~1sH7ckK$xr-U(9|lS zh+@Gpy^Q-KM*_ZHl6Pkkq+1eIl1Jd})1^Xu`4qLZa#5)y+zY@`r=i(sUaHgvYt#h5 zhHexDz+k`MfWthP{e1s1hr!lFX&asi3G$Gx;;AZm4gm&1qUPjN1@iYPe+W#|-Tmmi zHJ*mL*d9%l{pxWP7^mH++98)66KPtvyz=r7xZ+-3b@ZK2k@i1O-(x#R@Q=<0=YUL`@Mf3{edhv`UdYh+OVA+8|5S;wt&+hVL*z|#Kc4s>F-O#>&YAj{|IL%O9j&+gd4ZjK_WhhrdU#%j;t=*}9dc*4KZ|>KuRN^Tuo9z4IVDOhe!XqM7RaDHI2#JVTvkvjL1*aOW zR&FkBVxkN}?)~eL7u@!ui!I<7We2XpYiqO@{4tsZZv*Jj2$84ok80hTR*KxS~ef`WL~RYX*2W1)`?T|vZ~s*_`)*#l`GPz?8o5z9vlXwZ;Ki&F=G=h zAMM=QwG*&BKn_T{JiC$C&@dO*tE|eb{{-n++`lq1L>9%Yt@G)Veh?9`@SMU1-{09l zF|=U>R_FU^J%?9TQT@(A(trPc7fxU~R`7zHoGa@Pn}meSb&o2Eg*IX|d9dj2TVmi_ z6W=a=lsch^+^5vem|WHeQtX{Cg3Yg#kw=UrKO+JA0%P$Nkl<`;*Wk(b9sgFNy(A2T)AkGd)nrTyU018#YwlmnfVn`5!dmL;Y$S4S|$o z<>W}=w%%M{$_;pqqZ0zZ>0Ee-0pcit8npvg6$F#$Q+8JWpD2(ga@05)-TA%K4UJJh z4Wvu`ERlhV=9ZDWZ#b0g@7dxA{BGr5YeuE?BukbCGR$78S`qrZfV5yW{?ej3&U9-KH+9VKQ`U`za-8+u?`>Htd6M613VNX!S&hfhdaRo5>)kkC zZ7rIz>EAtIWn~5Yz{=J2B#0$cMqwuU=3eW2WzSWoV1qsnIVq5^ArRvULmQ{~frnx0 z)ZTwj<(kY9GkA}}kVz)WX z18K*cxX@=e1`CTZ|MKG`NopkxjU3d)g_h844t6yKGD#nl440H(f~RA2H{y7Raw1IolXEd(HF#6A&)5qGk2NueN1#Z+B>swisCrCWjU{eAl zXvCA+GA_!17XO!H2rSXxszIx5XMJG|a3bkq^*lf~aDiNzK;P579AX0eDYr{5x^1FO zu6Eo>#9YvKzBtN~aFED@nMu-Y%cH5k3)9pPo0`TPLt1K#(#fi+h(P{o66lyQOlPWQ z13Q&N+Ko|?1FjiTgl1%%{%MDF1`pm*hU!^He`h&;_zJK#>Fq3ZBE5FG3lxatkF>OR zxUFd-fwGFGgL;i%2jD2Kyc5B+^LHs_6wFO->B4-AR_FTio73RQv~X+r`>N%5zRr^i z;a?LiArTSJG%_L(BgWyHYr2(1$y_7~O+`7vXv#Jq_lFk$1t_Z5ShbryOu)iy!?G_N zVCMsAri$`2@Nn7QDvuqNO!IMEkJt>vu28PzD}Gu0-4b)4Jsl=QUXUAg0e`9tFN%>R zlLryxx3V^nK@6?1o-z$#GlJ^32hrP$K^1Nx;%!{A2JsJKSU4U($)R`DZfD%CU7ojb zmxMu#VK%!qMt)J)B!+%ENmmRbW&O(&Va9Mj12yW!jsZFiJ-sn?qn$`OLomCJb^ z`X||zb3grFa?)WzVxd zmA|-}?E_gn`kKu$g&)JxG>^)FPOB&qZ$t|+?=S1A3X7)1vHV_f&VqhdRuF~R?Z>05 zvbE#4zZq{J0TMg37>ay{k^BU$4lf=;e62o#jLFGqDm?_No#wK;1yog=->orDa|csj zyE6^_ggb?WFExNM3`!sjfvDGt1U|E9GB-D85hEhZ_wpP?CmiV7_abN6Ul>=DGq7o2 zM}pID_3`}PLG8x3pCW?^JZ4EEF_N05L8%IPi4pUfLszo5{YusBw!q}`aiL{iOU~eqw_#PSteA!q7@C{6 z$7MTY?a+r4B*1H$@icT4l3a^71J1rJX0zN8z$ft}(4-``x;YCgD^~#EIZ*wu)X-h; zNGU?@!~2AHtpDEd=%{Ivn46oCke>}CK+@kyb7@(5UF&tbs?2%DhDisYt$M4J*nNJ&Wz;1RKkK_Hont)Y{XlMnxd z0}bW znIuTF3u`ZEb*~O89ZnvzFye@R#G#+|`<*g}0jBXWiTPMc64W%PrY}zu9 zy0TVpj`2(ZD#v(-7Y-X68!=Jun)YI>C>(3c%7!EE%fNymDC`XP4#<=c}>?@RtUzQopcxbd{8pbae6;8%up7s8r$Rs8Vb+d2P)a$=aS z=DCI(p_(&~)Iz_6Ty0der%03JyW1xspq>_bm;(8tEAW?Ba8Zm;V4y~9v86FY%h0Gf z`{>GnR+mFNby3ha8VB{hko!A2-|q>SXI*oy6wLfifz`f>w`KG-?gQs2VYQwJ8PHDC z?MHpNwtQb8hHu`sHu0@y>cTzo%LKXF_CWGR$zVT}mv!H9h5EZ5Us1XVEBQz63ujjO zvKGz!K_|N4N^SzH`mZ7C?T}P$xv|j~R{R9`FWFU4wJ$4oe3ebwJ~&(D@`L}jBt|%$ z?$OCf5>Y6C`>;^~n-KYLiLS+${J~SvOH4SAe2>^pUNFlt9?J+2GT(lt`kShJr+iDl7M$ksN{Q!f%VLiq6$&^w+6|&9t6H=JNTSO zLPUg0sT(vuqIam<=z3^Gt6O%JL*JnV7}R=CG(?i{l@*BEH*em&`xBZ=hv^E{cT9k} z4%m5lF14%V^MRrwN;d%YCAWxq5esf|DyP zm^lz!gUPn{`q4>DDeoWZ2w>e)sD10O5(b-p?17TubKVpCb(cD1_9i5l;8Gy!2S~?& zLq_RNKKMTNdd2itc8oI^{vN=kvn<<>O@wYvZieoAPl3a?PXVY0JX{iLeUF90P(J(P zYl{R-Kl(LS>}~HU-NvAkR5|V9@!NN}&`BGvZwH{uDdcW}>q{rn%@1STKKR@(N*N90 zj7J2}KY)&~Gb6)up=IavPg|ed2l@v{Yh=km;25l@#wBH4kDDm=4W}xsKewXqJ`7`M z`vC%KQ%X7KB|xf7E{S`8r+&R6w5i2vi+I7#?r8$m*Vetg{&NK2wbuN6$d!@pLzd9L zAAM=?oyB!Q3{Yp-eC|3r;pq(Z;yxTWnR%6f_jy!~3R3PFOjf&BwOuha`(}BO{I-f- z+Dy0Wu_u9iEmp#q&zyGSPj&oA40#Yl!5*^V_k^1y(|}yEMoh$&)rb@CcJCDvQypm8 z)*t8F@?Y&Wvz2@tdChJl4}+5L=`i9?6&nSj61-~-HNTy>S>{pzeCXr@F7(9fC0U)- z>*rWF;%GRG#3Up=4}39jgluhhV-*C>+kCFtVqOc}>__TV+tUz--;C&8f0Lc{zTCFZ zdk;+iKy<}>)cqZohRd`-vKUu`m6kh^&9>EkWFhv-{2d)cr(S~yD`av~6PLNOYja~m)&jzVSvzq0SK^JX_Sycb zLiy+);+tR~=*92Cq}_xscn{NiAgIk<*GQ?TQ7Id>QT726X~@!a>Vm?bAptI&;A`Y; zQk9IyruE7_@xz{nCwIvd%Duf6NX14O%kzxU8QQ{YQKghG9oAG}z^|!oI zr3h2zIKQD0{67d8q|Ug~8rs?_PS1LHkEO!0Je)wySs>w+`y~UgOs|YM?NN3=TcO;wWWnRBLW-1}H?UEqiTY_I=;*z<%!U zVx>hbA>*~mLKjutCXUZo&OS&8=$DsN%t(-weefhvgXnK+C;?RU!v~n1qOSMRnj$+D zVR8GP{!*8ZwS+)P7(j2sZFp9{`3M)9WUGtj<86SX68<^4xw$1JN~K6OtM4CfBhv@1 zvpnShpw#AOpObUPoQ=qb6I==srmk@w$3Fsf;CG0Vqa%f)CdR@d8q=<9H+q``%B=<0x6`|FfWgjF^~>{T};3O-pNN^u}YH1ct4; z9%5VB%)k z*jJc^le5BNh!*L;r`uBmYn4hLtPGC^m0iqh3DrNzl%2ZPZaafxT+_H`vZy-yRKPu<*RCIqTS!ZwsZdGb$}Zu7b~bW0?LjU!{WxD6?Gb_c#HE3n$1Zk8JD92OE@G zarwQ96tR|aD=Ij9@PT6eQD48Y`-%F}5rv0Q>3l*xQNw%g7tjb#_w_60vuWg*SGb`q zC%bdDNcJfw*R!_v6Gl)3fR0Fytol9wVg;4V`rW9K`NhS>KY#wrvy!?cVu37WOH!9O ztjZ$|QfWQK$;xW5@{n0WJW;K>9v_BBzlC4t70&6#I?ZB49*m zgmVqOt)ay=5@yrR#y>k$zI+7;tEs68K$Z%muJ#vZz5f&ox>Hb5Ooz-k0g7;a(B7WZ z0NY3Wy+0*K(=FPx_nMiR*1(xd+$i*nz8#k z@?&jP(SbvThhX|XWyPV7fX~^F?7_j8|85{9J|AO&yd(e;Pa<%?`cI3kST-#?A^#21 zh%K*ctKBJ@tG{LAC{Y-6hX5C!cplmU5{z8cE9hK}ljW02!&Cz{9#B~kLm!b{wJ#;R z05?&l$$n;vl$4gzgM))X6^3;1+5Ud4ru=b!@jDa~IZAMmYYSbRV|&BG3xe$BYFJp< z@bK{6T6}JA_Pmc4l8SY5t!t}lhN|5MoOv72;i$+5D={!Iswxd_EiKvEg&mK#bxyjd zj7VmUardV>qeG98u+0tpfHd{nyJh}d3;W~Zsp)&*RqySdp~V_ z=XU5%Xe&wecKbc#!xMPHqM(z7v}s{(F8Zxj&W7~$5HP6sSz5;Xr%ykqsU?F!ywCOw*r^<&3pfv#qsgxfjE8Jtt`s6J^*1FBj%?0H zM|;qUly55g9lKj~P7^k~p7Zs3Nzy3D>a65@Wu?nZjZ+d6sH}kc)E(9jv}mMMs@s_O z(|(}^=tJRG6gh@H5e}yUQztNwYzUa+Rp39KAdD3Bsn*pj; zk;NAE^M=F-0+86=*z1~&Nq%VN?NY1s*}1&rGiZ9I=f2jcFzPZ|sl7!GXTd}#0}+Px zD?=jkV+9sC7y_s+2Ybo9eG41FqCW2Z^y@BD=K>y&dFdYb&U0u3zsoOIE4jAOYXbL+ zww~rDBu;tINVo;52sXbBk@wr6;Q%Pb8IAg2p zHUE%*1TwieNnT&4jFa4sBfT~V?<|KJvKyzV-=nAg9EXVGhOH2EV^sWBwPi4?*OXYRx?e2{+ztsS^k`?F!Rdv}eFqPZwTdYhM_@+m#Q8{+8-+n8=(Z_wi0PsNDV{~)hkS~?ZBspH9ySMJ zN@OUq^UC1pk-04#;y@pfUtM=>@3QS1?p`+C>g+Kr4rEu{d>El?&~oV zh!D>Yh17UHgP4|Ks1q5F<12*E0DU@vLb#2Fg-?0u?PDL;@{KXje7>^fr9}Py!{@4} zSLT?BAeW>s!CK&OJSoB>)%wzXJuUsx?3&D?^Bq~quZyJ}DzAjXrG--?f?z+^TyU$C z@g{@`vJ{<`miB_zInkI^j5e#kyEkC`&)V6IlStpJ#g~?HgY<*cBtRPhil4|jX!|8D zj#aUf?zm^3({1vSJR>mu1=_+&49v_nR^3O=b>Kc~8$@5~qpj_!F|rj7n%(TYB=S3x zd%YVC>$mf%Rlg1Y29gtd@90HZUt6Q%t8I8^{W#$w=DIENRNgH8U>hG#{@QKj?_9z3 z?Kkr6ONG;(Dyi{%0S#3irvwSR*B3?&wU<90@A#yhw=FZ76u!`H$am^0s7!iRku2A( z^`GuN9<}Zs6Ar}x-E0IGIZ58c&~_eg_}Z&;*TYkR(nXKk z<&j+>>=b@L;F{oKlDIrWk-KE zIVGW2w@PU*fo26+AE~OwpQUfS5q%0dr{tD(wV6tpf62e56UiZid0}^64E%m<>n$C& zX#DiV8_L{%r>B^hAaN%fvzmaq^y^oOYwd5B30Ne=z5KU^*CCJfYFBsRmm72W*L^Ny zuFhul&2&2xRH$213KgBK!w*yug>AGfr4QR70DKN2^$ZLRnzv~@ZVtU{r_Y;fZ1cuT zwu&dougaXa<1;R3ZwA{xlm>_5uIUNeVN;RMU5N(gDlVk(npQ1>~(tVk#!i)HL>&E$NBHG zgZRl8aG&ck$1>-iA6+G8uczM426at+a)~0Fd=cYuL5hkC3lh(eh#=KKIOm^#MK7-S zm2;lBmA#54y5GsQtnbe0gz|M2O5uI8F~33Jt#~T5U%%^fv%NdVv)7ENa!j2b4RGCod}rLzwc2f%dPtFLW*vd=sX0%12hO})nz9tZS^J*1*ez_?atB3Z6km!Pn=x~Nds?)+nuOxwq+=BJzGt?PJXJ_keW_V&5vjy;@ zV|)ehgTtRiYEgNWT~lm~IK>ix^gGTN@~h%nV!N00RRT7}!1PX)^{we-zZ$Sz6hi zn{O{RH=Iq$2wqQTN!;k4wKcGpY;YCU?vX4$znvQSr58FnD^#RT36@|_@SE}_XnD{Fzg1` zp5(?(-4t_)WRgC?_YVbS_bL|SJ*WMJmY!rT5dB%j#h)Ajf*|28b0}s#c7r`1+2T!>%st=<@z{*` zFL8naQ$IR70#`LI#(^G%NpAD;aHPhd}rXBx11_iPuo^&(QQY-%?hz_Ujsp4>!rGS zad27(`sFl*{n7fs=#NKQ@Y`wksGJOR$Ef{?S=}QD;eIh zx|<2y^2#cazR@{{z_glMFXOhtDT;&Er*N=Atf}+dVx5NPQ$8|VYif&TWD$DmxDCUE z7l zn72+_N4WbpJ^KSEu16)@?}b_=33Y)a18zC-FjD}K5NX2>Q2#U4iTS3-eEIUDY1sjr zZ27@Z^S*7%+)d|^-BH9@GI=s9=t(SI3b62Rc}G_I_sp^=-WI?o=j;Nxa&7DbFAlUS z>#wAERq}5ZLi?^7oL#*~xX855`CY0!T4HzX#xOe@R7!fErCwe#!DsnHn*rd{Achdj zsU8uiB0^vzE@G8e?#7o55gb>6Yo8k;>g(7>$FNe>*G^SFtAG`Q{>I6me;Bwcx#=pj z;ZlFnvRk)fVNFCB-!4={IJ4h3-3Ps>yp_4uUThkjLsT8RcvN1!J6-XaLM$s>R(QJF zbu|!8Ud+$e^l4A*!`ewwIYpE|8PCnhV+N83lX^`)8^A5>KHJOMjk0j@1-Ddcr&TBv zsn#&1>r1YlDan{#d(e$-o_GLwz!7s8swvBrZ^_&ay+WTvFOiu@8 zRxKMAbrU*BpLyFMRrxzTX-m4|oQ>GlL9bHl&|a_j2V{jnK?bKJt2+oARO z=nsYQ0?HPzRSaPEqf7_x})h_sNm@aR96z?xk>VC-EqzHrix=IO@JE1 zhFk27w^j~m;z_Ij@?kLFyI<)0Fe*O#A0HxxDdr+TKX-d@JM zs7XD~JrukpcXnNELGklfsM@o^^^j`|s6mME@W6n< z=!=U*c5f-vGJ3Ia(Z11X|?)j#~CP z5sj529#g+PP)c}H^VHp5%5Ld>S47uM&C~D~s28MB-#TqNHMc^HMJq+cr$^~t(qoqA zslnD8+g}@VnQYr`AYAdkdncR2`B%#HTu4q=^s=tsl%*dVrAh**z$`z z6stA-Q^IdYv`ZE%Bt9uz*i9xaanf|Fif39WgJ<8{(9s=X#L2M@p&XKaBuco<2%-6M0BpSqQy6MZ=xIJAbUkz*!yH?8Hg8iH>huI8P%G>qxCUD5z{+~+}539bFx zezT&|^yGZPg_m2+d^?YSxMVDN^g7AznQ*7yJUr^NJ!bt>xu3%d!a5?xTGEZ}fhgSM zMThieLJ1Tz|2WpGk`A87y7WU}8P_^B;xE$YdReB1I~~a7?^LsTtUBMf%T00uW;LOduTr=)pDA1U*PwSNox^X{}UWL%=)l3-9YCR z4;*px&GM~{>BAl}@N3PO)MY!k`@|5^0Z5spAJ*am^;=e&a8WpH##bJ}j&$?1PY*FCB*YCh_?{Q=)} zbG3fMpX{n1?ygZ^Ai2&CzjAzlyl?vd_w`FZ>D}P(m4WR%6Br(`5#PC_Y@el{E}+EU z!+BJcS?#h^;zAX>@4FMo;UZDsWoJGQ?zHQuJwX+TlV`;hN(1tmOv0?o!0e1->$ap2 zs?iR!s_JWp89B7LysjP7x2Pg)+UDn|?chpwk>mP940i00n3x!DoPeYH7$zJd>Nu_o zn>bWDgP{Hc+$bk%4E^<(0_Npqv~|n!_c!0r;+?}UGy;;C(5f)YoVt(jE`!6eNERbr zdCe7~$N5syu;olhuF-b9c2tt(e^t!v^4vL+Q77~|_X%v17DHoTxCv)^xrzaY-bD|e z2D4H^sB%Jj7U%o>Xck|Z9%ny?(>Au87e%e;THGc+Ib9sM$|-f*!K+mP6wA)>2sEN- zsmj#&Z_>f0=oFD9-?EZM9e?#Z4~^}Ctm$oynTUy)&WbJqn-uN~H3ZuH$sGLEOTOgQl1+PMf3xIo zjH5FLahNL3$;)~#u;GD->Feik9cO}jc$`2*7I9bIuFm>8jJ+%`z58lJyW>SZ-elg( z=^z%ZrhXo>$sV&9ma_ViDWlwuAVyJiCI!)`B%T~p5%h0XMUU{bBqW3JH>wII9}&BB z1Pi@uYK#$Wg6K;&m?MnATsKOz)_puA>y|TQGdf@RI8APov~4OWteX<8E#HVx(>+?& z$n*M3I>wS2(XHgnq@*{-j9xafrA^_lSzyY0PiP)o$CgQObmBxa?|XgQS>+ftX-Vr+ zX;VJYZkQDTvtteqio4Eb6m)|r$z86Op*HK(cMnR?!)RjZJskTAs*8ohjd`7>XY}aH z0vR*{P(@6uNigwPkiI8%YB3=&d-7%QO_G~(*?V3c1`{9plwufvDFBP;e<=L6Cx^hR z86KI_LkO9)87L;q1o!LeXJvHut>!nV8a7@Oh-^0+H?>nuT>wQa=UTh{v_`2EEbA`&zAu`_ijuv!sx*d8&+XG{Xtid(?xXH&LN}= zT>X{o#d`3yOlTY*kW;eJ%Qz5gCa5x&z2dLC>-dEzkXO_HKfeL}zkO}xf5;5}`{RGV zf?_b89x!k=4Nc8ovFXLdikp%KPEO?!=s6RBH_$G8f@K|NCDC`9l~92jnmo)BjS&cN z-h))}+#18fpOOTM+~iKh*3ZAM#*pFnxh)s_W+de4NBrPdjVD!@;GH|oZPfks1P}Fo zC=FAD^_q#cCJzbiw1EDOU)t03Fg)3&s29VnIyZ7K7JQ*+-*r@elfGhi4&VCD&pN%| za`=yZ`d-A3RpZ$G7HSMPCUo{sa-Zu|uwQ{!!n&NZ;z?+&2o~@|CU631VwHXxYfnnQ z7<;60^Q}Tb27f5v5lCAnx)lR1r;;n3nWPWs6RJG>5F)w%K$}-nKLLy*4f& zg3aFwU;X9Y^OIms;PA-YLDb12_ahh-V2J9Yids63f-iG?;d zHs9J<*;X@a5I+O&Ors9rYuuDGB`~Qd?NlhU$;2;K+zhf*A2V6wFY8gnDoIGJY#yx- zfqua9jYfmF7r3v%tQ{G=m0|VECO@9Ry8l=Y7N^tyqLOA(mhPAQ3oeZ>5$N%^eS2^D$l*p0~qOw03$a(HwF7Or}T8|&jo>hOVv6|pCIaP}c<7J-q8qjEBIwfV! zU!Sma>4^8>Nf-S zLU!AB!N!RBd7q}ZPiDtaU!$|5Jr(w_G$^Tm_a|X&7SWBC2l}h=IG&C zL~~AGxhNS;k+*IZ$aE@JHSkYfj(Aueb=r()eWEC>lHo{-Hexf$vt;zfT%no9u2p{r_I8G0pPMe)s8DdSiku z4Qk`*o$of)OxPtZHRuP0GtjI{f$~uH6UF$9HqrL<%%re(BPY=bqo08sZX`vHQ7nJ& zTR3LE!Jhomo;kIkKB#H&#vr7iSf*$@p-ftnGSj*w^4o_c?^2p6i`Hyn!?eYcoEr3| z(q#}jD?7!5{2#L0L6(poqwI>?&v`?Qzf&L(r+l=SE^#pMiH?UZxh#c;fdVJRVvU}2mo-R;y`8#KFNSd_7M+zg89aivqT?QZ z42}w1`-ER2J44pTslSfAj_`jS%=J}4t!+!QzPZU>it_RsQ2Z$~> literal 0 HcmV?d00001 diff --git a/img/blocksizeComparison.png b/img/blocksizeComparison.png new file mode 100644 index 0000000000000000000000000000000000000000..58a8e1b330778bc1e4ca9f571e3a02401142a98d GIT binary patch literal 8994 zcmcI}byQp1x^HTfQk>!%q_k-97A@`$#R6?1cyV`jDZwE{3lw)RQmj~u1cH>{t{Vy3 z;6>lc-uJwF&mV8Rcg}rxjggVD=3HyeZ+`asecy`E(oiJ8qrn4#Km^K4aylT;?Ew(z zmKyGDU_@&>02lbX<)))31F9UNLjyN=t)v@~)&O5}hj{Edr{Ml4LUNToP|9*J3)*EU0fzF7`s0Pl(`AUqUPlY0K zmQedcJh2Vk0P?Q6sPGBRZH1E=DzTxSj+z~w55K;6zeQW#y7Q2Vx2d@*&2K1Eci!~S z*~+i^S}I9FzeqzOqq4-!!{g@|W58XI+h=i~17s5CU>60sum$K@81JJG_#n3oF)s~+ z$N?z7t<{|lx{aS%lWXtcw?Lp$CP3U^4qR#wC>u`(2LvkqH;2PIxEtnbs;gZ(2}A@( z%QpVF&oDJNH>P3cHSe*CX?&WozcXE1%B2{`;Gygj&93p83LmGku!(054XXt2wtJaf z%snEHrF)T!cX5F9oQ6;uLpZe$w}k8>gXfu{ZyZb;ou%nN@hr9-lv>jehJ_t9QwKN* zw3;*XP1y+ASq*qyo?*&eeWD4;3Y$2lyJ3|%2UMyTXMZm>w~fL}XMWRwOKqsR6~kyg50w@$G3(3@Z@##(K#m-e{g_YZwF1} z={Ln#I@`#ZVq?h4aU1u=vu?&@MhRa%ryV*YJw(cWO_!cgsIk^IW+hYHWkpWcBH80; zLR=De*w=V6TfH&1-E-+{_cwo(4rcrVu1?{Q!m)m3Q!b`XHCElGs>N8d(EbUxP1%%| zQk;m*l`%5uE5|;pYuV4*Kf2mN`gp@Z$v$?1m~rM^lCD`2g#mqqv}c{(xiR8Y#(X1W z-S-uTZvgj`(ma<6?tm9a_%i*Wg1ldG-_AJ7touC;2G7E+v`U$KCCP@MHq=#g!Nz)g zOw@EQZ*<|DZDIz(o`1iLjKx#PG(!HN{l^_Hc}-&eFR+!7R8LxJ(*p=QIpr&iytrr% zamv^d9a9^fJ&6~)G@Fahpj9f*b|$U8b5mnyZ!`0;VJ3d%roBWg5#WSLh`2MV~c~@%v5|?}sR?1u_o5F`KFJ4)% z?@2Win2MOPS6)}J0IdH zyIPuE*2KF=bIMgQkv{Aw*XP~`CEIFfiP%KtZK!(KmKhesH9xw_rHp7L-EV>-sAgmE zbR3_k%r(d-Bv<9d8eHsKq1zJUL^*64Or;^<`BpW$_h)x|XQXTLV(AaF~*{Y$ug2-c1zwrOiJT*8WWt>%O4 zx9!j?E?@REkUXr($O$XcnNo=LV_&3A^OTLd_%xpaBL#W>9Q;(S-&Y4xix}e-UT%(Sfu&1o@t_SZM4v>HS#Ci58(lg49t80C3T2(GJeu#Jyb&ZHJw`4J? z->?knB0N)byE@>Kufp2ll0qd7Ei)ae1Tm2?nlLgR@%8YGu=CY1&EUCeyDk*>S~!-N z)jA^1)wg|`(!j%z5=fcsQG=IyP|Wr!UcJ6j8ky9kV6jmvYI*Oe$e(?$Du13 z)^w)9QS77h{dEj7_3Q6ORymIHn_?gKQ{YxN4g1<7A92yys5P;2G&tsRMN9BD4Ykxh zYL&r8Z2w|Q)H^+!LhBX&t-hL}k@$(ST2(#WRTV0dVE)RvW}K%Y6?hn!{-s%yhf7RR zsAWdEOyOJ|)IrFpL3aE4n_Pc0YnkpVxT|a?7k48&nCB_CA%_?g18w$8osjLF_^j>= z3C$Foba^{{8!Sx84l|b%sETogO;Gq4AokKF)*kEa#UD*5?f5(m+^>WELs3Xx0x0s^ zY5aW2P~~+^9c$%~3#ZsOQADhKsE>VbGu(Bw@ECESt={8@I5P8{o5-9DmD@;k7;CN@ zZjJQ$=xpz%K`z;;q8|sQlE%zok;-(poj* z4Jq-+IhxxM<+K#3#Y#BSyf9!ph!NH>QR6(T$mCh6&Z;G^gDi5-!jlj1y3@5HdRRs` z)Ap1|(uZlR#!ZBCK9;kN7gVh8dMX*@_Dyj{BmuGGM-Dijmb(Xv$18q@RyELy)9@Y z_L?*A?; z&;kVVjr#w|P~Oqv;{0 z#%Y5<9k)M3_l(4;CO%6PNQ&L-ZQBGb_G0Ht<|Fx!|7<8K;^Y1c@(%y8{&Y%S%n!9%ls> z6?`NCk6Hm(mb_=R8X+PqjFbb&wTc7*HUAZ?J6bk`x%+CLOAK+av z%EjxYA0~{SC2xjG7Tc~iYn`tD94OeoTK^h!v;VN+W~briC?Tc&x+6+)FX&qJRPrKp z|MS&~#7{}rbDi2L@Yo{YJsk@Hzn00_U0{xZ}YB&`yN!rnOms0sTpcrfCZxL@CS73{lad>)BjJQ7uo<{+%p!Iv%XFq)sgt7aL}N zr(F-7em^8`xOP7e=;IAK4LUs|!O2oDmjJ4XrMw27ptF(f{oiGYry1Vr8CN0<{;1cm z-uUmau%Ch3N0j@iR=A6<-oHws+Wq&`H(*3J9T~7=#a>XbS>xq*#!E%&NRq{iL3^+3 zl}o8h@AI!J-Qlg5^#|pZOQDlS&sATeJhj(T3Bm2#Uw&`|^y`qw8N{_;U*F7M4y$@yUba+`&mj9# z7W_5>+PEU*&uRIEMMSQi#~+lG2G;wXe~S0cR^A>Rd-eIfyl3B3|I(Kg&e+4qN&sb= zb6-@6yKi}HB$4r|wKr8P7fco{c8+wgJ(yy-Vi-S5UfYP%xVgJZOVqY0!f<{Sa9OY@ zA=Nan2zkTnOT+(&Xql7aktJ z_Rv>Vo5Am`QFJWmmN4HvmriC$N^b`Ahh_s6er2z&E|`n zFiC=DmFCV@KX-W`0Zn2a$9sG0S0BINumvmKnND(%hc!0kg7>Oyeu+sDxunR4oZ~Ql zy)P1azaj2E&)!v61J^Au)E$pnTlz&_%dU2s+GFeC@4Spou$#+q^nL5jR)pr_RYH-_ z-1YG2&E=B&**a0?_3XZ87;XE-QOeD3B8+$LcEEw;=AL^qRYn`Sch^d33ygRzs6wdIYQ+Ie4w zSc&@idOo>+#INIhr*=Vx>#c=d$|2d$a6X8a-}Q(Wm{Rwc6ymn+>tCqc(cFYeR}%Q zBGc!1zLyUOv%y|IMCND7@R81p_UDY?9-@MyrTquKQ>%;g<%p#l?F_?;CfISX&KkNYk0`vcL!2hZa zK%W33IcE$@xp_KAB(Iufx{{_NsX_n()ot1KY6X8f4HisMV~ndO2i$@wmqcn2;2m7! zqwpLm5|L+uZA;qwPq}4)jkELqg6xW)uN(-dJD0z30nb(7Qk%-cwronQ0MO;?M6hK@ z@@R*~#Kc(ThV(&QaX`Ue^9U=^TcVq`y;MUs(1oiALFn_hiPG3ThDYJx+IFQ3S6{9C zUn!!9RTnHqb2vS=QlbSEio#+nsFWG{ulJB%@j;^%SfmW0ld+6aZPf`IskcDE0SzJW z<;-t_3~}TS1xLnaYmZ1uz6n#|uRO4(?-o#Tms8i4S@VmQ1c8o_j(gCvl6+gjDFmEv zzq(kAoJf3I>HCl40Cyh2l;YcfF@pC_*r7XlXqKtdJl<1{Eo*_+Xi^1i!t`;+x}+yG zNCqal#_-co+opzV(%XN4gw|C^JIB{It|$xEDCaNoLy)@ZZHb_v z4cXfJzuwMQ=?!0)XUvm>4J#duq8k;M>GJzJXh}8gNMgiHp8I$+)_Iq6B&{bEV4ZKL ziUZ6dYB4#csae|Z^~qVzImdfsd_>5aHnLlt`iJ(*^wGwLx}`07aFXvbt7z18tV|D- z|HU0p$Bg#M|A4Ork-Dp>(*W}k5p5&Luqqh45tN*0NgpWEta$$Sm z&KBD0Ci(yfb7@t#MU-*66WrTLwww_#r?Z0$thhz!R(>DT?C=PDS#Vell3@Vw)?44K zz_Nmza!tuGm{>5&-b*Dt&!up6K!=v4F``p84f%bjE+}q7Z#|u zdg>rJdij781foBZz@AnGYvgRa&}dGBN^CypkQDRU`|OJ4YyVo-TVF3^RMe5w|N2>o zAGtnh^Pj21Pd)F!@I(cZ@@@jrEv zSbH>@l@#ve3Wx{zV1|r>p4L!!(?hk#2?8KcIlYIoy20cv#uP{rNB-xeKuA0rDG2lo zODkMMq|UU=wc@dwSSicKqoWP`MVq}Zg~?MD30PS%knHIIF6TxZ)7Hg{af!k4OJX2V z!KV$l3jJwimCc_~)eVsr5Twxc%u+ecUH0viTfPCV=UgGK3T{GD9{}2MzIu@7)f%{p zsI5)@1GwHE-v<`E!bk}l9Lk3O8-*WG`WXJo`e9-nqmVa&1T>+qDXSoU$d&n?VtsJr zdEA?Oe>)NzEc_wO>&s%Vm?xgaaA2dCURMOvBwRI{T0^8#*-3wmU`rg9zf9_mK(HAU z;PqKaT>k33Cbgl8Lu7vI_@~G`5`^Mg3$Ds5XAa1q z0dHyQ~I1cEgB48p|=UX6~pLvU)JO^fj;nfftu0-Lp zuDgE~(*0QSF?CO-$MqXP-uP2L-vYh)4>E^K-TzeoSGOjB0g-T2-s;(r(Biap_o8)fj<>Q`qgFT`w28@`#aOkW`&7mUstM(>b z>TTts65ewfjM1m$O=!leEs0ZN%s2Wmb9zRd+8+P)&CeKH78>mwMgQt3#CaoIQIf9$ z_tK#v%gV={0o0#cr+4+y?=Y!k?lZ+XBL|z3LZn3y#$tN+7E#$1nS2+KC~I3*8pYa` zosn>k^6L`$UPCuglGVEeAMi(czXXNpa$uaCF8NQ5?8(r0n+Y82Tjx`Ufnyud3P=G( zha`{c0D-D|#{H}LTS>UocJMR($k%o7OjEND@IUMotc!lgrLw!nTCqDip+zRtSsfG3 z3fQ2A+vKlf$U^OSM6%iaR5#)@KON<}Hjt>ZT3Vb)L!$DZrfK_7^N&F)BA~74VisuE zank^?-AjYRd!^-H-s4i+YZSdZiJ~uhI>NfFY63aNr55g|GB}R1%WAAd%Z;G_bW`JA z|G_X)LP{^I^{wgIIg!ZZJ1b2ztMbnTerPf3+=sbad8G{gP(R)Z_iJy1Z8y*q)YK@x#j8>R4sk!0(@9Z<;&}vCY?2j!@v7x61%j_?+pAT?^iVZBfpmEmYkOwiPtL0z$~U5 z-ci-$stuM4k%-abM4?+dkN(~N-57B?!t`9nImr>V zJ5#hWILj`>bZcwWZ7NlO zk2p3WAi%uu{EntYMA`$^bsHqVzV6)E6Au7-Z)QI31}mfrRO{uYsd23=>|}Vb zI3S9JpT3bu%0BW5N{Nd!#kKrxz!&`^Q0<}Aq`=4|&W*xtA zo5oQXAkY{?lct$H9x30+ZIM86)dzkwZRd#n%guCSw`PHv4P0tbXk6o}=C>ey)maCZ z)o|fuq|x9zGIV%i6)y?N#eFQ$|vgNqPmvt^N*c#wfcR<%h2P#a)R-W@;4NH5eYLMCX+7>9*gQxdsW?uw?eO?aE3*@vf}E(L|K(j?0omo~^;8hs z=|olbTz})syYRB#EMjL1;0Cg=-Nd6on(u5gD)kQr)f;j>3e!PM{{!**Y_iF1LxbXR zg0cWDX#IK6bqfzRHMyZ_T-K0DR_0N7>3pD17QQ)Wn|U|)ztHxmidD%L0zd##me-K0 JlzH?1e*y0f9q9l7 literal 0 HcmV?d00001 diff --git a/src/main.cpp b/src/main.cpp index 7305641..c335103 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 10; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int a[SIZE], b[SIZE], c[SIZE]; diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 4c7a86d..bad1976 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -25,7 +25,7 @@ namespace StreamCompaction { int oLength = 1 << layer; // Uncomment the timer here if you want to test the efficiency of scan function - timer().startCpuTimer(); + //timer().startCpuTimer(); for (int d = 0; d < layer; d++) { for (int k = 0; k < oLength; k += (1 << (d + 1))) { @@ -41,7 +41,7 @@ namespace StreamCompaction { odata[nodeIdx + (1 << d)] += temp; } } - timer().endCpuTimer(); + //timer().endCpuTimer(); } /** @@ -72,7 +72,7 @@ namespace StreamCompaction { if (n <= 0) return -1; int num = 0; // TODO - //timer().startCpuTimer(); + timer().startCpuTimer(); for (int i = 0; i < n; i++) { odata[i] = idata[i] ? 1 : 0; } @@ -82,7 +82,7 @@ namespace StreamCompaction { if (idata[i]) odata[odata[i]] = idata[i]; } - //timer().endCpuTimer(); + timer().endCpuTimer(); return num; } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index aa23383..3ee98f8 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -11,7 +11,7 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - int threadPerBlock = 64; + int threadPerBlock = 256; int* dev_Data; int *dev_Map; int *dev_Scatter; @@ -23,7 +23,6 @@ namespace StreamCompaction { int idx = (blockIdx.x * blockDim.x) + threadIdx.x; if (idx >= nodeNum) return; idata[(idx + 1)*(1 << (d + 1)) - 1] += idata[idx*(1 << (d + 1)) + (1 << d) - 1]; - //idata[(idx + 1) * (1 << (d + 1)) - 1] += idata[(idx + 1) * (1 << (d + 1)) - 1 - (1 << d)]; } __global__ void KernDownSweep(int d, int *idata, int nodeNum) @@ -61,13 +60,11 @@ namespace StreamCompaction { int blocknum = nodeNum / threadPerBlock + 1; KernDownSweep << > >(d, dev_Data, nodeNum); } + timer().endGpuTimer(); cudaMemcpy(odata, dev_Data, sizeof(int) * n, cudaMemcpyDeviceToHost); checkCUDAError("cudaMemcpy to host failed!"); - // for (int j = 0; j < n; j++) - // printf("%d ", odata[j]); - //printf("\n"); cudaFree(dev_Data); - timer().endGpuTimer(); + } /** @@ -91,9 +88,10 @@ namespace StreamCompaction { cudaMemcpy(dev_Data, idata, oLength * sizeof(int), cudaMemcpyHostToDevice); checkCUDAError("cudaMemcpy to device failed!"); - timer().startGpuTimer(); + // TODO int blocknum = oLength / threadPerBlock + 1; + timer().startGpuTimer(); Common::kernMapToBoolean << > >(oLength, dev_Map, dev_Data); // Here I reimplement the scan part, because in the main function, scan and compaction are timed seperately, @@ -119,11 +117,10 @@ namespace StreamCompaction { blocknum = n / threadPerBlock + 1; Common::kernScatter << < blocknum, threadPerBlock >> > (n, dev_oData, dev_Data, dev_Map, dev_Scatter); - cudaMemcpy(odata, dev_oData, sizeof(int) * n, cudaMemcpyDeviceToHost); - checkCUDAError("cudaMemcpy to host failed!"); - timer().endGpuTimer(); + cudaMemcpy(odata, dev_oData, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy to host failed!"); int count, end; cudaMemcpy(&count, dev_Scatter + n - 1, sizeof(int), cudaMemcpyDeviceToHost); cudaMemcpy(&end, dev_Map + n - 1, sizeof(int), cudaMemcpyDeviceToHost); diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 169230a..8b21cb9 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -11,7 +11,7 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - int threadPerBlock = 512; + int threadPerBlock = 256; int *dev_0, *dev_1; // TODO: __global__ void NaiveScan(int d, int *idata, int *odata, int oLength) { @@ -20,24 +20,6 @@ namespace StreamCompaction { int flag = 1 << d; odata[idx] = idx >= flag ? idata[idx] + idata[idx - flag] : idata[idx]; } - //int threadPerBlock = 1024; - //int BlockNum; - - //int *dev_Data[2]; - - //__global__ void CudaScan(int d, int *in, int *out, int n) - //{ - // int thid = (blockIdx.x * blockDim.x) + threadIdx.x; - // if (thid >= n) - // return; - // int m = 1 << (d - 1); - - // if (thid >= m) - // out[thid] = in[thid] + in[thid - m]; - // else - // out[thid] = in[thid]; - - //} /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. @@ -52,31 +34,13 @@ namespace StreamCompaction { checkCUDAError("cudaMemcpy to device failed!"); int blocknum = oLength / threadPerBlock + 1; - - - /*int nCeilLog = ilog2ceil(n); - int nLength = 1 << nCeilLog; - - cudaMalloc((void**)&dev_Data[0], nLength * sizeof(int)); - cudaMalloc((void**)&dev_Data[1], nLength * sizeof(int)); - checkCUDAError("cudaMalloc failed!"); - - cudaMemcpy(dev_Data[0], idata, sizeof(int) * nLength, cudaMemcpyHostToDevice); - checkCUDAError("cudaMemcpy to device failed!"); - int nOutputIndex = 0;*/ timer().startGpuTimer(); for (int d = 0; d < layer; d++) { NaiveScan << > >(d, dev_0, dev_1, oLength); std::swap(dev_0, dev_1); } - /*for (int i = 1; i <= nCeilLog; i++) - { - nOutputIndex ^= 1; - BlockNum = nLength / threadPerBlock + 1; - CudaScan << > >(i, dev_Data[nOutputIndex ^ 1], dev_Data[nOutputIndex], nLength); - }*/ - timer().endGpuTimer(); + odata[0] = 0; cudaMemcpy(odata + 1, dev_0, (n - 1)*sizeof(int), cudaMemcpyDeviceToHost); checkCUDAError("cudaMemcpy to host failed!"); @@ -85,14 +49,6 @@ namespace StreamCompaction { cudaFree(dev_1); - /*odata[0] = 0; - cudaMemcpy(odata + 1, dev_Data[nOutputIndex], sizeof(int) * (n - 1), cudaMemcpyDeviceToHost); - checkCUDAError("cudaMemcpy to host failed!"); - - cudaFree(dev_Data[0]); - cudaFree(dev_Data[1]);*/ - - } } } From c391a2a2a65dafc1454b9213ea6535e432229b8a Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:38:12 -0400 Subject: [PATCH 3/8] Update --- README.md | 114 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 59 insertions(+), 55 deletions(-) diff --git a/README.md b/README.md index 8da03f5..6ae9274 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ CUDA Stream Compaction * (TODO) Yi Guo * Tested on: Windows 8.1, Intel(R) Core(TM)i5-4200M CPU @ 2.50GHz 8GB, NVIDIA GeForce 840M (Personal Notebook) -##Description. +## Description. In this project, I implemented the parallel computing algorithm of streaming compaction. For more details, see `INSTRUCTION.md`. ## ScreenShot @@ -15,7 +15,7 @@ These are the test results of all the method I implemented. ![](./img/result2.png); -##Performance Analysis +## Performance Analysis * *Block Size* I compare the time cost of scan function under different block size value. The result is shown as the graph below. @@ -52,59 +52,63 @@ As the plot above shows, `thrust::scan` is more efficient than the scan methods * *Test Result* - **************** - ** SCAN TESTS ** - **************** - [ 34 28 17 4 6 42 43 24 15 44 27 19 13 ... 43 0 ] - ==== cpu scan, power-of-two ==== - elapsed time: 0ms (std::chrono Measured) - [ 0 34 62 79 83 89 131 174 198 213 257 284 303 ... 24338 24381 ] - ==== cpu scan, non-power-of-two ==== - elapsed time: 0ms (std::chrono Measured) - [ 0 34 62 79 83 89 131 174 198 213 257 284 303 ... 24197 24245 ] - passed - ==== naive scan, power-of-two ==== - elapsed time: 0.057184ms (CUDA Measured) - passed - ==== naive scan, non-power-of-two ==== - elapsed time: 0.057216ms (CUDA Measured) - passed - ==== work-efficient scan, power-of-two ==== - elapsed time: 0.157728ms (CUDA Measured) - passed - ==== work-efficient scan, non-power-of-two ==== - elapsed time: 0.153376ms (CUDA Measured) - passed - ==== thrust scan, power-of-two ==== - elapsed time: 0.156192ms (CUDA Measured) - passed - ==== thrust scan, non-power-of-two ==== - elapsed time: 0.023776ms (CUDA Measured) - passed - - ***************************** - ** STREAM COMPACTION TESTS ** - ***************************** - [ 2 0 1 2 2 0 1 2 1 0 3 1 1 ... 3 0 ] - ==== cpu compact without scan, power-of-two ==== - elapsed time: 0.003695ms (std::chrono Measured) - [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 1 3 ] - passed - ==== cpu compact without scan, non-power-of-two ==== - elapsed time: 0.004105ms (std::chrono Measured) - [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 2 2 ] - passed - ==== cpu compact with scan ==== - elapsed time: 0.009853ms (std::chrono Measured) - [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 1 3 ] - passed - ==== work-efficient compact, power-of-two ==== - elapsed time: 0.212384ms (CUDA Measured) - passed - ==== work-efficient compact, non-power-of-two ==== - elapsed time: 0.219104ms (CUDA Measured) - passed - +``` +**************** +** SCAN TESTS ** +**************** + [ 34 28 17 4 6 42 43 24 15 44 27 19 13 ... 43 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 34 62 79 83 89 131 174 198 213 257 284 303 ... 24338 24381 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 34 62 79 83 89 131 174 198 213 257 284 303 ... 24197 24245 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.057184ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.057216ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.157728ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.153376ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.156192ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.023776ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 0 1 2 2 0 1 2 1 0 3 1 1 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.003695ms (std::chrono Measured) + [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 1 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.004105ms (std::chrono Measured) + [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 2 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.009853ms (std::chrono Measured) + [ 2 1 2 2 1 2 1 3 1 1 1 2 1 ... 1 3 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.212384ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.219104ms (CUDA Measured) + passed +``` +## Extra Credit +* *Efficient scan optimization* +Compared to the basic algorithm, I optimize the kernUpsweep From 6ca64cb3a71994a9ea8a7eb909e5d609b4820a60 Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:43:22 -0400 Subject: [PATCH 4/8] updated --- README.md | 38 +++++++++++++++++++++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 6ae9274..84269f2 100644 --- a/README.md +++ b/README.md @@ -108,7 +108,43 @@ As the plot above shows, `thrust::scan` is more efficient than the scan methods ``` ## Extra Credit * *Efficient scan optimization* -Compared to the basic algorithm, I optimize the kernUpsweep +Compared to the basic algorithm, I optimize the `kernUpsweep` and `kernDownsweep` function by reducing the branches in it. +Instead of judging whether the current index is the power of 2, I computer the index we need to deal with directly. + + __global__ void KernUpSweep(int d, int *idata, int nodeNum) + { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= nodeNum) return; + idata[(idx + 1)*(1 << (d + 1)) - 1] += idata[idx*(1 << (d + 1)) + (1 << d) - 1]; + } + + __global__ void KernDownSweep(int d, int *idata, int nodeNum) + { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= nodeNum) return; + int nodeIdx = idx*(1 << (d + 1)) + (1 << d) - 1; + int temp = idata[nodeIdx]; + idata[nodeIdx] = idata[nodeIdx + (1 << d)]; + idata[nodeIdx + (1 << d)] += temp; + } + +Call kernal function: + for (int d = 0; d < layer; d++) + { + int nodeNum = 1 << (layer - 1 - d); + int blocknum = nodeNum / threadPerBlock + 1; + KernUpSweep << > >(d, dev_Data, nodeNum); + } + cudaMemset(dev_Data + oLength - 1, 0, sizeof(int)); + checkCUDAError("cudaMemset failed!"); + for (int d = layer - 1; d >= 0; d--) + { + int nodeNum = 1 << (layer - 1 - d); + int blocknum = nodeNum / threadPerBlock + 1; + KernDownSweep << > >(d, dev_Data, nodeNum); + } + + From ae6677de5208c24208a28df51c3ad5da236ff8e5 Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:44:49 -0400 Subject: [PATCH 5/8] updated --- README.md | 57 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 30 insertions(+), 27 deletions(-) diff --git a/README.md b/README.md index 84269f2..edfca1b 100644 --- a/README.md +++ b/README.md @@ -110,39 +110,42 @@ As the plot above shows, `thrust::scan` is more efficient than the scan methods * *Efficient scan optimization* Compared to the basic algorithm, I optimize the `kernUpsweep` and `kernDownsweep` function by reducing the branches in it. Instead of judging whether the current index is the power of 2, I computer the index we need to deal with directly. - - __global__ void KernUpSweep(int d, int *idata, int nodeNum) - { - int idx = (blockIdx.x * blockDim.x) + threadIdx.x; - if (idx >= nodeNum) return; - idata[(idx + 1)*(1 << (d + 1)) - 1] += idata[idx*(1 << (d + 1)) + (1 << d) - 1]; - } - - __global__ void KernDownSweep(int d, int *idata, int nodeNum) +``` +__global__ void KernUpSweep(int d, int *idata, int nodeNum) { int idx = (blockIdx.x * blockDim.x) + threadIdx.x; if (idx >= nodeNum) return; - int nodeIdx = idx*(1 << (d + 1)) + (1 << d) - 1; - int temp = idata[nodeIdx]; - idata[nodeIdx] = idata[nodeIdx + (1 << d)]; - idata[nodeIdx + (1 << d)] += temp; + idata[(idx + 1)*(1 << (d + 1)) - 1] += idata[idx*(1 << (d + 1)) + (1 << d) - 1]; } + +__global__ void KernDownSweep(int d, int *idata, int nodeNum) +{ + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= nodeNum) return; + int nodeIdx = idx*(1 << (d + 1)) + (1 << d) - 1; + int temp = idata[nodeIdx]; + idata[nodeIdx] = idata[nodeIdx + (1 << d)]; + idata[nodeIdx + (1 << d)] += temp; +} +``` Call kernal function: - for (int d = 0; d < layer; d++) - { - int nodeNum = 1 << (layer - 1 - d); - int blocknum = nodeNum / threadPerBlock + 1; - KernUpSweep << > >(d, dev_Data, nodeNum); - } - cudaMemset(dev_Data + oLength - 1, 0, sizeof(int)); - checkCUDAError("cudaMemset failed!"); - for (int d = layer - 1; d >= 0; d--) - { - int nodeNum = 1 << (layer - 1 - d); - int blocknum = nodeNum / threadPerBlock + 1; - KernDownSweep << > >(d, dev_Data, nodeNum); - } +``` +for (int d = 0; d < layer; d++) +{ + int nodeNum = 1 << (layer - 1 - d); + int blocknum = nodeNum / threadPerBlock + 1; + KernUpSweep << > >(d, dev_Data, nodeNum); +} +cudaMemset(dev_Data + oLength - 1, 0, sizeof(int)); +checkCUDAError("cudaMemset failed!"); +for (int d = layer - 1; d >= 0; d--) +{ + int nodeNum = 1 << (layer - 1 - d); + int blocknum = nodeNum / threadPerBlock + 1; + KernDownSweep << > >(d, dev_Data, nodeNum); +} +``` From 2ee4713e62d6a61033740c5eaa9c7b5dca194f6c Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:46:34 -0400 Subject: [PATCH 6/8] updated --- README.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index edfca1b..f8a6b93 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,9 @@ These are the test results of all the method I implemented. ![](./img/result2.png); ## Performance Analysis + * *Block Size* + I compare the time cost of scan function under different block size value. The result is shown as the graph below. ![](./img/blocksizeComparison.png); @@ -40,6 +42,7 @@ instead of: } * *Efficiency of different scan method* + I compare the efficiency of different scan method and make a plot below. ![](./img/ScanComparison.png); @@ -48,6 +51,7 @@ As the plot shows, when the size of array is not very huge, `cpu scan` will be But there is an another thing I don't quite understand. That is the `naive scan` takes the most time when the size of array is very huge. I think `naive scan` should be more efficient than cpu scan, but I don't know what's going on here. * *Thrust scan* + As the plot above shows, `thrust::scan` is more efficient than the scan methods we implemented on GPU. I think there may be 2 reasons. One is that `thrust::scan` function may use the share memory to store the data and access the data from share memory instead of from global memory. In this way, it will produce less latency because it visits the global memory less times. The other is that `thrust::scan` may make some optimizations on the binary search algorithm. The best proof of this is that the time cost of `thrust::scan` will be much less when the size of the array is not the power of 2, which means when the size value is the power of 2, it is probably the worst case for its algorithm. * *Test Result* @@ -106,8 +110,10 @@ As the plot above shows, `thrust::scan` is more efficient than the scan methods elapsed time: 0.219104ms (CUDA Measured) passed ``` -## Extra Credit +## Extra Credit + * *Efficient scan optimization* + Compared to the basic algorithm, I optimize the `kernUpsweep` and `kernDownsweep` function by reducing the branches in it. Instead of judging whether the current index is the power of 2, I computer the index we need to deal with directly. ``` From 868da7040edcb3a75539b5e409596d18d7b94e6a Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:47:45 -0400 Subject: [PATCH 7/8] updated --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index f8a6b93..7625bad 100644 --- a/README.md +++ b/README.md @@ -114,8 +114,8 @@ As the plot above shows, `thrust::scan` is more efficient than the scan methods * *Efficient scan optimization* -Compared to the basic algorithm, I optimize the `kernUpsweep` and `kernDownsweep` function by reducing the branches in it. -Instead of judging whether the current index is the power of 2, I computer the index we need to deal with directly. +Compared to the basic algorithm, I optimize the `kernUpsweep` and `kernDownsweep` kernal function by reducing the branches in it. Instead of judging whether the current index is the power of 2, I computer the index we need to deal with directly. + ``` __global__ void KernUpSweep(int d, int *idata, int nodeNum) { From b6034a6589e9c93cd9630960540346ee7ada498d Mon Sep 17 00:00:00 2001 From: guoyi1 Date: Tue, 19 Sep 2017 23:49:04 -0400 Subject: [PATCH 8/8] updated --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 7625bad..0084aca 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ These are the test results of all the method I implemented. ## Performance Analysis -* *Block Size* +* **Block Size** I compare the time cost of scan function under different block size value. The result is shown as the graph below. @@ -41,7 +41,7 @@ instead of: KernUpSweep << > >(d, dev_Scatter, nodeNum); } -* *Efficiency of different scan method* +* **Efficiency of different scan method** I compare the efficiency of different scan method and make a plot below. ![](./img/ScanComparison.png); @@ -50,11 +50,11 @@ As the plot shows, when the size of array is not very huge, `cpu scan` will be But there is an another thing I don't quite understand. That is the `naive scan` takes the most time when the size of array is very huge. I think `naive scan` should be more efficient than cpu scan, but I don't know what's going on here. -* *Thrust scan* +* **Thrust scan** As the plot above shows, `thrust::scan` is more efficient than the scan methods we implemented on GPU. I think there may be 2 reasons. One is that `thrust::scan` function may use the share memory to store the data and access the data from share memory instead of from global memory. In this way, it will produce less latency because it visits the global memory less times. The other is that `thrust::scan` may make some optimizations on the binary search algorithm. The best proof of this is that the time cost of `thrust::scan` will be much less when the size of the array is not the power of 2, which means when the size value is the power of 2, it is probably the worst case for its algorithm. -* *Test Result* +* **Test Result** ``` **************** @@ -112,7 +112,7 @@ As the plot above shows, `thrust::scan` is more efficient than the scan methods ``` ## Extra Credit -* *Efficient scan optimization* +* **Efficient scan optimization** Compared to the basic algorithm, I optimize the `kernUpsweep` and `kernDownsweep` kernal function by reducing the branches in it. Instead of judging whether the current index is the power of 2, I computer the index we need to deal with directly.