NVIDIA GPU Acceleration Practices with Juniper and WEKA
Learn how to design, build, deploy, and operate your AI cluster environment. Watch this expert panel discussion on accelerating NVIDIA GPUs with Mansour Karam (GVP of Products, Juniper Networks), Anthony Lembo (VP of Global Sales Engineering, WEKA), and Jay Wilson (Architect, Juniper Networks).
You’ll learn
Why networking and storage are just as important to AI clusters as compute resources
The importance of using the right software to manage your cluster
How Juniper and WEKA are delivering the future of AI cluster design and management
Transcript
0:01 [Music]
0:08 I'm Matt Free, I'm with Juniper, and I head the sales specialist team covering AI and data center for the Americas. This is my first NVIDIA show, but not the last, and I'm really grateful for the chance to speak with you, to have the panel here, and to be partnered with WEKA. Closest to me is Anthony Lembo. Anthony has over 15 years of deep technical experience around storage, he's a global leader for WEKA, and he's known for leading high-quality, high-performing engineering teams. Welcome, Anthony. In the middle is Mansour Karam. Mansour is the Global Vice President for Data Center and AI for Juniper. He was a co-founder of Apstra back in 2014, which was purchased by Juniper in 2021, and he's an integral part of Apstra, our data center business, and our partnership with WEKA. And last but not least is Jay Wilson. Jay is a senior architect for Juniper with deep experience collaborating across different types of architectures, including storage and networking, and he also leads a team of architects. Thank you, Jay.
1:27 All right, I'll start with you, Mansour, real quick. Why are we here? Juniper at an NVIDIA event, sitting next to WEKA right now. Maybe just talk about where we are in the data center and why.
1:41 Yeah, well, first of all, thank you everyone for being here and for the opportunity; it's been a pleasure. When you think of AI, the data center is the foundation for everything that's going on with AI. When you're in a gold rush, you should be selling, what do you call them, the picks and the shovels, and that's what we are: we're the tools that everyone needs as they're building these AI infrastructures. You definitely need the GPUs, but you also need the storage and you need the networking, and it all needs to work harmoniously together if you want the ability to deliver on your AI applications. So yes, absolutely, AI is a big focus for us, and it's a big market opportunity: AI workloads for the data center and for data center networking are growing at a rate of around 50% year over year, billions of dollars out of a data center networking market of around $22 billion growing to $32 billion, fueled by AI workloads. So essentially, we at Juniper, as part of the data center team, are very focused on the AI opportunity and on delivering the right products and the right solutions for AI. And of course we're partnering with WEKA here from a storage standpoint; storage is another critical component. The way we want to make our customers successful is by delivering what we call validated solutions: end-to-end designs that bring in all the right components but ultimately deliver a turnkey experience for our customers, so they don't have to do all of the work themselves. As part of these validated designs, working with our ecosystem partners in storage and, on the compute side, with GPUs, is critical to our success. And certainly, maybe you can present your perspective as well, from a storage point of view.
4:10 Yeah, absolutely. One thing I'll mention: I like the picks and shovels analogy, and I feel like there's a lot of commonality between networking and storage, especially in the context of what's going on here. We've seen a thousand-x-plus improvement in compute performance as it relates to AI and these workflows, but networking and storage you don't hear a lot about. Internally, what we talk a lot about is the infrastructure triangle: there's a foundation that you run all of these upstream applications and solutions on, and compute is part of that foundation, but networking and storage are also big components of it. So balancing them out and making sure there's no weak point is really important, because if you don't think about those things and you start to build an upstream stack, and we've seen this and can share some perspective on it, that foundation is weak. The analogy would be that you wouldn't build a house on a foundation that you know is at risk of crumbling. You want to build a strong foundation, because you can't go back and retrofit these easily.
5:13 Yeah, it's all about balancing all of those resources and components. If your networking is the bottleneck, then your GPUs are left stranded; if you don't have enough storage capacity, you may have more GPUs than you can actually take advantage of. And especially when we talk about the cost of these components, it's really sad if a GPU that costs $40,000 is left sitting there unutilized. Not to mention job completion time: the time it takes to actually train your model can get delayed by weeks or months if your setup is not optimized and you don't have the proper balance between all of these different components. So this is really critical.
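To make the stranded-GPU arithmetic concrete, here is a rough back-of-the-envelope sketch; the $40,000 GPU price is the figure quoted above, while the cluster size and utilization loss are hypothetical:

```python
# Rough, hypothetical illustration of the stranded-GPU arithmetic.
gpu_price = 40_000        # $ per GPU, the figure quoted on the panel
cluster_gpus = 512        # hypothetical cluster size
utilization_loss = 0.30   # fraction of GPU time stranded by a bottleneck

stranded_capital = gpu_price * cluster_gpus * utilization_loss
print(f"Capital effectively idle: ${stranded_capital:,.0f}")

# The same bottleneck stretches job completion time proportionally:
# a 4-week training run at 70% effective throughput takes ~5.7 weeks.
print(f"A 4-week job becomes {4 / (1 - utilization_loss):.1f} weeks")
```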
6:16 That's great. So Jay, if I can ask you as well: from an architectural standpoint, what mistakes have you seen customers making when they're trying to roll out an AI cluster, and how do we solve those together?
6:28 In case you didn't hear it, the question was: what mistakes are customers making? I can start with the mistakes we've made; I think that's an even better starting point, because it's fresh in my mind. One of our biggest challenges, and I don't know if this will be common across the user community or not, is that there tend to be, at least inside of Juniper, multiple teams that want to use the infrastructure. It's not just one group of people who say, "I want to go do a training run and get my data set built for my inference." It's literally five different teams inside of our company, all vying for the central resource. What's interesting is that everybody wants something a little different when it comes to how they want it to be programmed, and they may actually be running a completely different model: instead of running BERT, somebody else might be running something like Llama 2, or you might be running GPT-3 or GPT-4. So depending on the workload, the environment needs to change. We were our own worst enemy up until probably about three or four weeks ago, where our engineers were literally just flipping and changing and morphing configs, and people were crossing each other up. We've had this fantastic tool all along called Apstra, which we purchased, and which Mansour was one of the founders of, and we've come to the point now where we can eliminate at least that as a contention, because each team now has a baseline config, and it's super simple, from an Apstra standpoint, for them to roll back and say, "This is the setup I need to use; I want this config." What makes it super simple is that it doesn't look at it from a box-to-box-to-box perspective; it literally looks at it from the perspective of the entire infrastructure. On the networking side, it looks at the backend network, the frontend network, and the management network, and it says these are the settings that all need to be applied now, not one box at a time. That's helped us tremendously to get over that hump. And I can only imagine, when you get into large customer bases, particularly in enterprise, which is my focus, enterprise customers saying, "This cluster needs to be used in this manner, we might want to run this different model, we've got this new set of parameters we want to try." Being able to have a config that you can easily apply fabric-wide, or to multiple fabrics, is fantastic. So I would say that's the number one hurdle. When you're thinking about deploying as a customer, start from the very beginning, not later: think from the very beginning about what the possible use cases are, and find a tool, hopefully you would pick something like Apstra, but I'm not a salesperson, I'm an architect, find a tool that will allow you to easily morph your environment as your needs change. Because if you don't have the tool, you're going to waste a lot of cycles with those GPUs sitting idle, and that's not something your company really wants.
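Jay's fabric-wide baseline idea lends itself to a small sketch. This is a minimal illustration of the concept only, not the actual Apstra API; every name and parameter below is a hypothetical stand-in:

```python
# Minimal sketch: each team keeps a named "golden" baseline, and a
# rollback applies fabric-wide intent rather than box-by-box configs.
BASELINES = {
    "bert-team": {"qos_profile": "bert", "ecn": True, "pfc_priorities": [3]},
    "dlrm-team": {"qos_profile": "dlrm", "ecn": True, "pfc_priorities": [3, 4]},
}

def push_config(switch: str, intent: dict) -> None:
    # Stand-in for whatever mechanism actually renders and pushes config.
    print(f"{switch}: applying {intent}")

def apply_baseline(team: str, fabric: list[str]) -> None:
    """Apply one team's baseline to every switch in the fabric at once."""
    intent = BASELINES[team]
    for switch in fabric:  # backend, frontend, and management networks alike
        push_config(switch, intent)

apply_baseline("dlrm-team", ["leaf1", "leaf2", "spine1", "spine2"])
```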
9:44 Maybe one thing I'll add is that I think what this highlights is the importance of using the right software to manage and optimize all the various components in your cluster and make them look like one system. Certainly Apstra is a tool I spent ten years of my life focused on, so we're very pleased with the traction Apstra has had, but also, generally, with the difference it makes for our customers. For AI it's become almost a necessity, for exactly the reasons Jay mentioned: every setup has a different set of parameters and different criteria for what makes it perform well. What you need is a software solution that has the ability to optimize a cluster in real time, and then verify that it's set up the right way by extracting the right telemetry and running tests in real time. That software automation capability becomes that much more important when you need all of the parameters to be balanced across all the various components, in real time, on a continuous basis. And the cost of not doing that, again, is very high, both in terms of how long it takes you to train your models and how much infrastructure is sitting there stranded. So yes, I agree with that.
11:24 I can give you a really concrete example of that. Inside of our development lab down here in Sunnyvale, we have found that the tuning parameters for quality of service and class of service need to be slightly different depending on whether you're running a BERT test or a DLRM test. Being able to quickly know that you're going to be running that DLRM test, and, by the way, here's the baseline config already built that already knows what those two parameters should look like for that job, makes it super simple to not waste cycles going around reprogramming everything and trying to figure out how things need to be tuned. So it really is important to have those tools in place.
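As a sketch of that workflow, a pre-run check of the fabric's active tuning against the workload might look like the following; the specific ECN/PFC values are invented purely for illustration:

```python
# Hypothetical sketch: BERT and DLRM want slightly different QoS/CoS
# tuning, so verify the fabric's active profile before launching a job
# instead of rediscovering a mismatch mid-run.
EXPECTED = {
    "bert": {"ecn_min_kb": 150, "ecn_max_kb": 3000, "pfc": [3]},
    "dlrm": {"ecn_min_kb": 300, "ecn_max_kb": 6000, "pfc": [3, 4]},
}

def ready_to_run(workload: str, active_profile: dict) -> bool:
    """True only if the fabric already carries this workload's tuning."""
    return EXPECTED[workload] == active_profile

active = {"ecn_min_kb": 150, "ecn_max_kb": 3000, "pfc": [3]}
print(ready_to_run("bert", active))   # True: safe to launch
print(ready_to_run("dlrm", active))   # False: reapply the DLRM baseline first
```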
12:17 And just for clarification: the setup Jay is mentioning is a large GPU AI cluster, with WEKA as part of it, in our proof-of-concept lab back at headquarters. It's an integral part of us becoming experts at AI and helping our customers build these AI clusters. It's how we deliver these validated designs that are fully tested: we know how to guide our customers. We're not just selling a networking component; ultimately it's about delivering a solution that we know will work for our customers, so they don't have to go build it themselves or buy it as something proprietary from one vendor. We're bringing in all the best-of-breed components, including of course our Juniper switches and the management software, and storage from our partners, so that when the customer uses it and deploys it and operates it, they know it's going to run optimally. That's part of the investment we have made in order to have the ability to deliver that for our customers. I just wanted to clarify.
13:30 Yeah, that makes sense. I'll give maybe a storage-specific lens, a WEKA-specific lens, on some of what we've seen, and I'll try to give it a networking slant as well so you get our perspective. The question was around what mistakes are made; that's maybe a heavy word, because I think all of us are exploring in real time, but what we've seen are things that come to mind around growth and scale. If you think about what these architectures look like from a networking perspective: if you don't design things accounting for how you expect the storage cluster itself to grow, and how much the number of clients is going to expand, there are pretty massive implications for what you might need to do at the networking layer. We maybe get trapped in that sometimes, where from a storage standpoint we're very focused on storage, we're focused on scaling on the storage side; but if you're not planning for what happens on the network side when you quadruple or more the size of the backend storage cluster, and the same for the number of clients, there are massive implications for what kind of topology you use. And then monitoring the health of this is something else: if you add components, at what point do you hit congestion? These environments also have levels of congestion that people, especially from an enterprise base, are just not used to seeing.
14:47 Yeah, especially when you talk about storage: packet loss in the network, for example, can be disastrous. So it's about making sure the network is set up in a way where it has the capabilities and the features to ensure that, whatever happens, no packet is lost, and doing that in the context of this balanced setup: the right amount of networking for the right amount of storage and the right amount of compute. There are some things, like packet loss, that could really have devastating consequences for the performance of the cluster if they happen, so it's really important that, whether through the software or the whole setup, they're avoided at all costs.
15:38 Specifically, in our setup on the WEKA side, we're using your POSIX client, and we've had tremendous success with it. With the POSIX client on the storage, we are getting almost the theoretical rate that you should supposedly be able to get out of that storage, and we have experienced zero packet loss. Now, am I saying everything's a panacea and everything's beautiful? Every once in a while things will glitch, but it's usually when a team has switched over: one of the teams has taken over and they haven't reset something to the baseline like they meant to, and they'll say, "All of a sudden we're getting half bandwidth," and it's, "OK, did you make sure you really flipped everything over using the golden base config?" And it's, "Oh yeah, we grabbed the wrong base config." Flip that over, and we're back. We've had very little issue at all interfacing with the WEKA storage and the POSIX client.
16:41 Yeah, there's another thing, and again I'll give the architectural lens: something I think about whenever we're building a design is how much agility you have in that design, specifically around an AI/ML workflow. Even previously, deep learning taxed environments significantly, but with gen AI we're really seeing it. When I say agility, the storage lens is: can I adapt to multiple different kinds of I/O patterns and data access patterns, and what happens if I flip from one to the next? There are implications on the networking side, but the implications from a storage perspective are substantial. If I'm designing for large streaming reads, and then a researcher decides to use a workflow, which is becoming common, that has a completely different kind of I/O pattern, heavy small random reads, or small random writes, or metadata, or all of the above, how can you absorb that? What infrastructure changes do you need to make? A lot of times, if you're not thinking about those, you'll get pinched into a corner where the way out is not clear-cut; it's not something that can be patched easily.
17:54 Well, on top of that, this whole market is moving fast, with new cards coming out, GPUs, fabrics. What do you see in the next 12 months? How is that going to affect WEKA and Juniper, especially around the new WEKApod announcement? How is that going to help customers?
18:06 Yeah, I'll take a cut at that first. It's interesting; I can answer that a few different ways. With new cards and new technology, one thing is for sure: we're going to see a lot of new hardware technology coming out, and new technology from an infrastructure perspective. Something that's salient, based on this conference but also over the last year for us, as we've seen AI-based workloads with either gen AI companies or GPU clouds, GPU-as-a-service companies, is basically cooling and thermals. With new hardware, we have situations where sometimes our customers get access to hardware before we do, and then we'll basically do a live qualification, which is a great thing to be able to do. As these new cards come out, what we see is a very quick pivot to "I'm not buying the old stuff, period; I'm only buying the new stuff," and we're in a position where we're going to qualify it live, in place. And some interesting things show up, like the thermals on some of these cards: you get all high-bandwidth networking cards, the latest generation of everything, and there are some unexpected things from a data center infrastructure position that come up. So that hovers over it, but it goes back to agility. The question is: let's say a new NVMe drive or a new networking technology is released from a hardware perspective; how quickly can I take advantage of that? We've seen what I'd call extreme competition to get the latest, and to push for the latest faster than everyone else. A new networking technology comes out: how fast can we get it? New GPUs: how fast can we get them? New switches: how fast can we get them? How quickly you can integrate those into an environment is something that's important to consider. With all these new technologies coming out, it's about how we absorb them and then leverage them in real time.
20:12 is moving uh really really quickly and uh we are having Innovations every day
20:17 and new hardware new gpus um of course Nvidia announced their most uh recent
20:24 GPU uh but we're also seeing GPU announcements from other uh vendors I think that uh you know what's going to
20:30 be really important is to have for for when you're deploying AI clusters what's
20:36 going to become increasingly important is the ability to leverage the latest and the greatest as part of an open uh
20:44 ecosystem and so as much as possible one needs to avoid kind of locking
20:49 themselves into just one vendor across the entire stack um and you know from a
20:55 networking standpoint you know we have a lot of experience with this uh back to the even HBC days where you
21:02 know you have technologies that are specialized and that are more proprietary like infiniband but at the
21:07 end ethernet is the technology that ultimately wins and so we're very focused on making sure ethernet which is
21:14 you know where the entire ecosystem is where all the investment is in the industry where you have lots of choice
21:20 from different vendors gets to the level of performance and exceeds the left the the level of performance of proprietary
21:27 uh Solutions so this is we've uh announced our 800 gig uh switch which is
21:34 the first 800 gig uh on the market uh on par or beating um in infiniband this is
21:41 where why we have all this uh AI specific features around you know think of the network as a freeway so you have
21:48 congestion control and you have load balancing to make sure that there are no that there are no bottlenecks this is
21:53 why we're part of the uce the ultra low ethernet consult U uh and that's where
22:00 all of the standards are being written so that we do have interoperability across all the various uh Solutions uh
22:08 and when you deploy ethernet the choice of the network that you deploy and having an open uh network
22:15 is what will enable you to support the various gpus on the market because ethernet will be kind of the standard
22:23 Network or the standard way to connect to every one of those uh Solutions and you're going to have an ability to
22:29 support the best of greed storage and so you know essentially you know one thing
22:35 to one one one thing to think about is how do I set myself for Success uh so
22:42 that I can leverage without changing my entire setup or my entire workflow how
22:48 can I leverage the latest and greatest technology the latest and greatest speed
22:53 and feed greatest and latest Innovation that's been available uh to me I think
22:59 that that becomes a big uh and increasingly important consideration let me since you mentioned
23:04 Since you mentioned Ethernet, let me take it from an architect's standpoint, in case you're not following this space. Architecturally, look at how Ethernet is trying to further address the concerns around AI/ML. Granted, these concerns have been around since the HPC days; we've been trying to address them forever, and I have a background in HPC. It started with FCoE, if you recall that, several years ago, when we got the lossless queue: just about every vendor introduced a lossless queue, some in different spaces like Fibre Channel, and we introduced three different standards to make it work, and it turned out the congestion management just never worked from the get-go. But look at what's happening today: they've published DCQCN, and anybody who's doing AI/ML and talking Ethernet is talking DCQCN. Of the two standards underneath it, PFC came from what we did with FCoE, and the other was introduced as part of explicit congestion notification over IP. But you also need to realize that there are three other standards currently in the works: there's actually DCQCN+, there's DCQCN++, and now there's HPCC. Just about every vendor on the market, including NVIDIA, even though they're out there promoting InfiniBand, is participating in these new standards trying to break out. Are all of them going to make it? I don't know. Will one of them make it? More than likely. So it really is important to stay up on what's happening, and we as a company, at least Juniper, and WEKA seems to be in partnership with us on this, are making sure that we are in tune with what is happening in that space. So Ethernet, we do believe, will be the truth, the light, and the way; it has always prevailed in the end, and time will tell.
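For readers following along, the DCQCN mechanism Jay references is, at its core, a sender-side rate controller: cut the rate multiplicatively when ECN-marked feedback arrives, recover gradually otherwise. A simplified toy version, loosely following the published algorithm:

```python
# Toy, simplified DCQCN reaction point: not a faithful implementation.
def dcqcn_step(rate: float, target: float, alpha: float,
               congested: bool, g: float = 1 / 16) -> tuple[float, float, float]:
    """One update of (current rate, target rate, congestion estimate)."""
    if congested:                      # ECN-marked feedback (a CNP) arrived
        target = rate                  # remember where we were
        rate *= 1 - alpha / 2          # multiplicative decrease
        alpha = (1 - g) * alpha + g    # congestion estimate rises
    else:
        alpha = (1 - g) * alpha        # estimate decays
        rate = (rate + target) / 2     # recovery back toward the target
    return rate, target, alpha

rate, target, alpha = 400.0, 400.0, 1.0   # Gb/s; starting values hypothetical
for congested in (True, False, False, False):
    rate, target, alpha = dcqcn_step(rate, target, alpha, congested)
    print(f"rate={rate:.1f} Gb/s, alpha={alpha:.3f}")
```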
25:09 One comment on that; I'll give the perspective of having been in the high-performance space, and AI and ML are a subset of that, since we've existed. What I'll say for sure is: think about what I'd call a legacy HPC mindset, a national lab or a skunkworks commercial organization that has an HPC cluster with a parallel file system. It has always been InfiniBand. But what's become clear to me, the trend is more like the last three or four years, is that more and more Ethernet is emerging as an option in those HPC architectures. And especially from a greenfield environment perspective: if you look at an enterprise customer that moves, they're almost always Ethernet; there's no discussion about InfiniBand. So there's this shift we've seen happening, and more and more it's emerging as the first thing that we talk about.
26:08 I will contradict the vice president one little bit: I've been doing HPC a really long time. Before InfiniBand it was called Myrinet, and before Myrinet it was called Dolphin. There we go. So there's always been something that comes along, and in the end it's always been called Ethernet.
26:19 So actually, something anyone can do is go to top500.org, where you can look at the list of the top 500 HPC clusters. What you'll see is that on the first page there's a bunch of proprietary interconnect and then a lot of InfiniBand; on the second page you see some InfiniBand and start to see more Ethernet; and by the third page it's overwhelmingly Ethernet. It's always been the case. You bet against Ethernet at your own peril: it's where the entire market is, where all the investments are made, and where the entire ecosystem is.
27:08 Yeah, probably the biggest challenge, and I don't know if there are InfiniBand people in here or not, is that InfiniBand, to me, and again, my history is also storage, which is why I asked to sit up here, is like Fibre Channel. If you have a background in Fibre Channel: Fibre Channel is credit-based, and it expects you to acknowledge and say how many buffers you have and how much space you can handle, and InfiniBand is extremely similar when you look at it. That's what Ethernet gets knocked for: the fact that it can't account for the buffers. So when you're designing from an architectural standpoint, the things you need to think about are not only "can I get it there" and "can I get it there in a timely manner"; you also need to think about, believe it or not, how long your wire is. This is really critical when you're building these clusters, because you need to think in terms of your delays, your buffer overheads, and how many packets, or how much of a packet, can fit on the wire for the distance that the wire runs. So there's a little more to architecting when you get into the Ethernet side of it, but it's not insurmountable. It's the kind of thing those of us who have played in the HPC space for a long time have had to deal with and think about, and it's no different in the AI/ML space.
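Jay's "how much fits on the wire" point is the bandwidth-delay product. A quick back-of-the-envelope calculation, assuming roughly 5 ns/m propagation in fiber and hypothetical link figures:

```python
# Bits in flight on a link: bandwidth x one-way propagation delay.
def bits_in_flight(link_gbps: float, cable_m: float, ns_per_m: float = 5.0) -> float:
    delay_s = cable_m * ns_per_m * 1e-9  # ~5 ns/m in fiber
    return link_gbps * 1e9 * delay_s

# A 100 m run at 800 Gb/s holds about 400,000 bits (~50 KB) in flight,
# which is data a lossless fabric must be able to absorb on a pause.
print(bits_in_flight(800, 100) / 8 / 1024, "KiB in flight")
```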
28:31 Excellent, thank you. I've got one more question, and then we'll open it up to the customers and folks in the room. When it comes to partnerships, there are a lot of new logos here, and with the AI 100 or 200 there's a lot of alphabet soup. How are we, collectively, Juniper and WEKA, helping our partners with their customers? Any suggestions for a partner?
28:58 Well, maybe I'll speak from our perspective at least. Again, I go back to the concept of validated designs, and of management solutions like Apstra, which are multivendor. The way to empower our partners is to allow them to deliver turnkey solutions to their customers while offering choice. Partners that just sell a proprietary solution, A to Z, from one vendor, and there are quite a few of those, ultimately see all of the profit being made by the vendor. But when you have a solution that is more configurable, where the partner can provide options, ask questions of the customers, and, based on their preferences or their specific use cases, offer the flexibility of different solutions, that's where I think the partner can add a lot more value. Then they can be trusted in making those decisions for their customers, and they can start adding services on top, if needed, or development on top. So the concept of an open validated design is what we've seen be really helpful for our partners, in turn helping their customers deploy the right solutions.
30:35 From our perspective, I completely agree with everything that Mansour said; it comes down to the flexibility that we offer. There's an enormous amount of flexibility from a WEKA perspective, because we provide software, and there are many, many different kinds of configurations that might make sense in different environments. So from a partner perspective, we provide software and we provide designs regardless of which hardware you choose or which cloud vendor is in the mix. Something else we notice is that there's this kind of ping-pong that happens between cloud and on-prem, so it requires flexibility. But it's about providing enough flexibility to build custom designs when that makes sense, while at the same time balancing that with ease of use. I think of Apstra and what it provides at the network level; that's something that is sorely needed when you look at how to make something turnkey but keep the flexibility to change components. How do you make things easy but take advantage of the flexibility that's in place? On the WEKA side, we can take advantage of hardware in real time; we can run on just about anything out there, is what I'd say. But we also don't want to just put any option out on the floor, so we need partners that can help us craft what makes sense for particular markets. It's a two-way street that we're very interested in. And then, for partners, from a networking perspective: we're focused on the data side of things, on the storage side of things, and the networking piece can get complex very, very quickly, especially when you're developing something that's brand new. So the overarching theme is: how do I take advantage of the flexibility we provide, but also wrap it in a way that is very simple and predictable?
32:25 Excellent, thank you.
32:31 Yeah, I was just going to say: for me, when it comes to the partners, it's really about building the right use case, because if we don't have that right use case, the partnership, at least in my experience, and I've been doing this a long time, usually falls apart. As long as we identify the right use case, and the partner is aligned with that use case, the partnership usually works out really well.
32:49 Great, thank you. All right, I'll pause; let's open it up to the audience.
32:58 [Audience question, partially inaudible] When you design a compute network for GPUs, where a workload goes over hundreds of GPUs, it's a latency-sensitive network, which typically we'd talk about InfiniBand for. Do you feel like Ethernet can replace InfiniBand in those situations?
33:24 Yes, the answer is: absolutely, we can. At the end of the day, when it comes to AI workloads, these are very batch-like: there's a lot of data that needs to go to specific GPUs at one time, and if one piece of one parameter is missing, you can't do the computation. So it's much more about having the right bandwidth, balanced with what the GPU can take in at the right time. It's not necessarily just latency; it's about making sure there is no packet loss, which is critical, and about having the right bandwidth and the ability to get the data all in one place. The discussion about low latency versus deeper buffers has been going on for a long time, and you can mix and match. From an Ethernet standpoint, you do have very low-latency solutions, in the hundreds of nanoseconds, which is what our 64-port 800-gig switch delivers, very much on par with what NVIDIA has. But you also have deeper-buffer solutions, which can come in quite handy, both to smooth traffic and to absorb congestion in the process of sending all the traffic; it's like having bigger dampers on your car. The combination of these two has been proven to be really powerful and capable of meeting the performance requirements of these AI clusters, and that's part of the testing that we do day in and day out in our internal labs and in our POC environment.
35:31 To go on top of what Mansour was saying: if you look at the workload, particularly a BERT workload that we've been running in our AI lab up at the corporate office, InfiniBand comes in somewhere around 2.5 or 2.6 minutes of job completion time to run that entire BERT training model. We can do that on Ethernet, and we can do it in 2.6 or 2.7; there's a little bit of jitter in there. So we're almost identically in line with what you can get with IB. And again, it's not really about latency. In the HPC world it's absolutely about latency, because MPI is very latency-sensitive; RDMA is more about dropped-packet concerns, because when you drop a packet in RDMA, you don't just resend that one packet: you have to send multiple packets back to get back to a checkpoint, more or less. Again, think of Fibre Channel, where it has to do a reset and come back to a point in time. So the bandwidth is absolutely critical in the Ethernet world, and the buffers, making sure you tune the parameters correctly for your workloads, are definitely critical, because I can tell you what we've seen: if we're not tuning the parameters correctly, that 2.6 or 2.7 will jump to 5 minutes. So it is really, really important to understand what your model needs, and that's one of the reasons I'm really happy we took the time as a company to invest in building out that cluster: because now our engineers who are building our products can sit there, look at it, and go, "Oh, quality of service and class of service really do matter. Oh, these parameters like PFC and ECN also matter."
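Quick arithmetic on the numbers Jay quotes: the Ethernet penalty versus InfiniBand is a few percent, while a mistuned fabric nearly doubles the run:

```python
# Job completion times for the BERT run, in minutes, as stated above.
ib_jct, eth_jct, mistuned_jct = 2.5, 2.7, 5.0

print(f"Ethernet overhead vs IB: {(eth_jct / ib_jct - 1) * 100:.0f}%")          # 8%
print(f"Mistuned overhead vs tuned: {(mistuned_jct / eth_jct - 1) * 100:.0f}%")  # 85%
```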
37:29 Maybe I'll just add a couple more considerations. One is that, especially when your network becomes super large, if you have a centrally controlled setup where you're controlling the network flow by flow, which is essentially how InfiniBand works, you run into challenges from a scalability standpoint. So I'd say that's number one. And number two is, again, the ecosystem: when you're building an Ethernet solution, you have state-of-the-art monitoring solutions from many, many different vendors, you have the ability to monitor, you have flow-based telemetry, and all of these come in really handy. For example, with Apstra we use all of that so that we can optimize the performance of the cluster. So again, the ecosystem comes in handy when you're trying to deliver the most optimal performance out of your cluster, and that ultimately has an impact on the performance you can deliver.
38:45 [Audience] One question, and it might be slightly annoying. I know Ethernet is kind of the Swiss army knife of networking; it can do a lot, maybe not exactly as well as InfiniBand for some things, but one nice thing about Ethernet is that it seems like it could be a fix-all for people and scale well. Would there be a situation, at a certain scale, where something like OpenSM for Ethernet makes sense? The reason I ask is that sometimes, from a security standpoint, not having IP, being able to go direct, and having complete deterministic control of where all the packets go might be useful. Do you see that being available on Ethernet: a mixed model where it can act more like InfiniBand when it needs to, and act more like Ethernet to scale, in the future?
39:38 Well, I'd say, I don't know if you're familiar, but I was part of the early efforts with SDN, with OpenFlow; you probably know OpenFlow, where it was all about controlling every flow and bypassing the distributed protocols that have been working well for the last 30 years. My lesson from going down that path was: wow, there is a lot of wisdom in these distributed protocols, a lot of wisdom in the internet protocols this community has built over the last 30 years. So I find it hard to identify use cases where trying to move away from your standard protocols can really have a big impact. Maybe in the context of, for example, a monitoring network, controlling every flow was helpful; maybe OpenFlow had a play there. But generally, when you're talking about a network, especially in the context of AI, I really don't see a benefit to those approaches. I think it's much better to just stick with the standards.
41:04 But I will argue, and again, I tend to speak my mind, as you can tell: I could argue that being completely distributed, and just letting the network do what it thinks it needs to do, is the right thing. I particularly think of this: I was working with a customer a few years ago, and it was not AI/ML, it was HPC, and they wanted to spray packets. Spraying packets is not goodness in RDMA. It is not goodness. Not only do you need a special NIC that knows how to reassemble, but you had better hope that nothing gets dropped. So yes, Ethernet is fabulous, and I would love for everybody to run Ethernet, because I'm a Juniper employee, absolutely. But as an architect, there are just some things where, because you can do it, doesn't mean you should do it. And right now, to anybody who says, "I'm going to build my AI/ML cluster, I'm going to use Ethernet, and I've got these NICs that know how to take these packets and reassemble them, so I'm just going to spray them for the most effective use of my bandwidth," I say: more power to you, because I think you're in for a really quick reality check. As soon as you lose one packet, think about everything that was sprayed: all of it has to be retransmitted from the reset point. So the fact that it's distributed and the fact that it's open are great, but there are just some things you should not do. I'm sorry, that's my opinion.
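A toy model of Jay's warning about spraying over go-back-N style recovery; the window size and loss rates are hypothetical:

```python
# With go-back-N style recovery, one lost packet in a sprayed window
# forces a resend of the whole window back to the checkpoint.
def clean_window_odds(window_pkts: int, loss_rate: float) -> float:
    """Probability that a sprayed window arrives with zero loss."""
    return (1 - loss_rate) ** window_pkts

for loss in (1e-5, 1e-4, 1e-3):
    odds = clean_window_odds(1000, loss)
    # Expected full-window sends per delivered window is roughly 1/odds.
    print(f"loss={loss:.0e}: clean-window odds {odds:.1%}, "
          f"~{1 / odds:.2f} sends per good window")
```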
42:48 Yeah, thank you. One more question?
42:55 Sorry, did you say one more? [Audience question, partially inaudible] What are the key differences between WEKA and platforms like EMC, Pure, and NetApp?
43:09 Yeah, so the first thing that comes to my mind when I hear that is: parallel file system, or not a parallel file system. That's the first thing, because it makes a big difference if you need to use something like MPI. WEKA is a parallel file system, so we can operate in that space. Generally, if you compare those platforms, what Dell EMC and NetApp provide, against us, a lot of times we look at that and say, well, one of us is maybe in the wrong space, because of the criteria. If you're looking for a parallel file system that delivers ultra-high performance and you're trying to push the boundary, which is what we're talking about here at GTC, then WEKA is very strong at that. And you don't have to be using MPI-IO to get the benefits of WEKA; the software was designed to saturate the underlying hardware we're running on. So it's a different kind of use case in my mind. Now, there could be some overlap when you talk about a non-parallel-file-system workflow, and if you look at the differences, I'll just focus on the differences: you're going to have to make a decision about what makes the most sense. We're software, and we run on a variety of different hardware platforms. We're partnered with Dell, we have a strong partnership with Dell, so we can run on Dell server platforms, and we can integrate with Dell object store as well. So again, it goes back to what the right solution is for the use case. If it's high performance, you're pushing a boundary, and you're building a modern AI/ML workflow, then, and maybe I'm biased, I have a WEKA shirt on, we should talk, because that is the space we're in and what we're very, very focused on. But if you're looking for something that has a number of features built over tens of years in those platforms, and you have an enterprise-style workflow, that's not something we want to go after. So again: what is the fit, what does the use case look like, and what makes the most sense? I hope that helps clarify it. We're a parallel file system and those platforms aren't; that's the first criterion, and then you get into the details and variations based on the use case.
45:50 Excellent, thank you. Any other questions from the audience?
45:57 Well, panel, thank you very much. Great job, appreciate it. Thank you. [Applause]