almost 2 years ago - EVE Online

Transcript (by Youtube)


1s uh hi guys so my name is nick herring
3s i'm the technical director of
4s infrastructure for eve online at ccp
6s games which is a very long title for
8s cosmic plumber um we're going to be
10s going over kind of eve online
13s the last 20 years of development what
15s that looks like what we had to work on
17s to kind of modernize the eve online
20s tech stack at least from the perspective
21s of the server side and the network
23s portions of the client side
25s um
27s and we'll go over more of what the
29s topology is and the evolution of that
31s topology kind of how eve originally
33s started and how it's gone from
35s uh
36s from 2003 to what we have now
38s and
39s we're going to talk about how we tried
41s to fundamentally or are fundamentally
43s changing how we actually work on eve
45s online
46s there's multiple pieces to that but the
48s two biggest ones being a technical
49s aspect and a cultural aspect
51s and the cultural aspect is a pretty big
53s part of it
55s and hopefully
57s we don't have to go too fast here
58s because right after this we have a round
59s table um but the round table is more for
62s anything else so any kind of quasar
64s specific stuff we can talk about here
65s hopefully at the end of this if there's
67s time for questions uh and then
70s afterwards in the round table we can talk
71s more about other things like quasar and
73s esi and how they interact and what
74s makes sense there and any kind of
77s other technology that we're using on the
78s server side
80s so we can start with
81s 20 years of eve development uh it was
84s released in 2003. you guys all know this
88s right now there are roughly over 2
90s million changelists in perforce
92s that number is probably growing faster
94s and faster as we add more and more
96s automation into the ecosystem so there's
98s less and less humans actually making
99s changes to the code base
102s and
102s we've kind of added a little reference
105s of how much code there is and i've added
107s this silly reference of the ue4s so if
110s you take the code base of unreal engine
112s 4
112s you can kind of get an idea of how much
115s of that code is is being used there it
117s means absolutely nothing it's just fun
118s to think about
119s um
121s and so if we think about like the c and c
122s plus plus that's where a lot of the
124s rendering code is that's where the
126s simulation code is
127s and that's a lot of where the the glue
129s is from uh
131s c to python marshalling and back and
133s forth
134s so that's roughly 1.7 million lines of
137s code
138s then the next up would be sql so a lot
141s of eve is run by basically sql procs a
144s lot of the logic is
146s unfortunately um
149s and so we can see that
150s just our sql code alone is the size of
153s ue4
156s and then if we continue on
158s where a bulk of the logic was written uh
161s in python in stackless python
163s um we can see that there's 3.4 million
165s lines of code a little bit more than a
167s ue4
169s and then if you weren't worried enough
170s yet
171s we have roughly
173s 53 million lines of yaml
176s um
177s this is 24 unreal engines uh worth of
180s code code's a strong word but um
185s this this is what holds everything to do
187s with uh how the universe is authored any
190s anywhere from how the spaceships are
192s made uh and authored as far as the
194s attributes are concerned and there's a
196s lot of work done there and and this this
198s would have looked a lot differently uh
200s from
201s probably about three years ago i think
203s where we had a team internally that was
205s jamming on moving all of that out of sql
208s into a binary file system that we could
210s work with and author with that way we
211s didn't have to deal with things like
213s promoting between databases and instead
215s the files traveled with the actual
217s branch itself
219s so that's just a little bit of idea of
220s kind of
221s the momentum that is eve online and what
224s we have to deal with when we want to
226s make a foundational change to how the
228s system works
230s and so looking at the beginning here
232s it is deceptively simple
234s uh most of you guys know about this i've
236s seen this in a different form
238s uh we have the concept of sol nodes
242s that is part of the monolithic code base
243s that we have sol nodes can take on any
247s role as it were it could be servicing
250s corporation requests alliance requests
253s wallet uh the actual literal location of
256s the solar system the simulation within
257s that so we can swap in and out of those
259s roles and a lot of the orchestration
260s usually happens on the sol node level
262s uh then we also have the proxies which
264s are kind of dedicated to
267s managing the connections coming into the
269s system and that's important because
271s ultimately that wasn't the like
273s that wasn't the original version of this
276s and and we have some other ccps here who
279s actually just rejoined that were in the
280s original team that worked on this they
281s can probably pick this apart a little
282s bit more but proxies didn't originally
284s exist they were built out of necessity
287s basically as soon as people went over
288s 100 people on eve they were like we have
290s to do something about how this connects
292s because ultimately
295s this represents kind of the the topology
297s of carbon io so everything is
300s is a mesh network so it's a guaranteed
302s one-hop mesh network which becomes a
304s quadratic problem almost immediately
307s when you're trying to deal with resource
308s management and those kind of things but
310s it's very powerful in the sense that a
312s lot of how eve works is about
313s deterministic routing
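The quadratic growth of a one-hop mesh can be made concrete with a little arithmetic. This is a minimal sketch, assuming every node keeps one link to every other node; the function name is made up for illustration and is not part of carbon io.

```python
def full_mesh_links(node_count: int) -> int:
    """Count the point-to-point links needed when every node talks directly to every other node."""
    return node_count * (node_count - 1) // 2


for nodes in (10, 100, 1000):
    # Ten times the nodes costs roughly a hundred times the links.
    print(nodes, full_mesh_links(nodes))
```

Going from 10 to 1000 nodes takes the link count from 45 to 499500, which is why resource management gets hard almost immediately.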
316s that's contrary to modern technologies
318s around things like rabbitmq or nats or
321s kafka those types of things where you
322s have more dynamic routing slightly
324s different but
325s eve does a lot of things where it knows
327s your character id
328s and by virtue of knowing your character
330s id it doesn't need to ask the ecosystem
332s where to go it can make a very educated
335s guess and get to the right node for the
337s right information and that's very
338s powerful early on
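As a rough illustration of deterministic routing, the sketch below picks a node purely from the character id, with no lookup service in between. The modulo scheme and the names `node_for_character` and `node_ids` are assumptions for the example; the talk does not say how eve actually maps ids to nodes.

```python
def node_for_character(character_id: int, node_ids: list[int]) -> int:
    """Pick a node from the character id alone, so every caller reaches the same answer."""
    if not node_ids:
        raise ValueError("at least one node is required")
    # Hypothetical scheme: a stable modulo over a fixed, ordered node list.
    return node_ids[character_id % len(node_ids)]


nodes = [101, 102, 103, 104]
print(node_for_character(90000001, nodes))
```

Every caller that knows the id and the node list computes the same destination independently, which is the property the talk contrasts with the more dynamic routing of a broker.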
340s and all of this was kind of
343s tackled around
345s carbon io which is kind of the glue for
346s all of it at this point in time
348s and this is a homebrew protocol written
350s in in python so a lot of the networking
352s calls are all pure python and that's
354s where all the traffic is going and being
355s shaped and those kind of things
357s and that goes back into the desktop
358s client over the internet and those types
359s of things
360s another big part of this is the
362s combination of this with i/o
364s completion ports
365s and io completion ports are important in
367s this regard
368s because with stackless python stackless
370s python only does one thing at a time
373s and it's something that i have to remind
374s every engineer it can only do one thing
377s at a time
378s engineers try to do modern techniques of
380s like distributed locks or mutexes
382s whatever the case may be but it doesn't
383s actually matter stackless python only
385s does one thing at a time the other
387s terrifying part is that it's not only on
388s a single core it's on a single logical
390s processor
392s so when we're reinforcing fleet fight nodes
394s for example we're basically just trying
396s to throw as much clock speed at that
398s node as possible it doesn't even
400s actually matter how many cores we put on
401s it
402s and that's kind of one of the ultimate
403s limiting factors that had us start the
405s conversation about
406s what became the idea behind quasar and
409s how we start teasing that problem apart
412s um but i/o completion ports are also
415s important in the sense that it
417s makes a step towards
419s deferring the management of those
421s sockets from python because the more
423s stuff that we get out of python the more
425s one thing at a time that we can do
427s um and so that defers to the kernel and
430s i/o completion ports is a nifty
432s trick it's similar to um
435s polling uh socket polling in linux if
437s you're familiar with that um there's a
439s push pull paradigm asymmetry there um
443s but that's how it works on windows oh by
444s the way all of this is on windows
446s um
451s um and so
453s as this evolved and and these players
456s grew and more and more things started
458s being built inside of this is that part
460s of like the python that we were talking
461s about grew and grew and grew under a
462s nice little folder called script which
465s basically became a large portion
467s of the actual code base
470s what would end up happening is during
471s releases
473s um people started noticing that like
475s forums would go down or web pages would
477s go down like i think the eve wiki at
479s that time would would go down
481s because every release would publish
483s information about the details of eve so
485s i think back then we didn't have the sde
487s and we didn't have any of the apis
489s and so the reflex to that was to build
491s the xml api they're like hey this is
494s this is taking down our web servers and
495s that's bad because when we do releases
497s we want everybody to see all the new
498s information everything we're doing
500s so they built up the xml apis to kind of
502s protect from that
503s and what i
505s i have i haven't really found the exact
507s person or the exact reason this came
509s into existence that's the original
511s reason for it but i don't think people
513s realized at that point in time that what
515s they were doing was effectively making
516s one of the biggest retention mechanisms
519s in eve because it allowed you guys to
521s build on top of that then things like
523s evemon were born
525s importing things through like eft and
528s all those kind of things when we had all
529s that static data but
531s ultimately
533s this was all done xml over http um and
537s it was it was basically a read cache
539s nothing nothing fancy there and i think
540s if anybody remembers the xml api the
542s cache timers on that were horrendous
545s it was i think multiple hours on the
547s actual liveness of the data
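A read cache with a long cache timer can be sketched in a few lines. This is an illustration only, assuming a simple time-to-live per key; `ReadCache`, `load`, and the one hour timer are hypothetical and not taken from the xml api itself.

```python
import time
from typing import Any, Callable


class ReadCache:
    """Serve a stored value until its timer expires, then reload it from the source."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: the backend is not touched
        value = load()  # expired or missing: hit the backend once
        self._entries[key] = (now + self._ttl, value)
        return value


cache = ReadCache(ttl_seconds=3600)
print(cache.get("skills", lambda: ["example skill"]))
```

The trade-off is the one described above: the longer the timer, the better the backend is protected and the staler the data callers see.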
551s and so
553s this got into things like uh
555s managing skill plans through
557s evemon
558s and and you can kind of see an echo of
560s that with skill plans in the game right
562s now which oddly enough is connected to
563s quasar so
565s it's
566s serendipitous that one of the first
568s full features that was 100% on quasar is
572s actually the same third party developed
575s feature that was built outside with the
578s original xml api
581s and so as this kept growing we kept
584s adding more and more things to the
585s ecosystem we started getting
588s oauth 2 because then we had added a
590s launcher and we needed other websites to
591s federate with other information that
592s might be there
594s and so you know then we started getting
596s oauth 2 over http and that ecosystem
598s started to grow
600s and i've kind of simplified this so that
601s monolith services kind of represents
603s sols and proxies that's the the og
605s cluster if you will auth services started
607s growing into a suite of dot-net
610s applications um for managing
614s uh payment information or various other
616s things and this suite of services
618s was actually
620s technically speaking the first
622s external service to the original cluster
625s and so they started dealing with a lot
627s of the problems without understanding
629s or not they understood it without really
632s knowing about kind of the paradigm that
634s a lot of people talk about today when
635s people talk about
637s monoliths to microservices like that is
638s a thing that almost everyone has heard
640s something about at this point in time
643s only a handful of people knew about this
645s or had completed it right the entire
647s world was at this point in time dealing
650s with this problem and nobody had a name
651s for it yet
653s and so this was roughly the first
654s cluster of services that we had live
656s outside the actual eve ecosystem in
658s their own sustained way and
661s also to note all of this stayed inside
663s of our data center so this was still
664s inside of
666s basically metal boxes right next to each
667s other
669s but the problem with those is and this
671s is one of the problems with people that
673s implement micro services
675s is that ultimately like yeah we got a
676s bunch of micro services and they're all
677s connected to the same database oh that's
679s not how that's supposed to work
681s like you've just created another
682s monolith for your database right
684s um and this is ultimately the trap they
686s fell in right
687s that was one of the tricky things
692s and so
694s as we started looking more into kind of
696s like the surface area of how these
698s things were growing
700s this then came on to introducing crest
703s and my history here is a little blurry
706s because i was actually coming into ccp
708s when dust was winding down
710s um but ultimately crest was built
712s because of the concept of dust
714s ultimately the the orbital bombardment
716s was done through a crest call because
718s that went from the psn
719s uh uh network into our network and
722s that's what basically coordinated the
724s the orbital bombardment strikes um
727s the interesting thing about crest is
729s that it was then trying to adopt
731s mentalities at that point in time and in
733s this case it was very very academic
735s mentalities this was a hypermedia
737s restful json if anyone remembers what
739s that is it's basically a
740s self-documenting self-referencing api
744s and the idea behind that was that a
746s robot could then navigate the api and do
749s what it wanted to do
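The idea of a self-referencing api is that a client starts at the root and only follows links the server hands back. The sketch below assumes an in-memory set of documents instead of real http calls; the urls, field names, and `find_mail` are invented for the example and are not crest's actual layout.

```python
from typing import Any

# Hypothetical documents keyed by url; a real client would fetch these over http.
DOCUMENTS: dict[str, dict[str, Any]] = {
    "/": {"characters": {"href": "/characters/"}},
    "/characters/": {"items": [{"name": "Example Pilot", "href": "/characters/1/"}]},
    "/characters/1/": {"name": "Example Pilot", "mail": {"href": "/characters/1/mail/"}},
    "/characters/1/mail/": {"items": []},
}


def fetch(url: str) -> dict[str, Any]:
    """Stand-in for an http get that returns the parsed json document."""
    return DOCUMENTS[url]


def find_mail(root_url: str = "/") -> dict[str, Any]:
    """Reach the mail document without hard-coding any url except the root."""
    root = fetch(root_url)
    characters = fetch(root["characters"]["href"])
    pilot = fetch(characters["items"][0]["href"])
    return fetch(pilot["mail"]["href"])


print(find_mail())
```

A robot can walk links like this easily; a human has to read every intermediate document to learn where the next link lives, which is the paperwork the talk mentions.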
752s humans unfortunately were doing that not
753s robots
754s so that became a lot of paperwork just
757s to use the api and then on top of that
759s dealing with all the changes and the
761s breakages there
762s the implementation here got more
764s interesting those
766s those nodes for crest were effectively a
768s soul node with more logic on them and so
771s they were susceptible to the same
773s scaling problems of basically the one
775s hop mesh network and also implemented on
778s top of stackless python
781s so
782s it also paved the way for write
784s endpoints i think crest is the i can't
786s remember the first write endpoint it
787s might have been like autopilot
789s or eve mail i can't remember exactly the
792s first one
793s but this also paved the way internally
795s for this because this was a big deal
798s not many people cared and i used quotes
800s there because i'm not saying that nobody
802s cared about it but they weren't really
803s bothered by the fact that people were
805s scraping data and not taking down the
806s service sounds great when we started
808s building up crest and were like hey we're
809s gonna allow players to
811s automatically affect things through the
814s api
815s and then it became an entire civil war
817s internally on like what that meant and
819s how we should go about doing that
820s obviously the cat's out of the bag but
822s at that point in time
824s it was more about
826s isolating it to the difference between
828s what we could affect in the universe and
829s what was localized to the player
832s so that still stays to this day there's
834s not anything you can do even in esi
837s where you can affect the universe
840s it has the illusion of that but you
842s don't actually affect the universe until
843s we introduce things like
845s actually manipulating market calls that
847s affect inventory then things start
849s affecting the universe but most
850s everything if i remember correctly
852s is about endpoints that can only affect
854s the state of your character like
856s autopilot contacts eve mail
860s i can't remember all of them um
863s and so this set that precedent there
866s and this kind of introduced yet another
867s point that we were
869s building on and so
871s the problem with this growing surface
873s area
874s ultimately became performance like we
876s were talking about this is all built on
877s top of stackless python we could only
879s scale vertically not really horizontally
882s because the more nodes that we scaled
884s horizontally the less connections we
886s could deal with up front and that became
888s a problem that equation basically didn't
889s work out i think when we did the math on
892s that originally that number came to
893s around a hundred thousand
895s um
896s now we haven't been to that number yet
898s um but uh ultimately that was the the
901s proponent of that like that was what was
904s powering those decisions
906s um
908s and this gets into stackless python oh the
910s gil right so python in general whether
912s stackless or not doesn't matter has what
914s they call a global interpreter lock this
916s is what forces it to do the one thing at
917s a time but it also makes it very
919s powerful in the sense that you don't
921s have the complications of any
923s concurrency paradigms or primitives that
925s you then have to coordinate there's no
926s synchronization ultimately because it's
928s only doing one thing at a time
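The "one thing at a time" model can be imitated with plain generators. This is a sketch of cooperative scheduling only, assuming tasks hand control back at explicit points; it uses generators as a stand-in and is not the stackless python tasklet api. It also says nothing about ordinary os threads, which the interpreter may switch between bytecodes.

```python
from collections import deque
from typing import Generator, Iterable

counter = {"value": 0}


def worker(increments: int) -> Generator[None, None, None]:
    """Bump the shared counter, handing control back after each step."""
    for _ in range(increments):
        current = counter["value"]
        counter["value"] = current + 1  # read-modify-write with no lock
        yield  # the only point where another task may run


def run_all(tasks: Iterable[Generator[None, None, None]]) -> None:
    """Round-robin scheduler: exactly one task runs at any moment."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, drop it
        ready.append(task)


run_all([worker(1000) for _ in range(4)])
print(counter["value"])  # 4000
```

No update is lost because a switch can only happen at `yield`, never in the middle of the read-modify-write. The price is the one the talk describes: extra cores do not help, only a faster single core does.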
932s the database was also a problem here
935s um
936s a big part of why we implemented a lot
938s of the tools that we have today is
940s because when we started introducing esi
943s it basically became the scapegoat for
945s any problem that came up at any point in
947s time
948s to the point where i had a tally board
949s in the office of not esi or esi
954s and so the database becomes a bottleneck
955s for this because ultimately it's the
956s same problem
958s we have all of this concurrency
960s happening at a single location that can
962s only scale up to a certain degree and
964s that's that's why you read all the dev
965s blogs that we have even the recent one
967s about the hardware upgrades where we
969s literally have to throw metal at it to
971s solve some of those problems because the
973s complexity or the density of the actual
975s operations being done in the database
976s can only be mitigated by
979s faster light in this case
983s maintenance uh is another big one here
987s in order to change anything
989s in xml api that was great because it was
990s a standalone service we didn't have to
992s worry about
993s tranquility going up or down however
996s it had the side effect of if anyone
998s changed anything in the database the xml
1001s api didn't know about it so there was a
1003s lot of thrashing in the sense that
1005s endpoints would go down and break and
1007s various other things would mismatch with
1009s certain attributes or whatever the case
1010s may be we still have this problem with
1012s esi right now in various different
1014s places that we're still combating but in
1015s a different way
1017s um
1018s this gets into deployments um
1020s crest could not be affected unless we
1022s change like brought down tq ultimately
1024s uh that's one of the other big pieces
1026s about what we're modernizing
1028s um
1029s uniform criticality this gets back to
1033s what me and ccp tuxford were talking
1034s about in vegas and this this talk is
1036s basically
1037s a status update of the
1040s talk we had in vegas
1041s where we were talking about the concepts
1043s behind this and more the technology that
1044s we're using
1045s and the developer experience that we're
1047s targeting uh less about where we're at
1049s now
1050s and and ultimately what the cultural
1052s changes need to be to achieve that
1054s um and uniform criticality in this sense
1056s means that
1058s everything is priority one
1060s and that's a problem
1062s hey email's not working
1063s well email could not work to a point
1065s where it starts cascading failures
1067s inside the cluster
1069s well now email is definitely priority
1071s one but that's the silliest thing to
1073s have as priority one we would rather
1074s just turn off email instead and deal
1077s with that problem and then turn it back
1079s on when it's ready to go
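Being able to turn one feature off instead of treating everything as priority one might look like the sketch below. It is only an illustration of the idea, assuming a per-service switch with a fallback value; `ServiceSwitchboard` and the "mail" service name are hypothetical.

```python
from typing import Any, Callable


class ServiceSwitchboard:
    """Let operators disable a non-critical service without touching the rest."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, name: str) -> None:
        self._disabled.add(name)

    def enable(self, name: str) -> None:
        self._disabled.discard(name)

    def call(self, name: str, handler: Callable[..., Any], *args: Any, fallback: Any = None) -> Any:
        if name in self._disabled:
            return fallback  # degrade gracefully instead of cascading a failure
        return handler(*args)


board = ServiceSwitchboard()
board.disable("mail")
print(board.call("mail", lambda: ["message"], fallback=[]))
```

With a switch like this, a misbehaving mail service becomes an inconvenience that can be parked rather than an outage that drags the cluster down with it.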
1081s unfortunately eve's not built in this
1082s way however our teams have become
1084s exceedingly efficient
1086s at
1087s building and working in this way but
1089s that is an immense slowdown in how
1091s they build and what they can
1093s build and then we get into the
1094s development aspect of this
1097s domain boundaries became a huge part of
1099s what we started talking about because
1100s ultimately
1102s when you start building something on top
1103s of a soul node domain boundaries
1105s instantly get blurry and if i could
1109s oversimplify what has happened over the
1111s last 20 years with the eve code base
1113s you combine things with
1116s a dynamic language like python
1119s maybe some of my little personal biases
1120s there you have the same you have the
1122s same database you have a single
1123s deployment mechanism and what happens
1125s over time is it doesn't matter how well
1127s you organize or build the code base
1129s because that's the thing about eve is
1131s all the core components are are well
1133s designed in the sense that they're like
1135s what people used to call service
1136s oriented architecture which is now
1138s microservices we've been doing the same
1140s thing since the 70s everybody just keeps
1141s calling it a different thing
1143s [Laughter]
1146s we found the old guy
1151s and so
1152s ultimately it didn't allow you to
1153s actually build those boundaries
1155s everything kind of blurred together and
1157s it became this thread that you had to
1158s pull at which caused all these side
1160s effects
1161s these side effects that no one did
1163s intentionally it's just kind of how they
1164s happen because if you can't isolate the
1166s domains that you're actually working on
1168s you can't really take responsibility for
1170s just that piece
1171s i mean we can talk about how many
1173s different types of mission systems that
1174s we have in eve online
1176s that's because when you go to look at
1177s them like i'm going to add this thing
1178s and you look at it and go nope i'm not
1180s touching that
1181s because it's connected to so many other
1183s things and it's almost always easier to
1186s build something in a separate corner and
1187s then some other pieces build on that
1189s eventually tentacles come out of it and
1190s everything gets woven together right and
1193s keep in mind this is over the course of
1194s 20 years right like this is not
1196s something somebody went you know what
1197s i'm going to do i'm going to connect
1198s every domain into a massive monolith
1200s that's not what people were doing
1202s it's very much a natural evolution of things
1206s and then this gets into data ownership
1208s and this goes back into the database and
1209s how the database works
1211s when we have things like hot tables or
1213s poorly planned queries
1215s it's because of some other services and
1217s what they might be querying about that
1218s information
1220s but that's kind of broken because they
1221s shouldn't be sharing that that shouldn't
1222s be part of the problem
1224s and data leaking out or being consumed
1226s by anything else shouldn't be a problem
1227s either like for example
1229s when we have the
1231s the auth services they were actually
1232s dipping into the same db and connecting
1234s character information and user
1235s information
1236s well that meant that anybody that wanted
1237s to do anything crazy with characters
1239s couldn't because now it affected another
1240s system that they didn't have any actual
1243s agency over
1244s and this is why now you're seeing more
1246s and more changes come through like with
1248s what ccp nomad was saying earlier about
1250s like the skills and what we want to
1252s change in effect there
1253s we're trying to make it easier
1255s to define those boundaries so that
1257s people can make more surgical
1259s foundational changes instead of just
1261s kind of adding on and trying to sidestep
1264s what's already there
1266s and then we get to the cognitive load
1268s which is more about how the developer is
1270s working
1271s and this is what i mean by
1273s we've conditioned our engineers
1275s to keep all of this in their head
1278s when they work on anything
1280s and if any of you have worked at the
1281s different like worked on a project that
1283s has automated testing versus doesn't
1285s have automated testing there's a
1287s radically different mental experience
1289s and motivating factor if it doesn't have
1291s automated testing
1292s i'm not incentivized to make it better
1296s if it does have automated testing i'm
1297s more incentivized to make it better and
1299s even make broader sweeping changes
1301s and going back to the millions of lines
1303s of code that we talked about earlier
1305s there's a lot of missing automated
1306s testing in that but that's also
1307s something that we're working on right
1308s but it's still to create a system or
1310s connect those systems or
1312s make them manifest in the
1314s the experience or or
1317s i regret saying this word but like
1318s illusion of gameplay right because
1321s that's what we're kind of getting at
1322s right we're we're all living in this
1324s wonderful fantasy of flying a spaceship
1325s and those kind of things but to make
1326s that really connect you have to then
1329s deal with all of these other pieces when
1332s it really should just be hey let's
1334s change the way the spaceship flies
1341s so
1344s this kind of led us all of these
1346s different pieces kind of led us into
1347s what we were talking about
1348s with the original idea of quasar we
1350s didn't we didn't know we were going to
1351s build quasar by the way this was kind of
1353s an evolution of how things went
1356s um
1358s ultimately the origin was the eve
1359s swagger interface or the open api
1362s implementation
1364s and
1365s when we started working on that the the
1367s vehicle for that was the actual mobile
1369s application
1370s uh the eve companion app or eve portal
1372s right
1374s and when eve portal came out it was
1375s mostly the fear
1378s was that number one we had all of these
1379s new devices that would come online that
1381s weren't necessarily connected to eve and
1382s it was much easier to connect to all of
1383s this so we needed a way to protect the
1384s cluster which meant we couldn't scale
1387s horizontally with crest because that
1388s would heat up resources and it
1390s definitely wasn't going to be an xml api
1392s because
1393s xml um
1395s and so
1397s we kind of
1399s discovered
1401s a way to sidestep
1403s decades of technical debt by introducing
1406s a message bus and that's kind of the
1407s core piece of where quasar started and
1410s we didn't know this yet but
1411s ultimately you know if we look at you
1413s know going back to talking about eve's
1415s original design like at the core of it
1417s how it's designed there's roughly about
1420s if you if you
1421s so i'm trying to speak about this in the
1423s sense of like a restful api but if you
1425s take the core monolith of eve and try to
1427s actually dissect what's going on there
1429s there's roughly about a little over 300
1431s services internally to just the python
1433s code base it's talking to itself
1435s this is roughly 6 000 endpoints compared
1438s to the 190 that we have for esi and
1441s that's just to power everything that you
1443s see in the actual eve client
1445s and that's hard to keep track of
1448s when you don't have anything
1450s dictating what the domain boundaries are
1452s what the data ownership is
1454s and so that was the big reason why we
1455s chose things like swagger spec which
1457s eventually became open api because we
1459s wanted people to be able to have the
1460s conversation about
1462s what is it that you actually own what
1463s are you building against what's your
1465s contract that you're going to maintain
1466s for everyone else
1472s then ultimately we got into
1474s kubernetes in the cloud space with this
1477s um
1478s what ended up happening was
1481s we were trying to build things against
1483s our data center against hardware like we
1485s at one point in time we were like pxe
1487s booting machines into ibm blade centers
1489s and running kube before v1
1492s um it worked it worked but it was not
1496s sustainable unfortunately
1497s um and then we kind of just one clicked
1500s into gke uh inside of google cloud
1503s that's where we kind of started our
1504s journey with with kubernetes
1507s um and that just allowed us to provision
1510s resources that we would never have
1511s access to
1512s to wield a lot of power that we would
1514s never have access to with
1516s things like
1517s we wouldn't have to worry about the link
1519s speed
1520s of what's coming into our data center uh
1522s versus just making a load balancer and
1524s everything coming in we eventually
1525s landed on amazon for various other
1528s reasons but
1529s that's kind of the journey that took us
1531s there
1532s and then ultimately the message bus was
1533s the core piece of this
1535s and
1536s we chose the message bus over a
1539s service mesh architecture because the
1541s ideas that we had about how this would
1543s evolve
1546s up front a service mesh is very
1549s difficult to get all the right tooling
1551s in place to help people debug and
1553s maintain whereas a message bus gives you
1555s a bottleneck which seems
1557s counterintuitive in the grand scheme of
1558s things but gives you a dedicated
1560s bottleneck to own all the pieces that
1562s are flowing through there and allows
1563s your like your upfront cost as far as
1565s getting other teams on board
1567s to go faster sooner um so this is the
1571s distinction that we made originally this
1572s is also while the world was still
1573s figuring out things like istio
1575s linkerd
1577s envoy ambassador all the other cool
1579s things that are out there now
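The appeal of a single, owned bottleneck is that every message passes one place where it can be observed. The sketch below is an in-memory toy, assuming topic-based publish and subscribe; `MessageBus`, the topic name, and the message fields are hypothetical and do not describe the actual bus behind quasar.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class MessageBus:
    """Route messages by topic and keep one central record of everything that flows through."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[tuple[str, dict]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        self.log.append((topic, message))  # the dedicated bottleneck: one place to debug
        for handler in self._subscribers[topic]:
            handler(message)


bus = MessageBus()
received: list[dict] = []
bus.subscribe("skills.queued", received.append)
bus.publish("skills.queued", {"character_id": 1, "skill": "example"})
print(received, len(bus.log))
```

Publishers and subscribers only know topic names and message shapes, so a team can reason about inputs and outputs without caring where the other side runs.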
1582s and we still we still talk about this
1583s heavily because we're now to the point
1585s where we're emulating a service mesh to
1587s a degree
1588s but we ultimately wanted the teams to
1590s not have to worry about what the ingress
1592s looked like we only wanted the teams to
1594s worry about their domain
1595s and the data they owned so how do we
1598s make it so that they only care about
1599s inputs and outputs that was our primary
1601s goal
1603s this led us to protobuf
1605s after
1606s doing everything in eve portal and with
1609s esi and there's still esi endpoints that
1610s do things through json sorry let me be
1612s clear
1613s all of you guys see json on the back end
1616s we see some endpoints that are doing
1618s protobuf and we see some endpoints that
1619s are doing json
1621s when we started to build and just
1622s basically blitz through the esi spec
1625s and started building everything we built
1626s it all in json
1628s we learned real fast that was going to
1630s be a problem when we didn't have a
1631s schema to really deal with wrangling in
1633s all the data and that's kind of why we
1635s started looking at things like protobuf
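What a schema buys over free-form json can be shown without protobuf itself. The sketch below uses a dataclass as a stand-in for a generated message type, so it is an analogy only; `SkillQueued` and its fields are invented for the example, and real protobuf code would come from a `.proto` file and the protobuf compiler.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillQueued:
    """Hypothetical event shape; every producer and consumer agrees on these fields."""
    character_id: int
    skill_type_id: int
    level: int


EXPECTED_FIELDS = {"character_id": int, "skill_type_id": int, "level": int}


def decode_skill_queued(payload: str) -> SkillQueued:
    """Reject payloads that drift from the agreed shape instead of passing them along."""
    raw = json.loads(payload)
    if not isinstance(raw, dict) or set(raw) != set(EXPECTED_FIELDS):
        raise ValueError("payload does not match the schema fields")
    for name, kind in EXPECTED_FIELDS.items():
        # bool is a subclass of int in python, so exclude it explicitly
        if isinstance(raw[name], bool) or not isinstance(raw[name], kind):
            raise ValueError(f"field {name!r} has the wrong type")
    return SkillQueued(**raw)


print(decode_skill_queued('{"character_id": 1, "skill_type_id": 3300, "level": 5}'))
```

With plain json every consumer has to carry checks like these by hand; a schema moves that agreement into one shared definition.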
1637s and we started looking at things like
1638s protobuf to deal with
1640s performance as well once we realized oh
1642s protobuf has this nice
1643s uh c-plus plus uh mechanism where you
1646s can generate native code that can also
1648s do the serialization for protobuf and
1650s what that means is
1651s we're basically moving everything from
1654s our do one thing at a time stackless
1656s python of writing down messages
1658s and then just throwing that memory at c++
1660s and saying you do this instead
1663s while python can go do the next single
1664s thing it can do which is a huge
1666s performance benefit for us
1669s naturally
1670s this led us to grpc
1673s because ultimately when we started doing
1674s this we started connecting to this as a
1676s server everything was going great this
1678s is how we established like
1679s a lot of what you're seeing from the
1681s data teams a lot of the newer pipeline
1683s around the definitions of those events
1685s what's being basically fire hosed out of
1687s the system is coming through
1689s protobuf into the message bus ecosystem
1692s but ultimately when we started talking
1693s about this more and more we realized uh
1695s we need a way to actually connect
1696s between these systems what makes sense
1698s there we didn't want to maintain a
1699s protocol for this there was no point in
1701s that there were so many to pick from
1703s and protobuf became kind of the the
1705s anchor for this because it was just a
1707s hop skip and a jump away and we could
1708s generate grpc endpoints
1710s um
1712s then we realized oh we can put this in
1713s the client
1714s and that's where the idea of quasar
1716s started when we realized wait we can
1718s close the loop on the entire ecosystem
1720s and sidestep the entire legacy code base
1724s and keep everything inside of kube
1726s inside of golang inside of the message
1727s bus
1728s and not have to deal with anything
1730s that's going i mean we do have to deal
1731s with it all the time like
1732s it's not all sunshine and rainbows we
1734s still have to go in and make sure things
1736s connect and actually manifest in the
1738s universe
1739s the way they're supposed to
1741s and this then got us into domain
1743s services
1744s and
1745s i'm i i really don't like the word micro
1748s services number one because no one knows
1750s what it means uh but also number two it
1752s defines an arbitrary scope to what
1755s you're designing
1757s and this is why we talk about domain
1758s services because it ties it back to the
1760s actual data model like what are you
1762s actually building and what should you
1764s own
1765s an example of this and one of the first
1767s kind of domain services that we built for
1768s eve was skill plans
1770s skill plans owns everything that it does
1772s and it never touches the monolith at all
1775s other than sending out by by proxy of
1778s sending out like other events of like
1779s hey they want to train this skill now
1781s from from the skill plans
1783s um even that might be debatable it might
1785s be going through the client point is all
1787s of that data that you're sharing with
1789s your corporation with all those skill
1790s plans all those different pieces
1792s that's all completely going through
1794s quasar
1795s and we have some other services before
1796s that where they're going through quasar
1798s the activity tracker is another one of
1799s them but it wasn't quite doing the same
1801s thing
1802s and we can kind of point that out here
1805s um
1806s oh wait a minute
1809s yes
1811s um
1813s so this is kind of where we're at now
1815s those are tiny words um
1819s yeah
1820s and so this kind of represents where
1821s quasar is in in kind of the cloud
1823s provider that we have uh we ultimately
1825s have a service gateway which is the
1827s first piece of this puzzle and that is
1829s our authoritative domains this is like
1831s if there's an event inside of this
1833s uh domain it is a fact of the universe a
1836s ship exploded uh this guy bought
1838s something on the market whatever the
1840s case may be and this is what's normally
1841s referred to as east-west traffic in the
1844s terms of kind of your network topology
1846s this is usually within owned
1849s networks uh for that for that company
1851s um
1853s and you can kind of see here where we
1854s introduced the mobile client all these
1855s other pieces that
1856s eventually got pieces of quasar it
1858s wasn't known as quasar at that point in
1859s time but eventually got in
1861s the public gateway then represents our
1862s north-south traffic uh which is
1864s basically anything that egresses or
1866s ingresses between controlled networks so
1869s basically your guys's machine versus our
1871s guys's machine
1873s and those we treat radically differently
1875s because if we
1876s emit an event on the service gateway
1878s it's a fact if your client emits an
1881s event it needs to be statistically
1882s significant
1884s and what i mean by this like when we're
1886s tracking like how people use certain
1888s things within the client like opening or
1890s closing windows or the case may be we
1892s can't trust any of that data it's coming
1894s from an untrusted source and you know
1898s clients get modified every now and then
1900s it seems so we have to take into account
1902s like what is true and what is not so
1903s they have to be statistically
1904s significant
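The split between the two gateways can be sketched in a few lines. This is a minimal illustration, not ccp's implementation: the `Event` type, the gateway labels and the `min_clients` threshold are all hypothetical, and counting distinct reporting clients is only a crude stand-in for whatever "statistically significant" means in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    source: str          # "service_gateway" (trusted) or "public_gateway" (client)
    client_id: str = ""  # only meaningful for client events


def accepted_facts(events):
    """Service gateway events are taken as facts, one by one."""
    return [e.name for e in events if e.source == "service_gateway"]


def significant_client_signals(events, min_clients):
    """Keep a client event name only when enough distinct clients reported it."""
    reporters = {}
    for e in events:
        if e.source == "public_gateway":
            reporters.setdefault(e.name, set()).add(e.client_id)
    return {name: len(ids) for name, ids in reporters.items() if len(ids) >= min_clients}
```

A single modified client reporting something odd never crosses the threshold, while one event from the service gateway is accepted immediately.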
1907s and this is the part where we started
1909s talking about internally
1911s where ultimately the desktop client
1913s isn't the only client and this started
1916s opening the door for how we talk about
1918s the future of
1919s of how eve works and what happens and
1921s then what we build
1923s um where we started talking about eve
1924s portal the websites the third-party apps
1927s that you guys are constantly building um
1929s all of those pieces it it means that you
1931s could play potentially play eve
1934s from more than just the desktop client
1937s so part of this was proprietary to
1939s standards we're talking about things
1940s like our original like carbon io that
1943s proprietary python
1946s protocol going into things like protobuf
1948s grpc those kind of things over amqp or
1950s google pub sub or nats or whatever the
1952s case may be i skipped ahead that's the
1954s message bus one
1956s but that ecosystem is also the big part
1958s of this when we talk about things like
1960s what ccp no man was talking about with
1962s the air career program right
1965s all of those pieces that we're doing
1966s there aren't specifically for the air
1968s career program right now they are
1971s but all of those extra events all of
1972s those things that are being tracked
1973s those are all pieces that we can reuse
1975s within that ecosystem so the more and
1977s more pieces that we have this is kind of
1978s the original uh
1980s ignition of of the activity tracker
1982s while the activity tracker didn't act on
1984s these things it still tracked all of
1985s them and then it gave us all of this
1987s extra information on how to react and
1989s how to build upon that information
1991s that's already been thrown around the
1993s the uh the ecosystem
1995s and and we prototyped these a while back
1996s i think
1997s i think there was even a fan fest where
1999s we put up arbitrary data or that
2001s arbitrary data we put a prototype data
2003s on a kill mail system and everybody lost
2005s their minds over logi or logi info on
2007s the uh on the kill mails um
2010s that is also something that we're
2011s looking to do and proceed to but like
2013s that's part of this evolution and part
2014s of the performance pieces that we're
2016s talking about here
2018s this also gave us a ubiquitous language
2020s this was one of the biggest problems
2022s that we had internally
2023s you could build a service or any of
2025s those pieces and go to a separate team
2027s and then go look at that and go i can't
2028s use that when they really could but
2030s there was no ubiquitous language to
2031s communicate that so protobuf gives us
2034s that ubiquitous language in the entire
2035s ecosystem where we can go and say hey
2037s i'm going to make a call here and that
2039s service doesn't care who it is or what
2041s it's for it doesn't have to care about
2043s something inside of that python module
2045s mutating it to something else or
2047s changing something that shouldn't need
2048s to or somebody else deploys a different
2049s version of that our teams are now
2051s building around the concept of you own
2053s this api you need to keep this api
2055s working and if we want to change that
2057s that's a conversation around the actual
2059s language and the domain which then gets
2061s us to our domain services
2063s which is the piece more around
2065s what do you own
2067s what do you iterate on those kind of
2068s things an example of this is skill plans
2070s again
2071s where
2073s we were talking about modifying how
2076s skills work where it's no longer a
2078s queue you're dumping skill points into
2080s uh like you're accruing skill points and
2083s then you do with those what you want you
2084s don't have to actually plan that out
2086s and the evolution of skill plans might
2088s be
2089s that it just becomes the domain service
2090s for skills that might be the natural
2092s evolution of that
2094s we've yet to see that because we're
2095s still learning these pieces and again
2097s these are the services that are still
2099s kind of the first ones of their kind
2101s inside of quasar
2104s so ultimately kind of what did we learn
2105s from this
2106s it gets into the micro versus domain
2108s um
2109s and this kind of gets into
2111s the delineation be
2113s because like
2115s the the biggest problem like if you
2117s think about it
2118s abstractly when you have a game engine
2120s involved in anything
2122s that is instantly a monolith the client
2123s for that for that game is a monolith
2125s there's not much you can do around that
2127s there's a lot of talks around that you
2129s hear about micro uis or micro front ends
2132s or those kind of things that might be
2133s the next evolution that we'll see
2136s but this is basically the difference
2137s between what we like we couldn't use any
2139s of these technologies in eve because
2141s all of those things were detached like
2142s if you use spotify the little bar at the
2144s bottom of it was its own
2146s http call that went to a separate
2147s service whereas in eve that connects to
2150s the proxy which goes with the sol nodes
2151s routes over information goes to the same
2152s database and comes back through
2154s everything was connected right and so
2156s that's the big difference for us and we
2157s want to concentrate on the domains uh
2159s not the individual mechanisms
2162s and then learning the difference between
2164s a message bus and a service mesh
2166s kind of getting to the nuances of
2168s dealing with connectivity ingress how
2170s players connected and kind of getting
2173s that off the table so our devs could
2175s concentrate on other things
2176s and then getting to api
2178s representing the team boundary
2180s not like i you know we kind of have
2182s evolved from this uh building features
2185s in the sense of i need to build all of
2187s these pieces because i need all the
2188s pieces of this feature in order to make
2189s this thing which the side effect over
2191s that over time is you have a lot of
2193s things that are very similar and you
2195s don't evolve the existing ones
2197s as opposed to we own the api for
2199s characters do you need more data need to
2201s change the way that something works then
2203s there's a team that can have a
2204s conversation with that and usually
2205s that's over a pr over protobuf
2208s and ultimately
2210s new technology is easy culture is not
2212s and i and i say that it's a relative
2214s statement like there's a lot of complex
2216s things that we're doing with technology
2218s but the thing that surprised us the most
2220s was kind of people's reaction to that
2221s new technology
2223s some people jumped right in other people
2224s it kind of reflected some deficiencies
2226s that we had and kind of the processes
2228s that we were doing were again going back
2230s to automated testing where we were
2231s pulling people into the spotlight of
2233s like cool where's your test and i'm
2234s going i don't have any you can't you
2236s have to add those things in this
2238s ecosystem and so evolving that culture
2241s to understand like what the progress of
2242s those types of things would be
2245s and this is the question i've gotten on
2247s different podcasts and streams that i've
2248s talked on
2250s why why concentrate on this why not
2252s build more features when i do all of
2253s this like
2255s this is a holistic approach to how we
2258s need to fundamentally fix a lot of
2260s different things in the ecosystem over
2262s time over 20 years of teams isolating
2265s and what the features that they only
2266s need to build
2267s and kind of the the turbulence and
2269s natural ups and downs of a company and
2272s people and people's lives in real life
2274s and those kind of things
2275s ultimately we need to fundamentally
2277s change
2278s how we're working and in order to do
2280s that we need to change the technology of
2282s what we're actually building upon
2284s because if we need to fundamentally we
2286s need to fundamentally change how eve
2287s works and we can't do that unless we
2289s change how we work
2291s um and so quasar is kind of the
2292s fundamental stepping stone uh that we're
2295s using to build more and maintain more of
2297s the the eve universe
2300s the end
2301s thanks
2302s [Applause]
2313s do we do questions here
2317s ah
2319s i couldn't tell if that was somebody who
2320s had the authority to say that or not
2326s yeah go ahead the old
2329s guys daddy
2331s [Laughter]
2340s ah
2341s yeah
2343s yeah so the question is like nadia the
2344s new graphical editor
2346s uh that's being used to build a lot of the
2348s content is it using quasar it is not
2350s specifically using quasar because the
2351s majority of that is client-side
2353s experience mechanisms that's going on
2355s but it is hooked into a lot of the event
2357s loops that are flying into quasar so we
2360s can observe a lot of what's happening
2362s there and so as that team does more and
2364s more that's kind of outside a
2366s unique uh experience for a single player
2370s because that's ultimately like so far
2372s the np is that um once it starts it
2375s kind of going outside of that scope it
2377s will probably wander more into our
2379s territory as far as what we need to
2380s support
2384s you talked about uh
2396s i wouldn't say blameless
2398s no so i mean we so we try to do we try
2401s to do retros for that and i would argue
2403s that a lot of the the team that works on
2404s quasar and the infrastructure teams in
2406s general um
2408s there are
2409s elements of sre there
2412s where
2413s so we do we do rotations on call
2415s rotations but we kind of combine that
2417s with like if you're on call
2419s i'm not gonna care if you don't get your
2421s primary project done i want you to
2422s concentrate on like answering people's
2424s questions of course if there's alerts
2425s something melts down
2426s all those kind of things but if
2428s everything is quiet it's kind of one of
2430s those things of what's making the most
2431s noise make it stop making noise
2433s so we kind of have that sre mentality in
2435s that sense
2437s the other aspect of that is we're big
2438s fans of slos
2440s um and trying to keep track of those
2442s things and seeing things before they
2444s catch fire like being able to see the
2445s smoke before the fire is pretty powerful
2448s and that has to do with a lot of the
2449s tooling that we've introduced the
2450s ecosystem not only just quasar but the
2452s the original code base of eve online
2453s with things like
2455s sentry honeycomb grafana prometheus
2458s there's tons of stuff like that yeah
2461s so the
2463s simulation obviously still needs to run
2471s no thankfully not
2473s i checked
2475s so
2477s what like what
2479s is plan in terms of
2481s splitting out more and more of those
2483s non-simulation
2485s services into those domain services
2487s right so the question is is what is kind
2489s of the forward plan of simulation
2493s based services versus
2495s non-simulation-based services ultimately
2498s this was the original idea
2501s people don't necessarily agree with me
2502s on this but i don't
2504s usually call eve complex it's very dense
2507s there's just a lot there
2508s like if you look at email it's not
2510s complex
2511s but in the grand scheme of things it's
2513s it there's just a lot more going on what
2515s it can interact with and those types of
2516s things
2517s so the general idea was that we clear
2519s the table
2520s of all of these services so that people
2522s could think a bit bigger about what they
2524s could actually build and even that is
2525s kind of the core of what our team is for
2527s we're we're supposed to be a force
2528s multiplier for the developers and the
2531s more specifically the feature teams that
2532s are building all the stuff that you guys
2534s actually do on a daily basis like this
2536s is why i say cosmic plumber if you know
2538s that it's a problem then you don't ever
2540s think about the plumbing in your house
2541s until it's busted right
2542s um
2544s with the simulation pieces
2546s we have some theories about that that
2547s we're very interested in testing
2549s because when we talk about the
2550s performance piece of quasar that's kind
2552s of one of the one of the big pieces
2553s about grpc
2555s when we look at fleet fights
2556s like we're estimating around 30 percent
2559s of the performance there is spent on
2561s multiplexing serialization and
2563s transmission well iocp because it defers
2566s it to the kernel gets rid of a bit of
2568s that but it doesn't because it still
2569s needs to interact with the socket
2571s serialization is still in python which
2573s is very slow
2575s and multiplexing meaning
2577s 7000 people in a system one person goes
2579s bang and there's 6999 other messages
2582s that we need to send
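The multiplexing cost described here is simple arithmetic: one event in a system with n pilots means n - 1 outgoing messages. The sketch below shows that count, plus one common way to keep the cost down by serializing the event once and reusing the bytes. The function names are hypothetical, `json` stands in for the real wire format, and the serialize-once idea is an illustration rather than something the talk says eve does.

```python
import json


def fanout_messages(pilots_in_system, events):
    """Each event is sent to every pilot except the one who caused it."""
    return events * (pilots_in_system - 1)


def fan_out(event, recipients):
    """Serialize the event once and reuse the same bytes for every recipient."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return [(recipient, payload) for recipient in recipients]
```

With 7000 pilots, `fanout_messages(7000, 1)` gives the 6999 messages mentioned in the talk, and the total grows linearly with every additional pilot who fires in the same tick.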
2583s what we've done with quasar with grpc
2586s mechanisms in the server is that it's
2588s offloaded to a separate thread so
2589s basically because of the python to
2593s c++ mechanism in protobuf that
2595s just comes stock with it we just have to
2597s marshal
2598s memory over and then we have a separate
2600s thread
2601s crazy eve has a separate thread to do
2604s something where it's actually doing the
2606s serialization and the transmission so we
2608s get it for free
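The hand-off pattern can be sketched with a queue and a worker thread. This is only the shape of the idea: `SerializerThread` and the `send` callback are hypothetical names, and `json` stands in for protobuf. A pure-python worker like this one still shares the interpreter lock with the game loop, so the benefit described in the talk depends on the serialization running in native code outside python.

```python
import json
import queue
import threading


class SerializerThread:
    """Takes messages off the caller's hands and serializes them elsewhere."""

    def __init__(self, send):
        self._queue = queue.Queue()
        self._send = send  # callable that receives the serialized bytes
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, message):
        # The caller only enqueues and returns to its own work right away.
        self._queue.put(message)

    def close(self):
        # A None sentinel tells the worker to finish after draining the queue.
        self._queue.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:
                break
            self._send(json.dumps(message).encode("utf-8"))
```

The caller's cost per message is one `put` on the queue. Serialization and transmission happen on the worker in submission order.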
2611s and then we lean heavily on the message
2613s bus ecosystem which is where the dynamic
2616s mechanisms come in and there's a bigger
2617s conversation that we could have around
2618s like
2619s if you'd be surprised the the features
2622s that require the most complex routing
2624s mechanisms one of them that highlights
2626s this is shared bookmarks
2628s from a routing perspective that becomes
2629s a nightmare real fast
2631s and it's one of the few things that are
2633s actually implemented on the proxy side
2634s because it needs all of that information
2637s i don't answer your question
2641s i think more of it
2645s is that like would it make sense for a
2647s team to say you know what i want to get
2648s email
2653s right
2654s so would we would we preemptively move
2656s things over into quasar um yes if
2659s there's a vehicle for it we're just not
2662s there to go and like we're gonna
2663s refactor everything
2665s no one's gonna sign up for that right um
2669s so it comes with a vehicle like what are
2671s we doing like for example a lot of the
2673s work that we've done under the hood for
2675s chat to get it out of xmpp in the
2677s current state that it's in
2679s is that behind the scenes we've had to
2680s build a presence service
2683s which has to know where people are in
2685s eve at all times authoritatively which
2688s hilariously we found is very difficult
2691s um so but we need something to motivate
2693s those types of changes then we'll go
2694s back into them like if if the if the
2696s skills revamp starts touching on things
2698s like characters and there's enough
2700s traffic there we might want to pull
2701s characters into a service inside of
2702s quasar but that would be
2705s significant open-heart surgery right
2707s so what you're saying is
2710s our services
2714s yes go that way
2719s uh i saw a talk a few years ago
2722s similarly titled
2723s and uh
2724s the speaker talked about adopting
2727s my memory
2730s yeah was it did it have an orange bus
2732s icon in this in the
2734s yeah that was being tuxford in vegas
2749s oh i could ramble about this for a while
2750s but the short version of that is
2751s basically beam is kube before kube was a
2754s thing
2755s um and the only difference really is
2757s kind of the api
2759s and this is kind of the trend that i'm
2761s seeing in technology in general is that
2763s the implementation doesn't really matter
2764s it's the apis that matter so prometheus
2766s for example everyone loves the apis for
2768s prometheus and how to aggregate data and
2770s how to transmit remote data those kind
2772s of things everyone also hates the
2774s implementation of prometheus because it
2776s eats all the ram and most people don't
2778s take into consideration cardinality and
2780s those types of things so for things like
2781s erlang elixir and beam like that whole
2783s ecosystem is actually quite amazing but
2785s it's not compatible with anything
2788s current in that sense and it also
2790s doesn't provide a good
2792s external control plane with kind of what
2794s the rest of the world is used to i think
2795s that's the big difference like we tried
2797s running you know
2799s uh beam inside of kube but doesn't make
2800s sense because beam wants to own the
2802s hardware and then it clusters itself and
2804s all those nice things but that it
2805s requires your entire ecosystem to be
2807s inside of erlang or elixir and that's
2809s kind of where the success of kube came
2810s from because it gave it gave everyone
2812s primitives to to have that ubiquitous
2814s language to have a conversation across
2816s basically the entire globe that's why it
2818s caught
2829s fire one system and i know there's been
2832s significant hardware upgrades
2836s so much as the technology needs to
2838s develop what have you guys done or what
2841s you guys think needs to be done to
2849s so how do
2850s i'm oversimplifying but how do we make
2852s things go faster with quasar
2854s um
2854s ultimately this kind of goes back to
2856s what we were talking about earlier when we were
2857s talking about the the effects of
2860s transmission serialization multiplexing
2861s those types of things um
2864s what i'm trying to do and we're toying
2866s with and then playing with the idea of
2868s is
2869s sending simulation frames over quasar
2871s because we know one that's already
2873s significantly faster just over the wire
2876s it's significantly faster
2879s theoretically we know we can then free
2881s up 30 percent of the processing time during a
2884s massive fleet fight that that is our
2885s upper bounds of what we could
2886s potentially bring to the table but that
2888s is a
2889s non-trivial project
2891s and literally reassembling the train as
2893s it's going down the tracks um so
2896s we haven't engaged in any of this yet
2898s and again this comes with the clearing
2899s the table concept of like we'll keep
2901s moving things off the table which
2903s in effect will give us some certain
2905s percentage of there's less things that
2906s this needs to do
2908s but in the grand scheme of things you at
2909s the end of the day you still wind up
2911s with a node dedicated to jita
2913s and and it doesn't matter how many
2915s services we take away at that point
2916s there's still a node dedicated to jita
2918s even if it's only for the uh simulation
2921s aspect of it
2923s um so that's kind of why we're toying
2924s with the ideas behind like we could send
2926s simulation frames over this and get a 30 percent
2929s bump in how we're doing things and
2931s there might be some other things in
2932s there like the things that we've talked
2933s about in the past and this is all
2934s theoretical
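To put a number on that upper bound: if roughly 30 percent of the node's time in a fleet fight goes to multiplexing, serialization and transmission, and all of it could be moved elsewhere, the remaining work fits in 70 percent of the original time. The helper below is a plain amdahl's-law calculation under that assumption. `max_speedup` is a hypothetical name, and the result is a ceiling, not a measurement.

```python
def max_speedup(offloaded_fraction):
    """Best-case speedup when a fraction of the work is removed entirely."""
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


# 30 percent offloaded leaves 70 percent of the time: 1 / 0.7, about 1.43x.
ceiling = max_speedup(0.30)
```

Freeing 30 percent of the time would allow at most about 43 percent more simulation work in the same wall-clock time, and only if the offloading itself cost nothing.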
2936s eve is 100 percent accurate
2938s but it doesn't necessarily need to be
2939s and i know that's a terrifying statement
2941s um
2942s because when you have seven thousand
2944s people shooting that one guy at some
2945s point time you gotta go he's dead like
2947s stop
2948s stop counting the bullets he's gone
2951s uh but eve keeps going yep still dead
2954s still still dead
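The "stop counting the bullets" idea amounts to an early exit in damage resolution. The sketch below is purely illustrative, in the same theory-crafting spirit as the talk: `apply_hits` is a hypothetical function and says nothing about how eve's rules engine is actually structured.

```python
def apply_hits(hit_points, hits):
    """Apply hits in order and stop as soon as the target is dead.

    Returns (remaining_hit_points, hits_processed).
    """
    processed = 0
    for damage in hits:
        if hit_points <= 0:
            break  # target is already dead, skip the remaining hits
        hit_points -= damage
        processed += 1
    return max(hit_points, 0), processed
```

For a target with 100 hit points and four hits of 60, only the first two are processed. The trade-off is that the skipped hits are never accounted for, which is exactly the loss of accuracy the talk calls a terrifying statement.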
2955s um so there's there's maybe some other
2957s like philosophical things that we could
2958s take on like how we deal with the rules
2960s engine and the simulation in that regard
2962s but this is all theory crafting because
2964s again it comes back to
2966s the the vehicle that we have to move
2967s forward with those things but i'm i am
2970s personally chomping at the bit to find
2971s something to hook that to that we can
2973s toy with that idea
2974s it might the first iteration of that
2976s might be something like we don't send
2979s like the data that comes in for the gate
2980s holograms
2981s like the state on the other side of the
2983s gate
2984s we might toy with the idea of routing
2986s that over quasar like start simple there
2988s instead kind of again the clear the
2990s table mentality of like how far deeply
2992s can we go into that that space
3006s right so what domain services have been
3008s put into quasar um
3009s so
3010s there is a chat service that we haven't
3011s rolled out yet it's kind of been a
3013s shadow service
3014s some players have already found that
3017s um
3018s and skill plans is another one activity
3021s tracker was the original one but
3022s activity tracker is not necessarily a
3024s 100 percent quasar service in the sense that
3026s uh it smuggles data through the original
3029s carbon io connections uh because it was
3031s built before we had the connectivity to
3033s the client
3034s so we're like oh we can consume and
3035s track all these events that are coming
3036s in but we can't tell anybody so we just
3039s sent it back through the server itself
3040s down to the client that's something that
3041s we could uh probably renovate or that
3043s will come with the changes that we're
3045s doing for the air career program um the
3047s air career program will be another one
3049s that's 100 percent uh quasar uh the one we were
3051s talking about earlier which isn't really
3052s player-facing but the
3054s um
3056s presence management which we normally
3058s we're doing in xmpp
3060s which
3062s fun fact
3063s 90 percent of the traffic in xmpp for us is
3065s presence not chat
3067s it's just telling everyone where they
3069s are that's that's the biggest
3070s multiplexing problem that we have
3072s um there's probably some other ones i'm
3074s forgetting but those are the
3076s i think uh
3079s data
3081s oh yeah for like the data pipelines for
3084s uh for data and analytics um
3086s do we still do the recommendation stuff
3089s yeah
3090s the recommendation like so the
3091s recommendations that you get the three
3093s recommendations that you get if you get
3094s into is that still feature flagged i
3096s can't remember
3098s it's for everybody so yeah so you those
3099s three recommendations that come in
3100s that's actually closing the loop from
3102s the client to quasar to
3105s the data cube if you will warehouse lake
3108s i don't know any of the data terms um
3110s and then that's coming back through
3111s quasar to the client saying hey you want
3112s to do one of these three things based on
3114s what you've been doing in eve
3116s uh i think that's it you know my guys in
3119s here that i can
3120s yeah we'll just stop there
3122s last question
3124s that guys
3149s right
3150s so the question is like if we have all
3152s these things emitting events and parts
3153s of them come down and go back up how do
3155s we deal with integrity in that regard so
3158s there are massive papers that you can
3160s read on that that are really boring but
3161s event sourcing is the answer to that
3163s question
3164s um
3165s a lot of how we deal with that is mostly
3168s a little bit more than best effort
3170s delivery and what i mean by that is best
3172s effort is usually like it's on the
3173s socket good luck um
3175s so we also have a little bit more than
3177s that where we do a lot of disk queuing
3178s and mechanisms for like publisher
3180s confirms with rabbitmq so we basically
3182s say hey rabbitmq send this to people and
3184s tell me when the first guy got it and if
3187s that doesn't happen it goes to disk and
3188s we retry so this usually manifests in
3191s that fail state as as a thundering herd
3193s or a stampede basically um where i think
3197s we've talked about this on twitter
3198s during certain interesting situations
3200s where it's like yep we're now draining
3201s 50 million events because something fell
3204s over
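The "a little more than best effort" delivery described above can be sketched as: publish, wait for the broker's confirm, spool to disk when the confirm does not arrive, and drain the spool later. This is a toy model under stated assumptions. `FlakyBroker` is a stand-in whose `publish` return value plays the role of a publisher confirm. It is not the rabbitmq client api, and `SpoolingPublisher` and the spool file format are hypothetical.

```python
import json
import os


class FlakyBroker:
    """Stand-in broker: publish() returns True only when delivery is confirmed."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def publish(self, body):
        if not self.up:
            return False
        self.delivered.append(body)
        return True


class SpoolingPublisher:
    def __init__(self, broker, spool_path):
        self.broker = broker
        self.spool_path = spool_path

    def publish(self, event):
        body = json.dumps(event)
        if not self.broker.publish(body):
            # No confirm: append the event to the on-disk spool for a later retry.
            with open(self.spool_path, "a", encoding="utf-8") as spool:
                spool.write(body + "\n")

    def drain(self):
        """Retry spooled events in order; return how many were confirmed."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as spool:
            pending = [line.rstrip("\n") for line in spool if line.strip()]
        sent = 0
        for body in pending:
            if not self.broker.publish(body):
                break
            sent += 1
        # Keep whatever is still unconfirmed on disk.
        with open(self.spool_path, "w", encoding="utf-8") as spool:
            spool.writelines(body + "\n" for body in pending[sent:])
        return sent
```

The drain step is where the thundering herd comes from: everything that piled up on disk during the outage is replayed at once when the broker comes back.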
3205s but that's the big difference between
3206s like i was saying earlier with uh the
3208s the events in the universe that are
3210s facts
3211s those are the ones that we treat uh uh
3214s with more respect i guess if you will uh
3216s those are the ones that we trust if that
3218s thing comes through it's true whereas
3219s the events that are coming from the
3220s client you have to be statistically
3222s significant because if it falls over we
3224s don't care
3227s they do that's true yes yes they do yeah
3234s indeed slo is service level objective uh
3237s there's also slis which are indicators
3240s and these are different from alerts in
3241s general this gets into sre stuff but
3244s ultimately it's more about
3246s knowing that your system is trending
3247s poorly versus something terrible has
3249s already happened
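The difference between an sli, an slo and an alert can be made concrete with a small error-budget calculation. The names and numbers below are hypothetical and the formula is the usual sre convention rather than anything specific to ccp: the sli is the measured ratio, the slo is the target, and the budget is how much failure the target still allows.

```python
def availability_sli(good_requests, total_requests):
    """Measured indicator: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget left; negative once the slo is blown."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure
```

With a 99.9 percent target and 9995 good requests out of 10000, half of the budget is gone. Nothing has caught fire yet, but the trend is visible, which is the smoke-before-fire point made earlier in the talk.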
3251s yeah i think you said that was the last
3252s one all right thanks guys appreciate it
3254s we also have a roundtable after this