Darren

This is a simulator for a Magic: The Gathering-like card game my friends and I worked on, with an emphasis on abstraction and modding. It's currently a very rough draft, but most of the game logic works, and you can play with it yourself! Shipped at replit.com/@terezipyrope/mtgalike; source code & README at github.com/SuffixAutomata/mtgalike
https://scrapbook-into-the-redwoods.s3.amazonaws.com/f3451e04-1ed2-4763-a7e7-11a953dac329-image.png
https://imgutil.s3.us-east-2.amazonaws.com/bad72fbc3e28ed404ab9990338c09f930ede41c4b13c91cd3bb74b12ea2674ce/f9072bda-9e53-4466-b06f-08668f903c5c.png
This is a waltz I composed recently. The idea was to have a piece that started out in 3/4 waltz time and then became a hybrid of two waltzes on top of each other, but it grew a bit, well, dense. The current ending is provisional: this will probably become the first revision of a larger-scale piece I'm going to publish in the future, but I'm also satisfied with it as is, even if I'm going to work on it more. Project repository at gitlab.com/terezi/waltztest/-/tree/main?ref_type=heads and score at gitlab.com/terezi/waltztest/-/blob/main/waltz.pdf?ref_type=heads
https://scrapbook-into-the-redwoods.s3.amazonaws.com/5695e589-b1ed-4c58-aafb-e3d31f7e9d01-image.png
This is release 2.3 of my scourge cellular automaton search program, implementing a novel search order that addresses the ludicrous memory requirements of previous versions by removing the left-right matching redundancy of nodes in the search tree. Much of the work involved getting Mongoose to work with the new, more efficient binary transmission format and changing the way the worker clients handle different trees; this lays the groundwork for multiple trees running simultaneously in future versions. I haven't tested this thoroughly yet due to some server issues, but from what I can run so far, its efficiency is an amazing improvement over 2.2. Next steps include ensuring that the scaling is actually effective, among other things. gitlab.com/terezi/scourge/-/tree/bisectTest?ref_type=heads
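The left-right deduplication idea can be illustrated with a small sketch (hypothetical Python, not scourge's actual data structures): store each search-tree row under a canonical key, so a row and its mirror image collapse into a single node instead of two.

```python
def mirror(row: int, width: int) -> int:
    """Reverse the bits of a width-bit row."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (row & 1)
        row >>= 1
    return out

def canonical(row: int, width: int) -> int:
    """Canonical representative of the pair {row, mirror(row)}."""
    return min(row, mirror(row, width))

def dedup(rows, width):
    """Deduplicate rows: each mirrored pair collapses to one node."""
    return {canonical(r, width) for r in rows}
```

Keying the node store on the canonical form roughly halves the number of stored nodes for asymmetric rows, which is where the memory savings would come from.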
https://scrapbook-into-the-redwoods.s3.amazonaws.com/2ed43132-193e-40ec-8034-601f713c84ae-screenshot_2024-08-30_17-50-02.png
This concludes the work I've done on my Distributed Tokenized Attention project during Arcade. I learnt NCCL and successfully implemented the Tokenized Attention multi-GPU forwards pass, distributing it across multiple GPUs; testing is still underway. I also thoroughly optimized the kernels of the single-GPU version, achieving 110x and 200x speedups over the CPU for forwards and backwards respectively, and I plan to continue these optimizations in the multi-GPU version. I also wrote a presentation, both for the Scrapbook and for a talk I was invited to give. The repository is at gitlab.com/terezi/DTA/-/tree/main?ref_type=heads and the presentation is at gitlab.com/terezi/DTA/-/tree/main?ref_type=heads.
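As a rough illustration of one way to distribute an attention forwards pass (a NumPy simulation with made-up shapes, not the project's actual NCCL/CUDA code, whose partitioning scheme may differ): each rank computes attention for its own shard of heads, and a concatenation stands in for the all-gather collective.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # q, k, v: (heads, seq, dim)
    s = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(s) @ v

rng = np.random.default_rng(0)
heads, seq, dim, world = 4, 6, 8, 2
q, k, v = (rng.standard_normal((heads, seq, dim)) for _ in range(3))

# Reference: one device holding every head.
ref = attention(q, k, v)

# Simulated ranks: each computes a contiguous shard of heads; concatenating
# the shards plays the role of an all-gather over the head dimension.
per = heads // world
shards = [attention(q[r*per:(r+1)*per], k[r*per:(r+1)*per], v[r*per:(r+1)*per])
          for r in range(world)]
gathered = np.concatenate(shards)
```

Head-parallel splitting needs no communication inside the attention itself, which is why the collective only appears at the end.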
https://scrapbook-into-the-redwoods.s3.amazonaws.com/20b06503-1761-4fa9-9222-7e9a10ad5b49-image.png
This is the third phase of my Tokenized Attention project. I have implemented and tested the single-GPU CUDA implementation of the forwards pass, and implemented (though not yet tested) the backwards pass. The biggest difficulty I faced was my extremely old GTX 750 GPU, which caused default invocations of the compiler to produce binaries that silently failed to launch any kernel, without warning me at all, which was fun to debug. I also learnt the internal representation of the PyTorch MultiheadAttention module and implemented a naive version using PyTorch - exposing the fact that I had somehow forgotten a negative sign in my Q and K gradient calculations - and to my great dismay I found it was 30 times faster than my CPU implementation, though I may take solace in the knowledge that my implementation is significantly more memory-optimized than PyTorch's. The next step will be testing the CUDA backwards pass and writing the multi-GPU forwards and backwards passes, based largely on the CPU multithreading code, as well as writing sparse alternatives to cuBLAS. gitlab.com/terezi/DTA
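The kind of sign error described above tends to hide in the softmax backward, whose subtracted row-sum term propagates into dQ and dK. As a sanity check (a NumPy sketch in my own notation, not the project's actual kernels), the hand-derived gradients can be verified against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fwd(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = softmax(S)
    return P @ V, P

def attention_bwd(Q, K, V, P, dO):
    scale = 1.0 / np.sqrt(Q.shape[1])
    dV = P.T @ dO
    dP = dO @ V.T
    # Softmax backward: the subtracted row-sum term is the negative
    # sign that is easy to drop when deriving dQ and dK by hand.
    dS = P * (dP - (dP * P).sum(axis=1, keepdims=True))
    dQ = dS @ K * scale
    dK = dS.T @ Q * scale
    return dQ, dK, dV

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((4, 3)) for _ in range(3))
O, P = attention_fwd(Q, K, V)
loss = lambda Q_, K_, V_: attention_fwd(Q_, K_, V_)[0].sum()
dQ, dK, dV = attention_bwd(Q, K, V, P, np.ones_like(O))

# Finite-difference check on dQ.
eps = 1e-6
num = np.zeros_like(Q)
for i in range(Q.shape[0]):
    for j in range(Q.shape[1]):
        Qp = Q.copy(); Qp[i, j] += eps
        Qm = Q.copy(); Qm[i, j] -= eps
        num[i, j] = (loss(Qp, K, V) - loss(Qm, K, V)) / (2 * eps)
```

Dropping the `- (dP * P).sum(...)` term makes the numerical check fail immediately, which is exactly the failure mode a naive PyTorch comparison exposes.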
https://scrapbook-into-the-redwoods.s3.amazonaws.com/b33dc958-419f-46be-80f8-2730cd584a4a-image.png
This is a cadenza I wrote for a piano concerto collab with my friends. I made a visualizer using MIDIVisualizer for everyone to enjoy too! GitLab for log purposes is at gitlab.com/terezi/altcadenza (note for reviewers - I wrote down the notes and the performance outline today, but the themes come from some years ago, and many ideas on textures came from assorted improvisations. I only logged the 5 hours that I could demonstrate were spent working on the music.) Honestly it was so refreshing to work on something other than code, it helped me relax a lot.
This is a continuation of my distributed Tokenized Attention project. These updates were dedicated to thoroughly testing the three current CPU implementations of the Tokenized Attention forwards and backwards passes (naive, single-threaded packed, and multi-threaded MPI packed) against each other, and finally ensuring that they pass thorough mathematical and analytical checks. A demonstration of gradient descent was also added, both to further verify the correctness of the forwards and backwards calculations and to show how this code might be integrated with PyTorch DistributedDataParallel. This session also marks the completion of the multi-threaded backwards calculation. Most of the remaining time was spent fixing a vivid assortment of bugs from my last scrapbook, which I had introduced by pushing forward without spending any time on verification. Next steps will be implementing the forwards and backwards single-GPU packed pass on CUDA + HierarchicalKV, and the forwards and backwards multi-GPU packed pass on CUDA + NCCL + HierarchicalKV. gitlab.com/terezi/DTA
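A minimal version of that kind of gradient-descent check (a NumPy sketch assuming plain single-head attention and a squared-error loss; the project's actual demo is not shown here) uses the analytic gradients to fit a random target. If the backwards pass were wrong, the loss would not reliably fall:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
target = rng.standard_normal((5, 4))
scale = 1.0 / np.sqrt(4)

def step(Q, K, V, lr=0.01):
    P = softmax(Q @ K.T * scale)
    O = P @ V
    loss = ((O - target) ** 2).sum()
    dO = 2 * (O - target)                 # d(loss)/dO for squared error
    dV = P.T @ dO
    dP = dO @ V.T
    dS = P * (dP - (dP * P).sum(axis=1, keepdims=True))
    dQ, dK = dS @ K * scale, dS.T @ Q * scale
    return Q - lr * dQ, K - lr * dK, V - lr * dV, loss

losses = []
for _ in range(200):
    Q, K, V, l = step(Q, K, V)
    losses.append(l)
```

A steadily decreasing loss is a cheap end-to-end witness that forwards and backwards agree, complementing per-tensor finite-difference checks.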
https://scrapbook-into-the-redwoods.s3.amazonaws.com/05f690ba-e6a7-4738-94a6-08b4372c442e-image.png
https://scrapbook-into-the-redwoods.s3.amazonaws.com/6a5f196c-c45c-4a02-af66-1cf0f56f9446-image.png
https://scrapbook-into-the-redwoods.s3.amazonaws.com/69a8b0c6-15ab-4baa-ad7a-19c17e0fb26f-image.png
This project details my attempt to implement an efficient distributed algorithm for the Attention calculation of modern Transformer deep learning models - specifically, those used in recommendation models whose inputs are tokens over a vast dictionary, with parts of the embedding distributed across many GPUs. This involved rederiving the entire backpropagation equation for the Attention calculation, as well as comparing three different distribution techniques, on top of dealing with special constraints involving the number of unique tokens. So far I've implemented the CPU versions of both single-threaded packed and distributed packed attention, forwards and backwards (the multi-threaded backwards pass is not yet completely finished); where implemented, all the tests have passed. I'm continually updating the technical report detailing the math behind this implementation. This has been my first time working with MPI, and the next step, an efficient CUDA implementation along with thorough benchmarking, will be my first attempt at working with NCCL. gitlab.com/terezi/DTA
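One way to picture the unique-token constraint with a sharded embedding table (a NumPy simulation with hypothetical names; the real code distributes shards over MPI ranks rather than a Python loop): partition the table by token id, have each rank look up only the unique ids it owns, then scatter the rows back to every position that requested them.

```python
import numpy as np

vocab, dim, world = 100, 8, 4
rng = np.random.default_rng(3)
table = rng.standard_normal((vocab, dim))   # full embedding table

def lookup_distributed(token_ids):
    # "Packing": collapse repeated tokens so each unique id is fetched once.
    uniq, inverse = np.unique(token_ids, return_inverse=True)
    rows = np.empty((len(uniq), dim))
    for r in range(world):               # each iteration = one simulated rank
        mine = uniq % world == r         # ids owned by this rank's shard
        rows[mine] = table[uniq[mine]]   # one local lookup per unique id
    # Scatter each fetched row back to every requesting position.
    return rows[inverse]

ids = np.array([7, 3, 7, 42, 3])
emb = lookup_distributed(ids)
```

Because repeated tokens are collapsed before communication, the traffic per batch scales with the number of unique tokens rather than the sequence length.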
https://scrapbook-into-the-redwoods.s3.amazonaws.com/dd75986c-1550-4ad8-a3cc-05e2d6ca579f-image.png
Hi! This is my second scrapbook post, and a continuation of my first. It continues my project scourge (gitlab.com/terezi/scourge, GitHub mirror at github.com/SuffixAutomata/scourge), a program for searching for configurations that satisfy very special mathematical conditions in Conway's Game of Life. By digging through various rabbit holes pertaining to network latency, error handling, and how the kernel handles sockets, I optimized my project from barely handling 300 connected cores to easily handling 10,000+. This concludes the second phase of the project and demonstrates the robustness of my implementation of a lightweight, extremely low-latency, high-throughput computing framework, which is already far beyond my initial expectations. The next phase will involve moving some algorithmic developments I made pre-Arcade onto this framework and testing its scaling; that part will be even more theoretical and involve lots of adjustments. Sorry for the sessions that look like I did nothing, or where I sent scraps like five hours after the session ended: the work this session was mainly on remote servers and login nodes for distributed computing clusters, and I very easily get hyperfixated and forget I'm supposed to log and send scraps. I did my best to include the computation results of what I was adjusting each session, though.
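The actual server is built on Mongoose, but the kernel-level idea that makes thousands of connections cheap - readiness-based I/O over non-blocking sockets (epoll on Linux) - can be sketched with Python's selectors module (all names here are illustrative, not scourge's):

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll-backed on Linux

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)          # echo worker traffic straight back
    else:                           # empty read = peer closed
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))
server.listen(1024)                 # large backlog for bursts of workers
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

def poll_once(timeout=0.1):
    # One event-loop turn: dispatch every socket the kernel says is ready.
    for key, _ in sel.select(timeout):
        key.data(key.fileobj)
```

No thread is ever parked on a single connection, so the cost per idle client is just a file descriptor plus kernel bookkeeping, which is what lets the same process scale from hundreds to tens of thousands of cores.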
https://imgutil.s3.us-east-2.amazonaws.com/3aba8b1cd93af1cbf78375541eb4b04ad9b840e55fbfde79a6b48fc7177b266e/11710dca-6b6b-491a-ad1a-b94041d34bad.png
Hi! This is my first scrapbook. I've spent 25 hours completely refactoring an old version of my program scourge (gitlab.com/terezi/scourge, GitHub mirror at github.com/SuffixAutomata/scourge), which searches for configurations in Conway's Game of Life, and transforming it into a distributed system that I aim to connect to tens of thousands of computers through platforms like Charity Engine to conduct a massively parallel search. I used Mongoose as the server, which was quite a steep curve to learn as I had no prior experience with JS, web protocols, or server programming, but I managed to get everything running! Next steps will be optimizing via algorithms I've previously developed and making sure it scales well. I attached a video too (the web server will not be public, for privacy and safety reasons).
https://imgutil.s3.us-east-2.amazonaws.com/3aba8b1cd93af1cbf78375541eb4b04ad9b840e55fbfde79a6b48fc7177b266e/949ccae8-8ade-416a-88c1-078573f5b4b7.png