I've been working on a few enhancements to the LivySessionController as part of my ticket NIFI-6175: Livy - Add Support for All Missing Session Startup Features.
I've been running into a lot of... interesting? features in the code, and I was hoping someone (Matt?) who was involved with it from the beginning could shed some light on these questions before I change any default behavior/fix any perceived bugs.
* If NiFi is running in a cluster, a race condition makes it unpredictable how many Livy sessions will actually be created.
* If the controller service was recently running and you just restarted it, everything appears fine: the existing Livy sessions are found and reused. But even this scenario isn't working correctly, because if I ask the controller service to use two sessions, it will use every available session in Livy for the configured "Kind" (more on this further down).
* The logic checks whether any sessions exist at controller startup (and on each update interval). Since all instances of the controller service start at roughly the same time, you might end up with full duplication across the cluster, 50% duplication, or anything in between, depending on how quickly the session-create requests get sent in.
* The controller service will "steal" any open Livy session it can see, as long as the "Kind" of the session matches its configuration. It will also over-allocate if more sessions are available than it needs.
* If there are 10 Livy sessions open, it will load all 10 as available for use, even if I only wanted 2. If some of those sessions die off, it does not create new ones, but it will keep using the survivors for as long as they are available.
* If you have multiple Livy Controller Services, it's very hard (impossible?) to keep their sessions separate when they run under the same account (and maybe even when they don't; I haven't spent much time testing the separate-account option).
* The code does not lock a session or mark it as in use. It relies on the Livy session state value of "idle" to designate a session as available. This is another check-then-act race condition: running ExecuteSparkInteractive on multiple threads (either because you're in a cluster, or simply because the processor has multiple threads) can easily assign the same Livy session twice instead of routing to the expected WAIT relationship.
* The Controller Service is unable to delete existing sessions. So even though there is a Controller Service shutdown hook in the code, it does not clean up its open sessions, and they have to time out on their own.
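To make the idle-state race concrete: a fix would need the controller service to claim a session atomically before handing it to a processor thread, rather than trusting Livy's "idle" flag at poll time. Below is a minimal sketch of that idea (class and method names are my own, not the existing NiFi code), using `ConcurrentHashMap.putIfAbsent` so that two threads can never both win the same session:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: claim sessions atomically on the NiFi side instead of
// relying on Livy's "idle" state, which is a check-then-act race when several
// ExecuteSparkInteractive threads poll at once.
public class SessionClaimSketch {
    // sessionId -> claimed marker (present while a processor thread holds it)
    private final Map<Integer, Boolean> claimed = new ConcurrentHashMap<>();

    // Returns the session id if the claim succeeded, or -1 if another thread
    // already grabbed it. putIfAbsent is atomic, so dual assignment is impossible.
    public int tryClaim(int sessionId) {
        return claimed.putIfAbsent(sessionId, Boolean.TRUE) == null ? sessionId : -1;
    }

    // Called when the interactive statement finishes, returning the session
    // to the pool.
    public void release(int sessionId) {
        claimed.remove(sessionId);
    }

    public static void main(String[] args) {
        SessionClaimSketch pool = new SessionClaimSketch();
        System.out.println(pool.tryClaim(7));  // first caller wins: 7
        System.out.println(pool.tryClaim(7));  // second caller loses: -1
        pool.release(7);
        System.out.println(pool.tryClaim(7));  // claimable again: 7
    }
}
```

A thread that gets -1 back would route the FlowFile to WAIT instead of submitting, which matches the behavior the processor already advertises.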
I don't have resolutions for all of these issues. But one thing I was considering is using the Livy Session Name parameter to tag each session at creation time, so it's associated with a specific Controller Service instance by UUID (which would also work across a cluster), and maybe only managing Livy sessions from the master node? (Not sure how to find out whether you're on the master, but it's a thought.)
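The tagging idea could look something like the sketch below, assuming the Livy release in use accepts the session name parameter on create. The prefix and helper names are illustrative, not anything in the current code: the service embeds its own UUID in the session name when it POSTs to /sessions, and then only adopts sessions whose name it recognizes on each status poll.

```java
// Hypothetical sketch of tagging Livy sessions with the owning controller
// service's UUID via the session name, so each instance manages only its own
// sessions. Prefix and method names are assumptions for illustration.
public class SessionTagSketch {
    static String sessionName(String controllerServiceId) {
        return "nifi-livy-" + controllerServiceId;
    }

    // JSON body for POST /sessions, carrying the tagged name alongside the kind.
    static String createSessionBody(String kind, String controllerServiceId) {
        return "{\"kind\":\"" + kind + "\",\"name\":\""
                + sessionName(controllerServiceId) + "\"}";
    }

    // Filter applied to the session list on each poll: adopt only our own.
    static boolean ownedByUs(String existingName, String controllerServiceId) {
        return existingName != null
                && existingName.equals(sessionName(controllerServiceId));
    }

    public static void main(String[] args) {
        String id = "8f7a1c2e-1111-2222-3333-444455556666"; // example service UUID
        System.out.println(createSessionBody("spark", id));
        System.out.println(ownedByUs("nifi-livy-" + id, id)); // true
        System.out.println(ownedByUs("someone-elses-session", id)); // false
    }
}
```

This would also give the shutdown hook a safe target list: on disable, the service could delete exactly the sessions carrying its own tag and nothing else.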